Data Analytics – Options for Time Series Store

database, nosql, real-time

I'm building an Internet of Things solution that has a number of real-time feeds. I need to make a couple of design decisions with regard to the data store and reporting, and I am looking for input.

I have two deployment models. In the local model, a small Linux server captures multiple streams, usually a new data point each second, and the record length is small: basically a fixed-format message of about 20 bytes. The other model is cloud based, where the transaction rate will be much higher, up to 1,000 data points a second.

Concentrating on the local model first, my prototype uses SQLite under MONO and works fine in a traditional RDBMS manner. However, I'm aware that an RDBMS isn't optimal for time-series systems, and possibly a NoSQL solution might be better.

I'm having a re-think on the SQL approach due to my real-time analytics requirement. I want to be able to easily report SUM, COUNT, AVERAGE, MEAN, SD, MIN, and MAX on potentially each of the time-series datasets, and basically have the system recalculate these for each new data point. Although I haven't done any performance testing, I can see that with a year or more of data the dataset is going to be pretty large (20 bytes per second over a year is roughly 630 MB, and I could have up to 30 different time-series datasets), which won't be great for real-time calculation of averages etc. in normal SQL. I also want to keep the size/cost of the server low, so I won't be able to add gobs of CPU or memory, but disk space should be OK.

The other option I have is that my system can store the deltas (changes) in the real-time feeds instead of the regular data points, which will save about 80% of the disk space. However, I don't know of an easy way to do set calculations like running averages when storing deltas (as far as I know it isn't possible with standard SQL statements, but I could be wrong).

Given my use case above I have 3 questions:

1) Should I seriously consider a NoSQL data store? Do you have any suggestions on the advantages a NoSQL store would have for this model?

2) How can I best do real-time set calculations like averages without a lot of complexity or hardware? My platform options are MONO or NODE, on Linux or Windows.

3) How do I do efficient real-time set calculations (especially averages, which will make up the majority of calculations) on either SQLite or a NoSQL DB when I'm storing deltas?

Best Answer

NoSQL

For your raw transactions, if the incoming data is not required to follow a specific format, NoSQL may be a good way to store it. If the data can easily be stored in a relational model (tables and columns), there are significant speed advantages to using a relational database.

How to do efficient calculations

Aggregate based on a time period

One of the fastest things you can do is to aggregate your data over a time period, storing the results.

Let's assume that your real time feeds are by the minute. That would be 1,440 data points per day, 10,080 per week, about 43,200 per month, or 525,600 per year. Now, cut that down to one record per day. That is a savings of 525,235 records per year, and over half a million rows that you don't have to aggregate every time you need to perform a "real time" calculation. You will have to determine the smallest time period that gives a meaningful aggregation without being excessively large or small.
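For a rough illustration, here is a minimal sketch of a per-day rollup in TypeScript (NODE being one of your platform options). All type and field names here are assumptions for the example, not anything from your system:

```typescript
// Fold timestamped readings into per-day aggregate buckets.
// Each reading only touches its own day's bucket, so the same loop
// body can run incrementally as new readings arrive.
interface Reading {
  timestamp: Date;
  value: number;
}

interface DailyRollup {
  day: string; // "YYYY-MM-DD" (UTC)
  count: number;
  sum: number;
  min: number;
  max: number;
  mean: number;
}

function rollupByDay(readings: Reading[]): Map<string, DailyRollup> {
  const days = new Map<string, DailyRollup>();
  for (const r of readings) {
    const day = r.timestamp.toISOString().slice(0, 10); // UTC day bucket
    const agg = days.get(day) ?? {
      day,
      count: 0,
      sum: 0,
      min: Number.POSITIVE_INFINITY,
      max: Number.NEGATIVE_INFINITY,
      mean: 0,
    };
    agg.count += 1;
    agg.sum += r.value;
    agg.min = Math.min(agg.min, r.value);
    agg.max = Math.max(agg.max, r.value);
    agg.mean = agg.sum / agg.count;
    days.set(day, agg);
  }
  return days;
}
```

Once a day is closed out, its rollup row is final and never needs to be recomputed; queries over weeks or months then touch at most a few hundred rows instead of millions.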

A concrete example will help to illustrate this principle: stock market trades by ticker symbol. Every trade comes across the wire with the ticker symbol, the time of the trade, price per share, and quantity of shares traded. There may be other data in the feed, but this will suffice for the example. Knowing the high (max), low (min), and average (mean) of all the trades by day will show certain types of trends. It is not necessary to see the individual transactions when you are looking for a trend that is days, months, or even years in the making. Once you have aggregated that data for the previous day's raw transactions, you don't ever need to recalculate those items again. Store the aggregated data in another table. Then if you want to determine an average (mean) price for a specific ticker over a period of time (week, month, quarter, etc.), it is a fairly simple SQL statement with a date range.
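As a sketch of what that date-range query could look like, here is one version using the better-sqlite3 package from NODE; the `daily_rollup` table and its columns are hypothetical names for this example. Note that weighting each day's stored mean by its trade count keeps the result equal to the mean over the raw transactions:

```typescript
import Database from "better-sqlite3";

const db = new Database("trades.db");

// Count-weighted mean price for one ticker over a date range,
// computed from the pre-aggregated daily table.
const row = db
  .prepare(
    `SELECT SUM(avg_price * trade_count) / SUM(trade_count) AS mean_price
       FROM daily_rollup
      WHERE ticker = ?
        AND trade_date BETWEEN ? AND ?`
  )
  .get("ACME", "2023-01-01", "2023-03-31") as { mean_price: number };

console.log(row.mean_price);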

Weighted aggregation

Another very fast method you can use is commonly known as a weighted average.

Simply put, this is a running aggregation: you take your current aggregated value multiplied by its weight (often a total quantity), add the new value multiplied by its weight, and then divide by the sum of the two weights. This is easier than it sounds.

If you want an average price per share for the day (going back to the stock market example above), you can average the price across transactions: sum(price) / number of transactions. That is one method of performing the averaging.

For the weighted running average to come out equal to that standard average, you have to weight the current average by the number of transactions that have already been included.

The weighted average method has you keep a running average of the price along with the number of transactions. When a new transaction comes in, the previous average price is multiplied by the number of transactions, the new transaction price is added, and the total is divided by the number of transactions + 1. If you had 10 transactions at an average price of $20 and a new trade comes in at $24, your running average would be $20.36: ((10 × $20) + $24) / 11. When the next trade comes in at $25, the running average becomes $20.75: ((11 × $20.36) + $25) / 12.
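Here is a minimal sketch of that running average in TypeScript; the class and method names are illustrative, not from anything above:

```typescript
// Maintains a running mean without storing the individual values.
class RunningAverage {
  private count = 0;
  private mean = 0;

  // Fold one new observation into the running mean:
  // newMean = (oldMean * oldCount + value) / (oldCount + 1)
  add(value: number): number {
    this.mean = (this.mean * this.count + value) / (this.count + 1);
    this.count += 1;
    return this.mean;
  }
}

// Reproducing the example: ten trades averaging $20, then $24, then $25.
const avg = new RunningAverage();
for (let i = 0; i < 10; i++) avg.add(20);
console.log(avg.add(24).toFixed(2)); // "20.36"
console.log(avg.add(25).toFixed(2)); // "20.75"
```

The state per series is just two numbers (the mean and the count), so this costs a constant amount of memory and work per incoming data point, regardless of how much history has accumulated.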
