Database – How to efficiently store big time series data

database | performance | python

I need to store, and be able to query, a very large amount of time series data.

Properties of the data are as follows:

  • number of series : around 12,000 (twelve thousand)
  • number of data points, globally : around 500,000,000 per month (five hundred million)
  • mixed value types: the majority of data points are floating point values, the rest are strings
  • sampling period : variable between series as well as within a series
  • timestamps : millisecond precision
  • data retention period : several years, without decay or downsampling
  • data archives need to be built in near realtime, but a reasonable delay (~1 hour) is acceptable
  • past data can be rebuilt if needed, but at a high cost
  • sometimes, but quite rarely, some past data needs to be updated

Properties of envisioned queries:

  • most of the queries will be timestamp-based range queries, spanning anywhere from one day to several months or years; 90%+ will target the most recent data

Other requirements:

  • the solution must be free as in free beer and preferably open source

My initial thought was to use PyTables / Pandas with HDF5 files as the storage backend instead of an SQL database.
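
For illustration, this is roughly what I have in mind: one HDF5 file per month, one node per series, written with pandas on top of PyTables. The file and series names below are just placeholders.

    import numpy as np
    import pandas as pd

    # one month of irregularly sampled float data for a single (made-up) series
    offsets_ms = np.sort(np.random.randint(0, 30 * 24 * 3600 * 1000,
                                            size=100000, dtype="int64"))
    index = pd.to_datetime(offsets_ms, unit="ms", origin="2013-01-01")
    df = pd.DataFrame({"value": np.random.randn(len(index))}, index=index)

    # append to the monthly file; append() writes in 'table' format,
    # which keeps the node queryable on disk (requires PyTables installed)
    with pd.HDFStore("tsdata_2013-01.h5", complevel=9, complib="blosc") as store:
        store.append("series/param_0042", df)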

Questions :

  1. Assuming PyTables / Pandas is the "best" route, would it be better to split the data into several HDF files, each one spanning a given period of time, or put everything into a single file that would then become huge?

  2. Should I prefer the fixed or the table format? To me, the fixed format looks OK if I keep one HDF file per month, as a whole series would then probably fit in RAM and I could slice it in memory without needing a table-format index. Am I correct?
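
To make question 2 concrete: with the table format I believe I could do an on-disk range selection like the one sketched below, whereas with the fixed format I would have to load the whole node and slice it in memory. File and series names are the same placeholders as above.

    import pandas as pd

    # table format: PyTables filters on the DatetimeIndex on disk, so only
    # the requested range is read into RAM
    day = pd.read_hdf("tsdata_2013-01.h5", "series/param_0042",
                      where='index >= "2013-01-15" & index < "2013-01-16"')

    # fixed format: the whole node must be loaded first, then sliced in memory
    # whole = pd.read_hdf("tsdata_2013-01_fixed.h5", "series/param_0042")
    # day = whole["2013-01-15":"2013-01-16"]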

And if that's not the best approach, how should I structure this data store, or what technologies should I be considering? I'm not the first to tackle storing large sets of time series data; what is the general approach to this challenge?


Other approaches I have considered :

  • array databases: they are a superb fit for time series with a constant sampling period, because you then only need to store the start time, the end time and the sampling period once, plus the values themselves in the array, and indexing is trivial. But with variable sampling periods within the series themselves, I need to keep an explicit timestamp->value relation, which in my view is not such a good fit for an array DBMS (see the small sketch after this list).
  • a standard SQL database with timestamp, paramID, value as columns: simple, but by its nature it requires a lot of disk I/O for any query
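
To illustrate the sampling-period point, here is a toy comparison of the two layouts (made-up numbers, not tied to any particular DBMS):

    import numpy as np

    # regularly sampled series: timestamps are implicit, the store only
    # needs (start, period, values)
    start_ms, period_ms = 1356998400000, 1000    # epoch ms, 1 s sampling
    values = np.random.randn(86400)              # one day of values
    timestamps = start_ms + period_ms * np.arange(len(values))  # rebuilt on demand

    # irregularly sampled series: every value needs its own stored timestamp,
    # i.e. an explicit timestamp -> value mapping (essentially the same shape
    # as the timestamp / paramID / value row layout above)
    deltas_ms = np.random.randint(1, 5000, size=86400, dtype="int64")
    irregular_timestamps = start_ms + np.cumsum(deltas_ms)
    irregular_values = np.random.randn(86400)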

Best Answer

You might want to take a look at carbon and whisper, part of the graphite project. Carbon can handle very large amounts of time series data. Though, now that I read the docs (it's been a few years since I've used it), it only handles numerical data. You said you also have string data, so you might not find this useful. Still, you might be able to glean some wisdom about how they manage to process large amounts of data quickly.

To give you an idea of how well it scales, when graphite was first put into production at Orbitz, it was handling 160,000 metrics per minute.
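
If you do end up experimenting with carbon, feeding it data is simple. If I remember the plaintext protocol correctly, something along these lines should work against a default install; the host, port and metric name below are placeholders, and again this only covers numeric values.

    import socket
    import time

    # Carbon's plaintext protocol: one "<metric.path> <value> <timestamp>\n"
    # line per data point, sent over TCP (port 2003 by default)
    CARBON_HOST, CARBON_PORT = "localhost", 2003

    def send_metric(path, value, timestamp=None):
        ts = int(timestamp if timestamp is not None else time.time())
        line = "%s %f %d\n" % (path, value, ts)
        with socket.create_connection((CARBON_HOST, CARBON_PORT)) as sock:
            sock.sendall(line.encode("ascii"))

    send_metric("sensors.param_0042.temperature", 21.7)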
