Database – Efficiently storing time series at multiple resolutions

database, performance

Introduction and context

I am refactoring a system that stores energy data (gas/electricity consumption, electricity delivery). The data comes from a sensor that posts absolute readings every 15 minutes, in a schema that is subject to change in the future (read: requires document storage).

Current situation

The raw data is stored in a MongoDB database. A background process reads this data and converts it to structured delta data, which is stored in a MySQL database at different resolutions.

MongoDB schema:
| device_id (uint) | timestamp (ISODate) | data (unspecified object) |

MySQL schema: (multiple tables: hour, day, month, year)
| device_id (uint) | timestamp (datetime) | gas_consumption | electricity_consumption | ... |

If the MongoDB data field changes its definition, the background process will detect the version of the data and process it accordingly. The MySQL database may be recreated from the MongoDB data, possibly with a new schema (should the future require it).
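
To make that conversion step concrete, here is a minimal Node.js sketch of how absolute readings could be turned into hourly deltas. The field names (timestamp, data.gas, data.electricity) are assumptions for illustration, not the actual schema.

// Sketch only: turns absolute 15-minute readings for ONE device (sorted by time)
// into hourly delta buckets. Field names are assumptions, not the real schema.
function toHourlyDeltas(readings) {
  const buckets = new Map(); // hour ISO string -> accumulated deltas
  for (let i = 1; i < readings.length; i++) {
    const prev = readings[i - 1];
    const curr = readings[i];
    const hour = new Date(curr.timestamp);
    hour.setUTCMinutes(0, 0, 0); // truncate to the start of the hour
    const key = hour.toISOString();
    const bucket = buckets.get(key) || { timestamp: hour, gas: 0, electricity: 0 };
    bucket.gas += curr.data.gas - prev.data.gas;                     // delta of the absolute counter
    bucket.electricity += curr.data.electricity - prev.data.electricity;
    buckets.set(key, bucket);
  }
  return [...buckets.values()];
}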

Envisioned situation

In the new situation, I'd like to use a time series database (as this is suggested by lots of external parties). There are two features that I'd like to implement on an atomic (database) level:

  • Gap filling for missing data. It is possible for the quarter-hourly data to be missing a couple of points. Since the absolute values are known, the missing points should be filled in by linear interpolation (see the sketch after this list).
  • Storing different resolutions. The data will be queried at multiple resolutions (hour to year), and all of these queries should be fast.
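
As an illustration of the gap-filling requirement, here is a minimal sketch of linear interpolation between two known absolute readings; the timestamps and field names are made up for the example.

// Sketch only: fills missing quarter-hour points between two known absolute readings
// by linear interpolation. Field names are illustrative.
function fillGap(before, after, stepMs = 15 * 60 * 1000) {
  const filled = [];
  const span = after.timestamp - before.timestamp;
  for (let t = before.timestamp + stepMs; t < after.timestamp; t += stepMs) {
    const fraction = (t - before.timestamp) / span;
    filled.push({
      timestamp: t,
      // Absolute counters only ever increase, so interpolating linearly between
      // the surrounding known values distributes the missing consumption evenly.
      gas: before.gas + fraction * (after.gas - before.gas),
      interpolated: true,
    });
  }
  return filled;
}

// Example: readings at 10:00 and 11:00, so 10:15 / 10:30 / 10:45 are missing.
const gap = fillGap(
  { timestamp: Date.parse('2017-07-01T10:00:00Z'), gas: 1200.0 },
  { timestamp: Date.parse('2017-07-01T11:00:00Z'), gas: 1200.8 }
);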

Requirements:

  • Should scale up to one million devices (data input).
  • Should scale up to 50k concurrent users (data queries).
  • Aggregated data should be available as soon as possible (a delay of up to 1 hour is acceptable).
  • Data may not deteriorate (the full history should remain available at the highest resolution).
  • Old data may be updated (at the highest resolution, and that change should be reflected in the lower resolutions). This is very uncommon, but should be possible. (If this turns out to be a big issue, the requirement might be handled in another way, but that is not the preferred solution.)
  • Free as in free beer.
  • The data has to be accessible from NodeJS, since all code is written in NodeJS.
  • It would be preferable if the latest data point (the current year, for example) were filled with the sum of the previous tiers before the period is over; otherwise this has to be requested from the quarter-hour table. Example: on July 1st 2017 the data point for 2017 should contain the sum of the year up to that date, along with an indication of the elapsed time. (This is a nice to have; a sketch follows this list.)
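
To show what that nice-to-have could look like, here is a rough Node.js sketch that sums the already-finished monthly points into a provisional data point for the running year; the structure and field names are assumptions.

// Sketch only: builds a provisional data point for the current year from the
// monthly points that exist so far. Field names are illustrative.
function provisionalYearPoint(monthlyPoints, now = new Date()) {
  const year = now.getUTCFullYear();
  const thisYear = monthlyPoints.filter(p => p.timestamp.getUTCFullYear() === year);
  const startOfYear = Date.UTC(year, 0, 1);
  const msInYear = Date.UTC(year + 1, 0, 1) - startOfYear;
  return {
    timestamp: new Date(startOfYear),
    gas_consumption: thisYear.reduce((sum, p) => sum + p.gas_consumption, 0),
    // Fraction of the year that has elapsed, e.g. roughly 0.5 on July 1st.
    elapsed: (now.getTime() - startOfYear) / msInYear,
    provisional: true,
  };
}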

The typical query looks like this (pseudo-code):
SELECT <fields> BETWEEN <start-date> AND <end-date> WITH RESOLUTION <time-resolution>.
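
In the current MySQL layout, that pseudo-query would map onto picking the table for the requested resolution. A minimal Node.js sketch using the mysql2 package; the table and column names are assumptions based on the schema above.

// Sketch only: maps the pseudo-query onto the per-resolution MySQL tables.
const mysql = require('mysql2/promise');

const TABLES = { hour: 'data_hour', day: 'data_day', month: 'data_month', year: 'data_year' };

async function queryRange(conn, deviceId, resolution, start, end) {
  const table = TABLES[resolution];
  if (!table) throw new Error(`unsupported resolution: ${resolution}`);
  const [rows] = await conn.execute(
    `SELECT timestamp, gas_consumption, electricity_consumption
       FROM ${table}
      WHERE device_id = ? AND timestamp BETWEEN ? AND ?
      ORDER BY timestamp`,
    [deviceId, start, end]
  );
  return rows;
}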

Current considerations

I've looked at some databases that support time series (InfluxDB, MongoDB, Graphite, RRDtool, Redis, OpenTSDB, …) but I can't find one that natively supports the different resolutions and/or the gap filling.

MongoDB might be an option, but its writes are fire-and-forget unless an acknowledged write concern is used, which means that missing data might not be detected. For the raw data this is not an issue, as long as it doesn't happen very often. For the processed data it is a big issue, because the rest of the application assumes that the data in those tables is sequential, even in a very uncommon edge case.
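
For what it's worth, whether a MongoDB write is acknowledged depends on the write concern. A minimal sketch with the official mongodb Node.js driver (option shape as in driver v4+; database and collection names are assumptions):

// Sketch only: an acknowledged insert so a failed write of processed data is
// reported back to the application instead of silently dropped.
const { MongoClient } = require('mongodb');

async function storeProcessed(uri, point) {
  const client = new MongoClient(uri);
  await client.connect();
  try {
    await client
      .db('energy')                       // database/collection names are assumptions
      .collection('processed')
      .insertOne(point, { writeConcern: { w: 'majority' } }); // wait for acknowledgement
  } finally {
    await client.close();
  }
}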

MySQL could also be an option (i.e. keeping the initial implementation), but I feel like time series databases are optimized for time series for a reason.

Question

Is the current situation a good approach? And if it isn't, what would be a better one? I feel like my use case is not that unique in the world of storing time series, but I can't seem to find any resource online that can give me a push in the right direction.

Best Answer

I worked on a similar application for the California Energy Commission (UT3, or Universal Translator 3). A couple of points:

  • The "filling in" algorithm may not be correct mathematically. The data in question is typically average values. The correct way to resample in time, average value data, is to use a time weighted average. It is a type of integration.
  • On the other hand, you may have some instantaneous type data, that needs a different type of numerical integration. I suggest you study the mathematical properties needed more closely.
  • The best performance will come from using a custom database that stores only the original, non-resampled data, in a binary form. A "custom database" can consist of a set of read-only files, each of which contains a data stream. A time-series specific database is probably fast enough.
  • The re-sampling algorithms are fast enough that they can be performed in real-time on the client.
  • If the end user sampling frequency is much lower than the original data, of course it makes sense to cache the resampled streams on the server. Access to the original data makes client based designs (like Universal Translator) more appealing for some uses.
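
To make the time-weighted-average point concrete, here is a small Node.js sketch of resampling average-value data; the input shape (samples with numeric timestamp and value) is an assumption.

// Sketch only: time-weighted average of piecewise-constant average-value data
// over [windowStart, windowEnd). Each sample is assumed to hold the average
// value from its timestamp until the next sample's timestamp.
function timeWeightedAverage(samples, windowStart, windowEnd) {
  let weightedSum = 0;
  let coveredMs = 0;
  for (let i = 0; i < samples.length; i++) {
    const from = Math.max(samples[i].timestamp, windowStart);
    const to = Math.min(i + 1 < samples.length ? samples[i + 1].timestamp : windowEnd, windowEnd);
    if (to <= from) continue;
    weightedSum += samples[i].value * (to - from); // integrate the value over the overlap
    coveredMs += to - from;
  }
  return coveredMs > 0 ? weightedSum / coveredMs : null;
}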