Introduction and context
I am refactoring a system that stores energy data (gas/electricity consumption, electricity delivery). Data comes from a sensor that posts the absolute data every 15 minutes in a schema that is subject to change in the future (read: requires document storage).
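To make the shape concrete, here is a sketch of what one raw reading might look like. Only `device_id`, `timestamp`, and `data` are given by the schema below; the field names inside `data` (including `schema_version`) are illustrative assumptions:

```javascript
// Hypothetical shape of one raw sensor post (the contents of `data`
// are assumptions; only device_id / timestamp / data are fixed).
const rawReading = {
  device_id: 42,
  timestamp: new Date('2017-07-01T12:15:00Z'),
  data: {
    schema_version: 1,               // lets a processor pick the right parser
    gas_consumption: 1234.5,         // absolute meter reading (m³)
    electricity_consumption: 6789.0, // absolute meter reading (kWh)
  },
};
```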
Current situation
The raw data is stored in a MongoDB database. A background process reads this data, and converts it to structured delta data which is stored in a MySQL database, in different resolutions.
MongoDB schema:
| device_id (uint) | timestamp (ISODate) | data (unspecified object) |
MySQL schema (multiple tables: `hour`, `day`, `month`, `year`):
| device_id (uint) | timestamp (datetime) | gas_consumption | electricity_consumption | ... |
If the definition of the MongoDB `data` field changes, the background process detects the version of the data and processes it accordingly. The MySQL database may be recreated from the MongoDB data, possibly with a new schema, should the future require it.
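The absolute-to-delta conversion the background process performs can be sketched as follows. The function name and field names are assumptions; the input is assumed to be the readings of a single device, sorted by timestamp:

```javascript
// Sketch: turn absolute meter readings into per-interval deltas.
// `readings` are assumed sorted by timestamp for one device.
function toDeltas(readings) {
  const deltas = [];
  for (let i = 1; i < readings.length; i++) {
    deltas.push({
      timestamp: readings[i].timestamp,
      // consumption in this interval = difference of absolute readings
      gas_consumption: readings[i].gas - readings[i - 1].gas,
    });
  }
  return deltas;
}
```

Lower resolutions (day, month, year) would then be plain sums of these deltas, which is what makes rebuilding MySQL from the raw MongoDB data possible.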
Envisioned situation
In the new situation, I'd like to use a time series database (as this is suggested by lots of external parties). There are two features that I'd like to implement on an atomic (database) level:
- Gap filling for missing data. It is possible for the quarter-hour data to be missing a couple of points. Since the absolute values are known, the missing points should be filled in by linear interpolation between the surrounding readings.
- Storing different resolutions. The data will be queried over multiple resolutions (hour to year), and all of these queries should be fast.
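The gap-filling feature above can be sketched in a few lines; this is a minimal illustration of the linear interpolation, not a database-level implementation (the function name and the `interpolated` flag are assumptions):

```javascript
// Sketch: linearly interpolate missing quarter-hour absolute values
// between two known readings, 15 minutes apart (in ms).
const INTERVAL = 15 * 60 * 1000;

function fillGap(prev, next) {
  const filled = [];
  const steps = Math.round((next.timestamp - prev.timestamp) / INTERVAL);
  for (let i = 1; i < steps; i++) {
    const frac = i / steps;
    filled.push({
      timestamp: prev.timestamp + i * INTERVAL,
      value: prev.value + frac * (next.value - prev.value),
      interpolated: true, // mark synthetic points
    });
  }
  return filled;
}
```

A database that supports this natively would run the same arithmetic at query or ingest time instead of in application code.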
Requirements:
- Should scale up to one million devices (data input).
- Should scale up to 50k concurrent users (data queries).
- Aggregated data should be available as soon as possible (a delay of up to 1 hour is acceptable).
- Data may not deteriorate: the full history should remain available at the highest resolution.
- Old data may be updated at the highest resolution, and that change should be reflected in the lower resolutions. This is very uncommon, but should be possible. (If this turns out to be a big issue, this requirement might be handled in another way, but that is not the preferred solution.)
- Free as in free beer.
- The data has to be accessible from NodeJS, since all code is written in NodeJS.
- It would be preferable if the latest data point (e.g. the current year) were filled with the sum of the lower tiers before the period is over; otherwise that sum has to be requested from the quarter-hour table. Example: on July 1st 2017, the data point for 2017 should contain the sum of the year up to that date, along with an indication of the elapsed time. (This is a nice to have.)
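The last, nice-to-have requirement can be sketched like this: a year-level data point carrying the year-to-date sum plus an elapsed-time indicator. Function and field names are illustrative assumptions:

```javascript
// Sketch: a partial year-level point that carries the sum so far
// and how much of the year has elapsed (0..1).
function yearToDate(quarters, now, yearStart, yearEnd) {
  const sum = quarters.reduce((acc, q) => acc + q.value, 0);
  return {
    value: sum,
    elapsed: (now - yearStart) / (yearEnd - yearStart),
    partial: now < yearEnd, // true while the year is still running
  };
}
```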
The typical query looks like this (pseudo-code):
SELECT <fields> BETWEEN <start-date> AND <end-date> WITH RESOLUTION <time-resolution>
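In the current MySQL layout, that pseudo-query maps onto one table per resolution. A minimal sketch of the mapping (the function is hypothetical; the `?` placeholders follow the mysql2-style parameterized query convention):

```javascript
// Sketch: map the pseudo-query onto the per-resolution MySQL tables.
const TABLES = { hour: 'hour', day: 'day', month: 'month', year: 'year' };

function buildQuery(fields, resolution) {
  const table = TABLES[resolution];
  if (!table) throw new Error(`unknown resolution: ${resolution}`);
  return (
    `SELECT ${fields.join(', ')} FROM \`${table}\`` +
    ' WHERE device_id = ? AND timestamp BETWEEN ? AND ?'
  );
}
```

A database with native resolution support would accept the resolution as a query parameter instead of requiring a table per tier.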
Current considerations
I've looked at some databases that support time series (InfluxDB, MongoDB, Graphite, RRDtool, Redis, OpenTSDB, …) but I can't find one that natively supports the different resolutions and/or the gap filling.
MongoDB might be an option, but the writes are fire-and-forget, which means that missing data might not be detected. For the raw data this is not an issue, as long as it does not happen very often. For the processed data it is a big issue, because the rest of the application assumes that the data in those tables is sequential, even in very uncommon edge cases.
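One way to guard against that assumption being silently violated is a sequentiality check over the processed rows, so a dropped write is detected rather than assumed away. A minimal sketch (function name is an assumption):

```javascript
// Sketch: verify processed timestamps are contiguous at a fixed
// interval; returns the gaps instead of assuming there are none.
function findGaps(timestamps, intervalMs) {
  const gaps = [];
  for (let i = 1; i < timestamps.length; i++) {
    if (timestamps[i] - timestamps[i - 1] !== intervalMs) {
      gaps.push({ after: timestamps[i - 1], before: timestamps[i] });
    }
  }
  return gaps;
}
```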
MySQL could also be an option (i.e. keeping the initial implementation), but I feel that time series databases are optimized for time series data for a reason.
Question
Is the current situation a good approach? And if it isn't, what would be a better one? I feel like my use case is not that unique in the world of storing time series, but I can't seem to find any resource online that can give me a push in the right direction.
Best Answer
I worked on a similar application for the California Energy Commission (UT3, or Universal Translator 3). A couple of points: