Database vs Flat files for rarely accessed data

Tags: database, MySQL

We are receiving real-time data from over 1,000 sensors, each of which sends a data point every 10 seconds on average, amounting to about 1 million rows of data every day. Our system lets users select any sensor and a date range and download the data points as an Excel file.

We have seen that our users are mostly interested in data that is less than 30 days old; data older than that has most probably already been downloaded. Only about 1% of our retrieval requests are for data more than 30 days old. However, we cannot say that this data is totally useless, because our users sometimes want to download data that is more than a year old. Deleting old data is not an option.

Currently we are using a MySQL database to store the data, and everything goes into a single table. The table now has over 60 million rows. We use SSDs and have the right indexes in place, so data retrieval is still reasonably fast.

An example query we use to select one data point per minute is:

SELECT
    data_value AS value,
    param_id,
    data_timestamp AS ts
FROM tbl_data_log
WHERE param_id IN (?)
    AND data_timestamp >= ? AND data_timestamp <= ?
GROUP BY DATE(data_timestamp), HOUR(data_timestamp),
    MINUTE(data_timestamp),
    param_id
ORDER BY data_timestamp ASC

Currently, this query takes less than 5 seconds to retrieve data more than 30 days old for a specific sensor.

As more data is stored in this table, it is going to get bigger, perhaps up to 2 billion rows within the next year (we are also adding more sensors every day). I do not know how the query will perform at that point. To me, storing all this data in a MySQL database doesn't seem right, because it is accessed very rarely, and keeping data that is more than 4 months old indexed seems unnecessary.

One approach I thought of is to keep only the last 30 days of data in MySQL and move older data to flat files, with a folder structure like /old_data/%YEAR%/%MONTH%/%DATE%/%PARAM_ID%.dat. This way our table size stays manageable, while the old data is still addressable through the directory structure itself.
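For illustration, one sensor-day could be exported into that layout with MySQL's SELECT ... INTO OUTFILE; the date, sensor id, and CSV field format below are just examples, and the target directory has to exist and be writable by the MySQL server process:

    -- Hypothetical export of sensor 42's data for one day into the
    -- proposed folder layout (all concrete values are illustrative).
    SELECT data_timestamp, data_value
    INTO OUTFILE '/old_data/2016/01/15/42.dat'
        FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
    FROM tbl_data_log
    WHERE param_id = 42
        AND data_timestamp >= '2016-01-15 00:00:00'
        AND data_timestamp <  '2016-01-16 00:00:00'
    ORDER BY data_timestamp;
    -- Once the file has been written and verified, the same WHERE clause
    -- can drive a DELETE that removes the exported rows from the table.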

Will the current approach scale? Does moving old data to flat files help? Is storing all the data in a single table the right choice? Do we need to change the database engine itself? Please share your thoughts on this architecture. Thank you very much in advance!

Best Answer

Database engines are, in principle, designed to cope with huge amounts of data much faster than raw data files can, especially when you have to access the data in a non-sequential manner.

You say that you have all the right indexes on your table for optimized access, so data_timestamp is certainly indexed. However, I see that your query example uses:

GROUP BY DATE(data_timestamp), HOUR(data_timestamp),
    MINUTE(data_timestamp),
    param_id

This forces your database engine to convert the timestamp of every row matching the WHERE clause of the query to a date, which I suspect is very time consuming.

As I suppose that date and time are commonly used in your application, I'd suggest considering a little denormalization here to ease the database's job: precompute the date (a DATE column) and the time rounded to the minute (either a TIME column or possibly a SMALLINT between 0000 and 2359). That's 5 bytes of overhead per row. Then create an index on these columns to accelerate the GROUP BY clause.
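A minimal sketch of what this could look like; the new column and index names are of course only illustrative:

    -- Precomputed date and minute-of-day columns (names are illustrative).
    ALTER TABLE tbl_data_log
        ADD COLUMN data_date DATE,               -- 3 bytes per row
        ADD COLUMN data_hhmm SMALLINT UNSIGNED;  -- 2 bytes per row

    -- Backfill existing rows; on a 60-million-row table you would want
    -- to run this in bounded batches rather than as one statement.
    UPDATE tbl_data_log
    SET data_date = DATE(data_timestamp),
        data_hhmm = HOUR(data_timestamp) * 100 + MINUTE(data_timestamp);

    -- Index supporting GROUP BY data_date, data_hhmm, param_id.
    CREATE INDEX idx_date_minute_param
        ON tbl_data_log (data_date, data_hhmm, param_id);

New rows would fill the two columns at insert time, and the query's GROUP BY would then read GROUP BY data_date, data_hhmm, param_id, with no per-row function calls.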

If this is not sufficient, make sure the server is correctly dimensioned for its big-data workload, and check how well your DBMS ranks against other DBMSs in benchmarks.

As a workaround, you could also consider using two tables: an active table for the last 30 days, and a second table holding all the historical data older than that. A batch job would then periodically move expiring rows from one table to the other.
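Sketched in SQL, assuming a history table with the same structure as the active one (the history table name and the 30-day cutoff are illustrative):

    -- One-time setup: a history table cloned from the active one.
    CREATE TABLE tbl_data_log_history LIKE tbl_data_log;

    -- Nightly batch job: fix the cutoff once so the INSERT and the
    -- DELETE cover exactly the same rows, then copy and remove them.
    -- On large volumes, process bounded chunks per transaction.
    SET @cutoff = NOW() - INTERVAL 30 DAY;

    INSERT INTO tbl_data_log_history
        SELECT * FROM tbl_data_log
        WHERE data_timestamp < @cutoff;

    DELETE FROM tbl_data_log
    WHERE data_timestamp < @cutoff;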
