Best Time Series Database for High Cardinality

analytics, database, scalability

I want to track page views, click events, etc. in a time series database, but I'm running into scalability issues when retrieving aggregated groups of data with very high cardinality.


The problems I'm trying to solve are:

What are the top N referrers within a given time range?

How many views does each URL have within a given time range?

How many views do specific URLs have over all time?


The schema I have so far is:

timestamp – Time of the event

domain – The base URL for the record

uri – The unique resource. I'd like grouped counts of these (millions of possible values)

referrer – The HTTP referrer. I'd like grouped counts of these as well (millions of possible values)

event – The type of event
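
For concreteness, a single event under this schema might look like the record below (a hypothetical example; the values are made up):

```python
# uri and referrer are the high-cardinality fields: with millions of
# distinct values, storing them as indexed tags (as InfluxDB does)
# means millions of distinct series.
event = {
    "timestamp": "2016-05-12T14:03:21Z",
    "domain": "example.com",
    "uri": "/articles/12345",
    "referrer": "https://news.ycombinator.com/item?id=67890",
    "event": "page_view",
}
```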


So far I've tried InfluxDB, but ran into issues due to the sheer number of possible values for uri and referrer. Although I would only be scanning records within a small time range, grouping by millions of unique values makes this much harder. What other options do I have for storing this data that support both the write and query requirements?

Best Answer

Performing complex queries in real time on large time series data is not scalable. It puts an unreasonable load on your store and will not perform well regardless of the database used. The only time you should query your time series database directly is when you want to look at individual events (e.g. a table of events or a single log entry).

Instead, you should index your data in a way that is scalable to query. For example, you can aggregate views over a period of time (minutes, hours, days, weeks), or group them by some other dimension (referrer, event type, or domain), so that when you need to query the views of a given entity for a month, you're reading hundreds of rows, not millions.
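
As a minimal sketch of the idea, here is an in-memory stand-in for such an index, using hourly buckets keyed by uri (the names and the bucket size are illustrative assumptions):

```python
from collections import Counter
from datetime import datetime

def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)

# rollup[(hour, uri)] -> view count. In practice this lives in whatever
# store holds the index (e.g. a relational table keyed on bucket + uri).
rollup: Counter = Counter()

def index_event(ts: datetime, uri: str) -> None:
    """Fold one raw event into the pre-aggregated index."""
    rollup[(hour_bucket(ts), uri)] += 1

def views_for(uri: str, start: datetime, end: datetime) -> int:
    """A month of hourly buckets is at most ~744 rows, not millions of events."""
    return sum(
        count
        for (bucket, key), count in rollup.items()
        if key == uri and start <= bucket < end
    )
```
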

The index can be stored wherever you want (e.g. we stored ours in a relational database) because the load and data size should be much smaller. You can create the index by either a stream analytics pipeline, or periodically by a batch process. Using streaming analytics means your data will be (nearly) immediately available for querying, but it will likely be more complex to do right. Batch processing will have some latency before data is queryable, but is easier to implement (usually just a cron job that runs periodically and indexes the latest data). For a stream analytics framework, check out Apache Spark Streaming. For batch processing, Apache Spark is a popular choice.