The application I'm working on has a need to create a counter representing the number of times each piece of content has been viewed.
There are two goals here:
- Update the counter in real time
- Minimize load on MySQL
Currently, what we do is create a cache key in Redis for each piece of content, containing its view count. When a view event happens, we increment the count. If there is no value at the key when a read or write happens, we calculate the all-time view count from a separate data source (InfluxDB).
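For concreteness, here is a minimal sketch of that current flow, with plain dicts standing in for Redis and InfluxDB (the names `get_count` and `record_view` are illustrative, not from our codebase):

```python
# Stand-ins: in production these would be Redis and an InfluxDB query.
redis_cache = {}                      # Redis key -> view count
influx_totals = {"card:1": 100}       # all-time counts computed from Influx

def _ensure_cached(card_id):
    # Cache miss: fall back to the (expensive) all-time count from Influx.
    if card_id not in redis_cache:
        redis_cache[card_id] = influx_totals.get(card_id, 0)

def get_count(card_id):
    _ensure_cached(card_id)
    return redis_cache[card_id]

def record_view(card_id):
    _ensure_cached(card_id)
    redis_cache[card_id] += 1         # INCR in real Redis

record_view("card:1")
print(get_count("card:1"))            # -> 101
```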
The problem is that our existing approach won't be feasible anymore because of the way we're restructuring our InfluxDB data. It is no longer adequately performant to calculate the all-time view count for a card from Influx data, for reasons I won't get into here (feel free to ask).
Essentially, we no longer have a way to calculate the all-time view count 'from scratch'. We will need to rely on the existing counter values and only increment them (never completely recalculate them).
I have the following idea to do this:
- Calculate all-time views on each content and store in MySQL (this can be done once to seed the data)
- When a read/write happens for the first time, look it up in MySQL and store it in Redis
- Whenever a write happens, increment the count on Redis
- In a background job, once per hour or so, update the views count in the MySQL database using the data in Redis.
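The steps above can be sketched like this, again with dicts standing in for MySQL and Redis (names are illustrative):

```python
# Stand-ins: in production these would be a MySQL table and Redis.
mysql_counts = {"card:1": 500}        # seeded once from the all-time calculation
redis_cache = {}

def _ensure_cached(card_id):
    # First touch: seed the Redis counter from MySQL instead of Influx.
    if card_id not in redis_cache:
        redis_cache[card_id] = mysql_counts.get(card_id, 0)

def get_count(card_id):
    _ensure_cached(card_id)
    return redis_cache[card_id]

def record_view(card_id):
    _ensure_cached(card_id)
    redis_cache[card_id] += 1         # INCR in real Redis

def flush_to_mysql():
    # Hourly background job: copy the cached counters back to MySQL.
    for card_id, count in redis_cache.items():
        mysql_counts[card_id] = count

record_view("card:1")
flush_to_mysql()
print(mysql_counts["card:1"])         # -> 501
```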
This will produce at most one hour of data loss, in the worst case where Redis drops all the keys just before the background job runs.
Does this approach make sense? Is there a better way to do it?
Edit:
Now that I think about it a little more, the concept of a background job that updates all the cards is a little problematic. If we have millions of keys stored in the cache, how does the application know which ones had activity and should have their Redis counts copied to MySQL?
So I'm thinking there could be a special key in Redis which stores a queue of all the card ids that have had updates. Then the background job can take the unique card ids from this list and fetch only that data for the update.
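A sketch of that idea, using a Python `set` where production code might use a Redis SET (`SADD` on write, `SPOP`/`SMEMBERS` in the job); names are illustrative:

```python
# Stand-ins for MySQL, Redis counters, and a Redis SET of "dirty" card ids.
mysql_counts = {"card:1": 500, "card:2": 10}
redis_cache = dict(mysql_counts)
dirty_ids = set()

def record_view(card_id):
    redis_cache[card_id] = redis_cache.get(card_id, 0) + 1
    dirty_ids.add(card_id)            # a set deduplicates repeated ids for free

def flush_dirty():
    # Background job: only touch cards that actually changed.
    while dirty_ids:
        card_id = dirty_ids.pop()
        mysql_counts[card_id] = redis_cache[card_id]

record_view("card:1")
record_view("card:1")
flush_dirty()
print(mysql_counts)                   # card:1 -> 502, card:2 untouched
```

Using a set rather than a list means a hot card that gets thousands of views per hour still appears only once in the flush batch.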
Best Answer
One technique that can be helpful is to use probabilistic techniques instead of fixed schedules. Whenever a process updates the cache, it can roll the dice to determine whether that key (or a range of keys) should be synced. If you make the syncing probability dependent on the rate of writes to each key (like p=1/n for n recent modifications), that can help you maintain a consistent syncing rate across all keys. In contrast, a fixed schedule could lead to load spikes on your databases, or could be mismatched with the rate of change (such as syncing keys that have not changed). The difficult part of this solution is keeping track of the rate of modifications.
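A minimal sketch of the dice-roll idea, assuming the write counter lives alongside the cache (the name `maybe_sync` and the `rng` parameter are illustrative; the latter just makes the example deterministic):

```python
import random

# Each write syncs with probability 1/n, where n counts writes since the
# last sync: hot keys are not persisted on every write, while cold keys
# still sync soon after they change.
writes_since_sync = {}

def maybe_sync(card_id, rng=random.random):
    n = writes_since_sync.get(card_id, 0) + 1
    writes_since_sync[card_id] = n
    if rng() < 1.0 / n:
        writes_since_sync[card_id] = 0
        return True                   # caller persists this key to MySQL
    return False

# The very first write after a sync always syncs (probability 1/1 = 1.0).
print(maybe_sync("card:1", rng=lambda: 0.99))   # -> True
```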
You may also want to evict unused keys from the cache. You can perhaps subscribe to eviction events so that you can persist them before they are removed.
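As a sketch of the subscription idea: Redis can publish keyspace notifications for evictions, which a worker can listen to. Note that the notification carries only the key name, not the evicted value, so the counter must be persisted before eviction or be recoverable elsewhere. The flags below are an assumption to verify against your Redis version:

```
# Enable keyevent notifications for evicted keys (E = keyevent, e = evicted):
CONFIG SET notify-keyspace-events Ee

# A worker then subscribes to the per-database channel, e.g. for db 0:
PSUBSCRIBE __keyevent@0__:evicted
```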
It might also be worth considering whether trying to synchronize two databases is a sensible solution (in particular, whether your performance requirements warrant all this complexity). Alternatives:
Of course, aspects of these solutions can be combined as necessary. I'd consider a cache that is written to as in your question, but route all persistent writes through a queue that can coalesce the writes. E.g. a view event would update both the queue and the cache, but there would be no service that scans the cache for keys that need to be updated.
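A sketch of that coalescing queue, with a `deque` standing in for a real message queue (names are illustrative):

```python
from collections import Counter, deque

# Stand-ins: a MySQL table and a message queue of view events.
mysql_counts = {"card:1": 500}
write_queue = deque()

def on_view(card_id):
    write_queue.append(card_id)       # the Redis counter would also be
                                      # incremented here, as in the question

def drain_queue():
    # Coalesce: many queued events for the same card collapse into a single
    # batched UPDATE per card.
    batch = Counter()
    while write_queue:
        batch[write_queue.popleft()] += 1
    for card_id, n in batch.items():
        mysql_counts[card_id] = mysql_counts.get(card_id, 0) + n

on_view("card:1"); on_view("card:1"); on_view("card:1")
drain_queue()
print(mysql_counts["card:1"])         # -> 503
```

Because the queue records increments rather than absolute counts, this variant also avoids the "which keys are dirty?" scan from the edit above.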