Caching with in-memory dictionaries. Are we doing it all wrong?


This approach is pretty much the accepted way of doing things in our company.

A simple example

  • When a piece of data for a customer is requested from a service, we fetch all of that customer's data (the part relevant to the service), save it in an in-memory dictionary, and serve it from there on subsequent requests (we run singleton services).
  • Any update goes to the database first, then updates the in-memory dictionary.
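
Roughly, the pattern looks like the sketch below (simplified; Customer, ICustomerRepository and CustomerCache are placeholder names for this example, not our real types):

    using System.Collections.Concurrent;

    public record Customer(int Id, string Name);

    public interface ICustomerRepository
    {
        Customer Load(int id);
        void Save(Customer customer);
    }

    public class CustomerCache
    {
        private readonly ConcurrentDictionary<int, Customer> _cache = new();
        private readonly ICustomerRepository _repository;

        public CustomerCache(ICustomerRepository repository) => _repository = repository;

        // The first request hits the database; later requests are served from memory.
        public Customer Get(int customerId) =>
            _cache.GetOrAdd(customerId, id => _repository.Load(id));

        public void Update(Customer customer)
        {
            // Write to the database first, then refresh the in-memory copy.
            _repository.Save(customer);
            _cache[customer.Id] = customer;
        }
    }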

It all seems simple and harmless, but as we implement more complicated business rules the cache gets out of sync and we have to deal with hard-to-find bugs.

Sometimes we defer writing to the database and keep the new data only in the cache until then. There are cases where we store millions of rows in memory, because the table has many relations to other tables and we need to show aggregate data quickly.

All this cache handling is a big part of our codebase, and I sense this is not the right way to do it. All of this juggling adds so much noise to the code that it becomes hard to see the actual business logic. However, I don't think we can serve data in a reasonable amount of time if we have to hit the database on every request.

I am unhappy about the current situation but I don't have a better alternative.

  • My only idea would be to use NHibernate's second-level cache, but I have nearly no experience with it.
  • I know many companies use Redis or Memcached heavily to gain performance, but I have no idea how I would integrate them into our system.
  • I also don't know if they can perform better than in-memory data structures and queries.

Are there any alternative approaches that I should look into?

Best Answer

First, your last question: why Redis/Memcached?

No, they're not (usually) faster than simple in-process dictionaries. The advantage comes when you have several worker processes, or even many app-layer machines. In that case, instead of each process having its own small cache, they all share a single big (distributed) cache. With bigger caches, you get better hit ratios.

As you can see, the cache layer becomes a shared resource, much like the database, but (hopefully) faster.
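
To make this concrete, here is a minimal read-through sketch against a shared Redis instance using the StackExchange.Redis client. The key format, the five-minute expiry and the ICustomerStore type are placeholders for the example, not recommendations:

    using System;
    using StackExchange.Redis;

    public interface ICustomerStore      // stand-in for your database access layer
    {
        string LoadAsJson(int id);
    }

    public class SharedCustomerCache
    {
        private readonly IDatabase _redis;
        private readonly ICustomerStore _store;

        public SharedCustomerCache(ConnectionMultiplexer mux, ICustomerStore store)
        {
            _redis = mux.GetDatabase();
            _store = store;
        }

        public string GetCustomerJson(int customerId)
        {
            string key = $"customer:{customerId}";

            // Every process and machine talks to the same Redis server,
            // so they all see (and share) the same cached values.
            RedisValue cached = _redis.StringGet(key);
            if (cached.HasValue)
                return cached;

            string json = _store.LoadAsJson(customerId);
            _redis.StringSet(key, json, expiry: TimeSpan.FromMinutes(5));
            return json;
        }
    }

Any process that connects to the same server (ConnectionMultiplexer.Connect("your-redis-host:6379")) shares this cache, which is what turns it into the bigger, shared pool described above.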

Now, about the big part: how to avoid the mess?

It seems that your problem is keeping the cache consistent while at the same time decoupling it from the database. I see three pain points there:

  1. cache invalidation. This is just hard. Sometimes the easiest solution is to add a generation ID to every record and use it as part of the cache key. When the data is updated you get a new generation ID, so the next cache query misses, you go to the database, and you refresh the cache. Of course, the (now unused) old entry must have a sensible expiry time so it is eventually purged from the cache (there is a sketch of this after the list).

  2. writeback. You say you work on the cache and update the database later. This is dangerous; most architectures avoid that idea. A step in the right direction would be to mark every new or modified entry in the cache as 'dirty' so it can be flushed to the database by a decoupled process (see the second sketch after the list). A better idea might be to push the entry onto a message queue as soon as it's modified, effectively making the database write 'inline but async'. In the end, I think you should realize that this is not a valid use for a cache; it is a "staging area" that should be treated with a different architecture than a cache layer.

  3. interprocess synchronization: since your in-process cache is private to each process, any modification there isn't propagated to the other processes until it's flushed to the database. That might be correct under your app design (a kind of poor man's transaction isolation), but it might also have unintended results. A much more manageable architecture is a cache layer that is just a faster API to the database, with the same shared properties as the database and just as 'authoritative'. For that you need an out-of-process cache, like Memcached or Redis.
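
Here is a rough sketch of the generation-ID idea from point 1, with the generation counter kept in Redis itself. The key layout (customer:{id}:gen for the counter, customer:{id}:{gen}:{field} for the data) and the 30-minute expiry are assumptions made for the example, not the only way to do it:

    using System;
    using StackExchange.Redis;

    public class GenerationalCache
    {
        private readonly IDatabase _redis;

        public GenerationalCache(IDatabase redis) => _redis = redis;

        // Bumping the counter makes every existing entry for this customer unreachable.
        public void Invalidate(int customerId) =>
            _redis.StringIncrement($"customer:{customerId}:gen");

        public string TryGet(int customerId, string field)
        {
            long gen = CurrentGeneration(customerId);
            RedisValue value = _redis.StringGet($"customer:{customerId}:{gen}:{field}");
            return value.HasValue ? (string)value : null;   // null = cache miss
        }

        public void Put(int customerId, string field, string value)
        {
            long gen = CurrentGeneration(customerId);
            // The expiry is what eventually purges entries from abandoned generations.
            _redis.StringSet($"customer:{customerId}:{gen}:{field}", value,
                             expiry: TimeSpan.FromMinutes(30));
        }

        private long CurrentGeneration(int customerId)
        {
            RedisValue raw = _redis.StringGet($"customer:{customerId}:gen");
            return raw.HasValue ? (long)raw : 0;
        }
    }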
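
And for point 2, a minimal sketch of the 'mark dirty, flush from a decoupled process' idea; the in-process ConcurrentQueue only stands in for a real message queue so the example stays self-contained:

    using System;
    using System.Collections.Concurrent;

    public class WriteBehindBuffer<TEntity>
    {
        private readonly ConcurrentQueue<TEntity> _dirty = new();
        private readonly Action<TEntity> _saveToDatabase;

        public WriteBehindBuffer(Action<TEntity> saveToDatabase) =>
            _saveToDatabase = saveToDatabase;

        // Callers update the cache and enqueue the entity right away;
        // the actual database write happens later ("inline but async").
        public void MarkDirty(TEntity entity) => _dirty.Enqueue(entity);

        // Run this from a background worker. With a real message broker instead of
        // the in-process queue, other processes could also see the pending writes.
        public void Flush()
        {
            while (_dirty.TryDequeue(out var entity))
                _saveToDatabase(entity);
        }
    }

Note that, as said above, once you go down this road you are really building a staging area rather than a cache, so treat its durability and failure modes accordingly.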
