Caching strategies for entities and collections

We currently have an application framework in which we automatically cache both entities and collections of entities at the business layer (using .NET cache). So the method GetWidget(int id) checks the cache using a key Widget_Id_{0} before hitting the database, and the method GetWidgetsByStatusId(int statusId) checks the cache using Widgets_Collections_ByStatusId_{0}. If the objects are not in the cache they are retrieved from the database and added to the cache.
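For reference, the lookup described above can be sketched roughly as follows. This is a minimal illustration, not our actual framework code: the Dictionary stands in for the .NET cache, and WidgetService, the loadFromDatabase delegate and the DatabaseCalls counter are hypothetical names introduced only for this example.

```csharp
using System;
using System.Collections.Generic;

public class Widget
{
    public int Id { get; set; }
    public int StatusId { get; set; }
}

public class WidgetService
{
    // Stand-in for the .NET cache; a plain dictionary keeps the sketch self-contained.
    private readonly Dictionary<string, object> _cache = new Dictionary<string, object>();

    // Hypothetical data-access delegate supplied by the caller.
    private readonly Func<int, Widget> _loadFromDatabase;

    // Counts database loads so that cache hits are observable.
    public int DatabaseCalls { get; private set; }

    public WidgetService(Func<int, Widget> loadFromDatabase)
    {
        _loadFromDatabase = loadFromDatabase;
    }

    public Widget GetWidget(int id)
    {
        string key = string.Format("Widget_Id_{0}", id);

        // Check the cache before hitting the database.
        if (_cache.TryGetValue(key, out object cached))
        {
            return (Widget)cached;
        }

        // On a miss, load from the database and add the result to the cache.
        DatabaseCalls++;
        Widget widget = _loadFromDatabase(id);
        if (widget != null)
        {
            _cache[key] = widget;
        }
        return widget;
    }
}
```

GetWidgetsByStatusId follows the same shape, with the whole collection stored under Widgets_Collections_ByStatusId_{0}.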

This approach is quick for read scenarios and, as a blanket approach, quick for us to implement, but it requires large numbers of cache keys to be purged whenever CRUD operations are carried out on entities. As additional methods are added, more keys have to be purged on every write, so the benefits of caching diminish.
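To make the purge cost concrete, here is a rough sketch of the keys a single widget update has to remove when only GetWidget and GetWidgetsByStatusId exist. The WidgetCacheKeys class and the KeysToPurge helper are hypothetical, written only for this illustration.

```csharp
using System;
using System.Collections.Generic;

public class Widget
{
    public int Id { get; set; }
    public int StatusId { get; set; }
}

public static class WidgetCacheKeys
{
    // Lists every key that may hold a copy of the updated widget.
    // Each additional GetWidgetsBy... method adds at least one more entry here.
    public static List<string> KeysToPurge(Widget before, Widget after)
    {
        var keys = new List<string>
        {
            string.Format("Widget_Id_{0}", after.Id),
            string.Format("Widgets_Collections_ByStatusId_{0}", before.StatusId)
        };

        // A status change touches two cached collections: the old one and the new one.
        if (after.StatusId != before.StatusId)
        {
            keys.Add(string.Format("Widgets_Collections_ByStatusId_{0}", after.StatusId));
        }

        return keys;
    }
}
```

With only two read methods, an update that changes the status already purges three keys, and every cached collection is rebuilt in full on the next read.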

I'm interested in alternative approaches to handling caching of collections. I know that NHibernate caches a list of the identifiers in the collection rather than the actual entities. Is this an approach other people have tried – what are the pros and cons?

In particular I am looking for options that optimise performance and can be implemented automatically through boilerplate generated code (we have our own code generation tool). I know some people will say that caching needs to be done by hand each time to meet the needs of the specific situation but I am looking for something that will get us most of the way automatically.

Best Answer

I have experimented with different approaches to object caching, and I see advantages to an approach where collections are cached as references rather than actual objects.

An example:

User GetUser(int ID);
ICollection<User> GetRecentUsers(int amount);
ICollection<User> GetActiveUsers(int amount);
void Update(User user);

In this example, if all three "Get"-methods load, cache and return actual user data, the data will go stale whenever an update is made to an object contained in one of the collections.

Delegating cache responsibility

Flushing everything every time you do an update, however, will result in a pretty useless cache. So you need some way of knowing what has gone stale. Unfortunately, keeping track of this is no easy task.

Rather than trying to keep track of what data is located where, you could create a scheme where you delegate handling of data and keep the cache logic in a single place.

In the example above GetUser(int ID) could be the only method allowed to actually load and cache user objects.

GetRecentUsers() and GetActiveUsers() would then be responsible for loading only a list of user IDs and calling GetUser() to actually populate a collection of users before returning the result.

In this way, when you update a user object, you only have to invalidate exactly one cached object.

The list of IDs held by GetRecentUsers needs to be invalidated only when its membership changes, for example when a new user joins the site.
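A minimal sketch of this delegation, under some loud assumptions: the Dictionary stands in for the real cache, the _table field is a fake in-memory database, "recent" is taken to mean highest ID first, and the Insert method, the Name property and the cache key names are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class User
{
    public int ID { get; set; }
    public string Name { get; set; }
}

public class UserRepository
{
    private readonly Dictionary<string, object> _cache = new Dictionary<string, object>();

    // Hypothetical in-memory "database" table, keyed by user ID.
    private readonly Dictionary<int, User> _table = new Dictionary<int, User>();

    private static User Copy(User u)
    {
        return new User { ID = u.ID, Name = u.Name };
    }

    // The only method allowed to load and cache User objects.
    public User GetUser(int id)
    {
        string key = "User_Id_" + id;
        if (_cache.TryGetValue(key, out object cached)) return (User)cached;
        if (!_table.TryGetValue(id, out User row)) return null;

        User user = Copy(row);
        _cache[key] = user;
        return user;
    }

    // Caches a list of IDs only; the objects always come from GetUser.
    public ICollection<User> GetRecentUsers(int amount)
    {
        string key = "Users_RecentIds_" + amount;
        if (!_cache.TryGetValue(key, out object cached))
        {
            // Assumption: "recent" means highest ID first.
            cached = _table.Keys.OrderByDescending(id => id).Take(amount).ToList();
            _cache[key] = cached;
        }
        return ((List<int>)cached).Select(id => GetUser(id)).ToList();
    }

    public void Update(User user)
    {
        _table[user.ID] = Copy(user);
        // Exactly one cached object is invalidated; the ID lists stay valid.
        _cache.Remove("User_Id_" + user.ID);
    }

    public void Insert(User user)
    {
        _table[user.ID] = Copy(user);
        // A new user changes list membership, so the cached ID lists are dropped.
        List<string> listKeys = _cache.Keys.Where(k => k.StartsWith("Users_RecentIds_")).ToList();
        foreach (string key in listKeys)
        {
            _cache.Remove(key);
        }
    }
}
```

GetActiveUsers would follow the same shape with its own ID-list key. Note that Update never touches a collection key, which is the point of the scheme.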

This gives you a cleaner approach to keeping your cache tidy. It does, however, introduce its own set of problems.

Pitfalls

If you start with a cold cache and load the 10 most recent users, you will have an N+1 problem on your hands: one query to fetch the list of user IDs, and 10 additional queries to populate the cache.

This can be a big problem depending on your data and setup. You can, however, take measures against it by pre-warming the cache or by allowing for batch operations (which again introduces complexity).
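One way the batch option might look, as a sketch only: a GetUsers method that serves cached users directly and fetches all misses in a single call. BatchingUserCache, the loadMany delegate (imagine one "WHERE ID IN (...)" query behind it) and the Queries counter are hypothetical names for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class User
{
    public int ID { get; set; }
    public string Name { get; set; }
}

public class BatchingUserCache
{
    private readonly Dictionary<int, User> _cache = new Dictionary<int, User>();

    // Hypothetical batch loader: takes a set of IDs, returns the matching users.
    private readonly Func<IReadOnlyCollection<int>, IEnumerable<User>> _loadMany;

    // Counts round trips to the loader so the saving is observable.
    public int Queries { get; private set; }

    public BatchingUserCache(Func<IReadOnlyCollection<int>, IEnumerable<User>> loadMany)
    {
        _loadMany = loadMany;
    }

    public List<User> GetUsers(IEnumerable<int> ids)
    {
        List<int> wanted = ids.ToList();

        // Collect the IDs that are not cached yet.
        List<int> missing = wanted.Where(id => !_cache.ContainsKey(id)).Distinct().ToList();

        // Fetch all misses in one round trip instead of one query per ID.
        if (missing.Count > 0)
        {
            Queries++;
            foreach (User user in _loadMany(missing))
            {
                _cache[user.ID] = user;
            }
        }

        // Return users in the requested order, skipping IDs the loader did not find.
        return wanted.Where(id => _cache.ContainsKey(id)).Select(id => _cache[id]).ToList();
    }
}
```

On a cold cache this turns 1 + N queries into 2 (one for the ID list, one for the batch), at the cost of a second code path that also has to respect the per-user cache.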

There is no silver bullet. But I find the delegation approach useful when implemented with care.
