Best Ways to Understand How to Cache Domain Objects in Java


I've always done this wrong, and I'm sure a lot of others have too: hold a reference via a map, write through to the DB, and so on.

I need to do this right, and I just don't know how to go about it. I know how I want my objects to be cached, but I'm not sure how to achieve it. What complicates things is that I need to do this for a legacy system where the DB can change without notice to my application.

So in the context of a web application, let's say I have a WidgetService which has several methods:

Widget getWidget();
Collection<Widget> getAllWidgets();
Collection<Widget> getWidgetsByCategory(String categoryCode);
Collection<Widget> getWidgetsByContainer(Integer parentContainer);
Collection<Widget> getWidgetsByStatus(String status);

Given this, I see three options. I could cache by method signature, i.e. getWidgetsByCategory("AA") would have a single cache entry. I could cache widgets individually, which I believe would be difficult. Or a call to any method could first cache ALL widgets via getAllWidgets(), with getAllWidgets() also producing cache entries that match the keys for the other method invocations. For example, take the following untested theoretical code.

Collection<Widget> getAllWidgets() {
    Entity entity = cache.get("ALL_WIDGETS");
    Collection<Widget> res;
    if (entity == null) {
        // Cache miss: load everything and populate all keys
        res = loadCache();
    } else {
        res = (Collection<Widget>) entity.getValue();
    }
    return res;
}

Collection<Widget> loadCache() {
    // Get widgets from underlying DB
    Collection<Widget> res = db.getAllWidgets();
    cache.put("ALL_WIDGETS", res);
    Map<String, List<Widget>> byCat = new HashMap<>();
    for (Widget w : res) {
        // cache by different types of method calls, i.e. by category
        if (!byCat.containsKey(w.getCategory())) {
            byCat.put(w.getCategory(), new ArrayList<Widget>());
        }
        byCat.get(w.getCategory()).add(w);
    }
    cacheCategories(byCat);
    return res;
}

Collection<Widget> getWidgetsByCategory(String categoryCode) {
    CategoryCacheKey key = new CategoryCacheKey(categoryCode);
    Entity ent = cache.get(key);
    if (ent == null) {
        // Cache miss: reload everything, then look the key up again
        loadCache();
        ent = cache.get(key);
    }
    return ent == null ? Collections.<Widget>emptyList() : (Collection<Widget>) ent.getValue();
}

NOTE: I have not worked with a cache manager, the above code illustrates cache as some object that may hold caches by key/value pairs, though it's not modelled on any specific implementation.

Using this I have the benefit of being able to cache all objects in the different ways they will be requested, with only a single instance of each object on the heap, whereas if I were to cache the method invocations via, say, Spring, it would (I believe) cache multiple copies of the objects.
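To make the idea concrete, here is a minimal, self-contained sketch of "load once, index several ways, share the instances". It uses only the JDK rather than a cache manager. Every name in it (the reduced Widget class, WidgetSnapshot, WidgetIndexCache, the Supplier that stands in for db.getAllWidgets()) is an assumption made for illustration, not part of any real API.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical, reduced domain object; the real Widget will have more fields.
final class Widget {
    private final int id;
    private final String category;
    private final String status;

    Widget(int id, String category, String status) {
        this.id = id;
        this.category = category;
        this.status = status;
    }

    int getId() { return id; }
    String getCategory() { return category; }
    String getStatus() { return status; }
}

// One load of all widgets plus indexes whose lists point at the same instances.
final class WidgetSnapshot {
    final List<Widget> all;
    final Map<String, List<Widget>> byCategory = new HashMap<>();
    final Map<String, List<Widget>> byStatus = new HashMap<>();

    WidgetSnapshot(Collection<Widget> widgets) {
        this.all = Collections.unmodifiableList(new ArrayList<>(widgets));
        for (Widget w : all) {
            byCategory.computeIfAbsent(w.getCategory(), k -> new ArrayList<>()).add(w);
            byStatus.computeIfAbsent(w.getStatus(), k -> new ArrayList<>()).add(w);
        }
    }
}

final class WidgetIndexCache {
    private final Supplier<Collection<Widget>> loader; // stands in for db.getAllWidgets()
    private volatile WidgetSnapshot snapshot;          // null means "not loaded"

    WidgetIndexCache(Supplier<Collection<Widget>> loader) {
        this.loader = loader;
    }

    // Loads the snapshot on first use; double-checked so only one thread loads.
    private WidgetSnapshot current() {
        WidgetSnapshot s = snapshot;
        if (s == null) {
            synchronized (this) {
                s = snapshot;
                if (s == null) {
                    s = new WidgetSnapshot(loader.get());
                    snapshot = s;
                }
            }
        }
        return s;
    }

    Collection<Widget> getAllWidgets() {
        return current().all;
    }

    Collection<Widget> getWidgetsByCategory(String categoryCode) {
        List<Widget> hit = current().byCategory.getOrDefault(categoryCode, Collections.emptyList());
        return Collections.unmodifiableList(hit);
    }

    Collection<Widget> getWidgetsByStatus(String status) {
        List<Widget> hit = current().byStatus.getOrDefault(status, Collections.emptyList());
        return Collections.unmodifiableList(hit);
    }

    // Drops the whole snapshot; the next read reloads everything.
    void invalidate() {
        snapshot = null;
    }
}
```

Because the whole snapshot is replaced at once, readers never see a half-built index. The trade-off is that any change in the DB forces a full reload, which matters when the DB can change without notice.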

I really wish to try and understand the best ways to cache domain objects before I go down the wrong path and make it harder for myself later. I have read the documentation on the Ehcache website and found various articles of interest, but nothing to give a good solid technique.

Since I'm working with an ERP system, some DB calls are very complicated. It's not that the DB is slow, but the business representation of the domain objects makes access very clumsy. On top of that, the information this application consolidates into a single view can live in 11 different DBs, which makes caching quite important.

Best Answer

To be useful a cache must provide access to data faster than it can be retrieved from the database. Given that most database calls involve a network roundtrip it makes sense to cache whenever it is known that the output (value) will not change for the same input (key) over a known dimension (time, size, etc).

Thus if the inputs are unpredictable, or undecipherable, so cannot be bound to a single key then caching may not be the best solution - you'll just end up trying to write (and maintain) a poor quality database. Invest the money in a faster network instead.

It doesn't really matter if your cache contains more than one copy of a particular object so long as the key is useful to a downstream consumer. The objective of the cache is to improve performance for different consumers of the data which may approach the underlying dataset from different standpoints (say Customer-centric or Invoice-centric).

In terms of implementation at a small scale, you might want to look at the Cache classes in the Guava library instead of EhCache. A typical example of a self-populating cache would be:

LoadingCache<Key, Graph> graphs = CacheBuilder.newBuilder()
       .maximumSize(1000)
       .expireAfterWrite(10, TimeUnit.MINUTES)
       .removalListener(MY_LISTENER)
       .build(
           new CacheLoader<Key, Graph>() {
             public Graph load(Key key) throws AnyException {
               return createExpensiveGraph(key);
             }
           });

As you can see it is very straightforward to work with and Guava provides a wide variety of cache eviction strategies (size, reference, age etc). The Guava library also provides a wealth of useful utility classes that augment those found in the JDK so your overall codebase will benefit.

Thus your approach of decorating your DAO methods with a class that combines a dedicated cache and the DAO resultset with a derived key is sound. Each call to the method causes an initial local key lookup before returning which is what you appear to be looking for. Couple this with an appropriate eviction strategy tuned for each method and you have a simple to understand and maintain solution that should scale well.
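As a rough illustration of that decorator shape, here is a JDK-only sketch. It deliberately does not use Guava, so that it stays self-contained; a size-bounded LinkedHashMap stands in for the cache and its eviction policy. The names CategoryDao, CachingCategoryDao, LruCache and the "byCategory:" key prefix are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Small size-bounded cache with least-recently-used eviction.
final class LruCache<K, V> {
    private final Map<K, V> map;

    LruCache(final int maxEntries) {
        // accessOrder = true makes iteration order "least recently accessed first".
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached value, or loads and stores it on a miss.
    // The lock is held while loading, which keeps the sketch simple but
    // serializes slow loads; null results are not treated as cached.
    synchronized V get(K key, Function<K, V> loader) {
        V value = map.get(key);
        if (value == null) {
            value = loader.apply(key);
            map.put(key, value);
        }
        return value;
    }

    synchronized void invalidate(K key) {
        map.remove(key);
    }
}

// Hypothetical DAO with a single query method, for illustration only.
interface CategoryDao {
    List<String> findNamesByCategory(String categoryCode);
}

// Decorator: same interface, key derived from method name and argument.
final class CachingCategoryDao implements CategoryDao {
    private final CategoryDao delegate;
    private final LruCache<String, List<String>> cache;

    CachingCategoryDao(CategoryDao delegate, int maxEntries) {
        this.delegate = delegate;
        this.cache = new LruCache<>(maxEntries);
    }

    @Override
    public List<String> findNamesByCategory(String categoryCode) {
        String key = "byCategory:" + categoryCode;
        return cache.get(key, k -> delegate.findNamesByCategory(categoryCode));
    }
}
```

Callers keep depending on the DAO interface, so the cache can be added, tuned or removed per method without touching them.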

Related Solutions

Database – What’s the best way to cache a growing database table for html generation

Unless I'm misunderstanding the question, I don't think that this is an appropriate scenario for caching.

Cached data normally has at least one of the following attributes (usually all of them):

Expensive to retrieve or compute;
Highly static - may change occasionally but very rarely;
Non-critical - OK if the requester sees stale data.

It doesn't sound like any of these apply to your situation.

The query is a simple SELECT, probably TOP N, just an index seek;
It changes very frequently;
Your requirements indicate that immediate updates are required.

So why are you caching? Caching isn't a panacea; oftentimes it can actually make performance worse, if the cache memory could be better used for some other purpose.

Databases do their own caching. As long as the DB server has plenty of memory then it may cache the entire table in memory if it's frequently queried; the performance of that will be just as good as your cache if not better.

Some further thoughts/suggestions:

If stale data is OK, then the simplest solution would be to use a fixed interval (i.e. expiration). This method is used very effectively in hundreds of thousands of sites and systems. You can either force an update on expiration or just wait until it's requested again.
If you're concerned about conflicts between reads and writes, then (a) don't be, until you've profiled it, and (b) if it really is an issue then instead of trying to cache it, just use a redundant table or a NOLOCK hint.
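The fixed-interval option can be sketched in a few lines of plain Java. This is only an illustration under assumptions: ExpiringValue is a made-up class name, and the clock is passed in as a LongSupplier so the behaviour can be checked without sleeping (in real code it could be System::currentTimeMillis).

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Holds one value and reloads it lazily once it is older than ttlMillis.
final class ExpiringValue<T> {
    private final Supplier<T> loader;
    private final long ttlMillis;
    private final LongSupplier clock; // returns "now" in milliseconds

    private T value;
    private long loadedAt;
    private boolean loaded;

    ExpiringValue(Supplier<T> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // "Wait until it's requested again": the reload happens on the first
    // read after expiry, not on a background timer.
    synchronized T get() {
        long now = clock.getAsLong();
        if (!loaded || now - loadedAt >= ttlMillis) {
            value = loader.get();
            loadedAt = now;
            loaded = true;
        }
        return value;
    }
}
```

Readers may see data up to ttlMillis old, which is exactly the "stale data is OK" condition this option depends on.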

If you need to invalidate the cache every single time a row is added/changed then you have completely defeated the purpose of an application cache, and are now trying to implement an in-memory database. Please don't do this unless you have an extremely good reason for it.

Caching strategies for entities and collections

I have experimented with different approaches to object caching, and I see advantages to an approach where collections are cached as references rather than actual objects.

An example:

User GetUser(int ID);
ICollection<User> GetRecentUsers(int amount);
ICollection<User> GetActiveUsers(int amount);
void Update(User user);

In this example, if all three "Get"-methods load, cache and return actual user data, the data will go stale whenever an update is made to an object contained in one of the collections.

Delegating cache responsibility

Flushing everything every time you do an update however will result in a pretty useless cache. So you need some way of knowing what has gone stale. Unfortunately, keeping track of this is no easy task.

Rather than trying to keep track of what data is located where, you could create a scheme where you delegate handling of data and keep cache logic in a single place.

In the example above GetUser(int ID) could be the only method allowed to actually load and cache user objects.

GetRecentUsers() and GetActiveUsers() would then be responsible for loading only a list of user IDs and calling GetUser() to actually populate a collection of users before returning the result.

In this way, when you update a user object, you only have to invalidate exactly one cached object.

The list of IDs held by GetRecentUsers will be invalidated only when a new user joins the site etc.
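Here is a small Java sketch of that delegation scheme (Java rather than the C#-style signatures above, and with made-up names: UserStore stands in for the database layer, UserRepository for the caching service). It is one possible shape under those assumptions, not a reference implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, reduced user object.
final class User {
    private final int id;
    private final String name;

    User(int id, String name) {
        this.id = id;
        this.name = name;
    }

    int getId() { return id; }
    String getName() { return name; }
}

// Hypothetical database layer.
interface UserStore {
    User loadUser(int id);
    List<Integer> loadRecentUserIds(int amount);
    void save(User user);
}

final class UserRepository {
    private final UserStore store;
    private final Map<Integer, User> usersById = new HashMap<>();
    private final Map<Integer, List<Integer>> recentIdsByAmount = new HashMap<>();

    UserRepository(UserStore store) {
        this.store = store;
    }

    // The only method that loads and caches user objects.
    synchronized User getUser(int id) {
        return usersById.computeIfAbsent(id, store::loadUser);
    }

    // Caches only the list of IDs and delegates to getUser for the objects.
    synchronized List<User> getRecentUsers(int amount) {
        List<Integer> ids = recentIdsByAmount.computeIfAbsent(amount, store::loadRecentUserIds);
        List<User> users = new ArrayList<>(ids.size());
        for (int id : ids) {
            users.add(getUser(id));
        }
        return users;
    }

    // An update invalidates exactly one cached object; ID lists stay valid.
    synchronized void update(User user) {
        store.save(user);
        usersById.remove(user.getId());
    }

    // Called when membership of the list changes, e.g. a new user joins.
    synchronized void userJoined() {
        recentIdsByAmount.clear();
    }
}
```

After update(), the next getRecentUsers() call reuses the cached ID list and reloads only the one user that changed.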

This gives you a cleaner approach to keeping your cache tidy. It does however introduce its own set of problems.

Pitfalls

If you start with a cold cache and load the 10 most recent users, you will have an N+1 problem on your hands: one query to fetch the list of user IDs, and 10 additional queries to populate the cache.

This can be a big problem depending on your data and setup. You can however take measures against these problems by doing pre-warming of the cache, or allowing for batch operations (which again introduces complexity).
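One way to picture the batch option: collect the IDs that are not cached yet and fetch them in a single call. The sketch below is only an assumption of how that might look in Java. BatchingCache is a made-up name, and batchLoader is a hypothetical function that would run one query for many keys (for example a WHERE id IN (...) query).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class BatchingCache<K, V> {
    // Hypothetical bulk loader: one round trip for all requested keys.
    private final Function<Set<K>, Map<K, V>> batchLoader;
    private final Map<K, V> cache = new HashMap<>();

    BatchingCache(Function<Set<K>, Map<K, V>> batchLoader) {
        this.batchLoader = batchLoader;
    }

    // Returns values in the order of the requested keys. Keys the loader
    // does not return come back as null and are asked for again next time.
    synchronized List<V> getAll(List<K> keys) {
        Set<K> missing = new LinkedHashSet<>();
        for (K key : keys) {
            if (!cache.containsKey(key)) {
                missing.add(key);
            }
        }
        if (!missing.isEmpty()) {
            // One bulk call instead of one call per missing key.
            cache.putAll(batchLoader.apply(missing));
        }
        List<V> result = new ArrayList<>(keys.size());
        for (K key : keys) {
            result.add(cache.get(key));
        }
        return result;
    }

    synchronized void invalidate(K key) {
        cache.remove(key);
    }
}
```

With a cold cache, loading 10 users costs one query for the ID list plus one bulk query, rather than 11 queries.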

There is no silver bullet. But I find the delegation approach useful when implemented with care.
