REST API – Caching Strategy for Record Collections

api-design · http · rest

I am designing a REST API for my mobile clients to interact with our app server (built with Django/django-rest-framework if it makes any difference).

There are a number of different objects accessible through the API: some changing frequently (say daily), some almost never changing (on average less than once per month), and some for which only some nested records change (think of a blog post to which we add new comments a few times a day).

Because the clients are sensitive to data transfer volume (for cost reasons: mobile data in a developing country), I want to limit this, especially when they download a list of objects (e.g. the list of blog post objects mentioned earlier). Data transfer is by far my biggest concern here, long before server-side load.

I have thought of using something similar to the HTTP If-Modified-Since header (https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html, section 14.25), which could work on individual object requests like GET /api/blogposts/<id>/. But with high network latency (ping times of over 500 ms are common), running dozens or hundreds of such requests seems like a bad idea.

To obtain a collection of records, I would expect the following behaviour to help more in my case (the requests I'm talking about are similar to what is described in this answer: tailored collections per user):

GET /api/myblogposts/ would initially return a JSON list of objects, not just the IDs:

    [
      {"id": "post1", ...},
      {"id": "post2", ...},
      ...
      {"id": "postN", ...}
    ]

Then a subsequent GET on the same URL with an appropriate header, If-Modified-Since: Sat, 29 Oct 2016 19:43:31 GMT, would filter the list to return only records modified since then. The client can then merge the changes into its local data store.
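
In code, the flow I'm imagining looks roughly like this (a minimal sketch using the Python requests library; the endpoint is the hypothetical one above, and the header-driven filtering is the behaviour I'm proposing, not an existing standard):

    import requests

    BASE = "https://example.com/api/myblogposts/"

    # Initial full fetch: index the records locally by id.
    posts = {p["id"]: p for p in requests.get(BASE).json()}

    # Later sync: ask only for records modified since the last fetch.
    resp = requests.get(
        BASE,
        headers={"If-Modified-Since": "Sat, 29 Oct 2016 19:43:31 GMT"},
    )
    for p in resp.json():   # the server would return only changed records
        posts[p["id"]] = p  # merge into the local store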

Does this strategy seem to make sense? Is there some existing standard for client and server to negotiate what subset of records to transfer?

Best Answer

If you really think about it, a record that has not changed is in a sense already "cached", because its updated_at timestamp has not changed; thus, your intuition about only fetching records that have been modified is the best way to go about this "caching". That said, I wouldn't really call it "caching", but rather "selective retrieval".

However, as @Joeri Sebrechts mentioned in his comment, using HTTP headers in a non-standard fashion is a really good way to annoy the maintainers of your code as they struggle to figure out why you're using If-Modified-Since like a query parameter to filter records. In fact, that's exactly why he suggested using a query parameter instead: query parameters exist for exactly this purpose, and I fully agree.

So the solution here is to:

  1. Initially, fetch all records that you need (e.g. at startup)
    • store the time of that GET on the client as a timestamp, changed-after (or whatever you want to name it)
    • make sure that the record id is included so you can do some fancy merging with existing records later
  2. When the client needs to retrieve new records or update the list, simply send another GET to e.g. /records?changed-after=THE_STORED_TIMESTAMP (a server-side sketch follows this list)
    • your API will only retrieve records with updated_at > changed-after
    • send those records back to the client
  3. On the client, do a merge operation on your existing list of records
    • do not delete records from the list
    • simply take the set of new records, find them in the old list, and replace them
    • leave the rest of the list unmodified
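
To make step 2 concrete, here is a minimal django-rest-framework sketch (the question mentions DRF); the BlogPost model, the BlogPostSerializer, and the auto-updating updated_at field are assumptions for illustration, not a definitive implementation:

    from django.utils.dateparse import parse_datetime
    from rest_framework import generics

    # Hypothetical app/model/serializer names; the model is assumed to
    # have an auto-updating `updated_at` datetime field.
    from myapp.models import BlogPost
    from myapp.serializers import BlogPostSerializer

    class BlogPostListView(generics.ListAPIView):
        serializer_class = BlogPostSerializer

        def get_queryset(self):
            queryset = BlogPost.objects.all()
            changed_after = self.request.query_params.get("changed-after")
            if changed_after:
                ts = parse_datetime(changed_after)  # None if unparseable
                if ts is not None:
                    # Only records modified after the client's last sync.
                    queryset = queryset.filter(updated_at__gt=ts)
            return queryset

The merge in step 3 is then just a dictionary update keyed by record id on the client, as sketched in the question.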

Some other applications use websockets to communicate changes to clients: the server detects a change in a record and pings all clients that an update is available for retrieval. In my opinion, that is the more "efficient" way to do things when you have millions upon millions of records that might take a long time to query, and you have the bandwidth available for websockets. Instead of having clients constantly querying for updates that may or may not be available (with the possibility of those queries being expensive), the server simply tells clients when they need to update.

However, we don't know anything about the quantity or complexity of your data, and the simple fact that you have a high-latency, low-bandwidth situation pretty much rules out websockets, so the changed-after query parameter filter on updated_at seems to be the most appropriate approach.

PS - If you're really, really tight on data, you could even implement a changelog of your records to know which fields changed, allowing you to selectively send only the fields that were actually updated instead of the entire record. Some frameworks/languages have libraries that do this, e.g. Rails' Paper Trail. If you think the need for very low bandwidth usage is worth adding such a dependency, I would highly recommend it. Sometimes these libraries make it ridiculously easy, like Paper Trail's methods to diff versions, which give you only the data that changed. So you could send only that data, along with the record's id, and selectively merge on the client on a field-by-field basis instead of a whole-record basis. Neat!
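
Paper Trail is a Rails gem, but the idea is framework-agnostic. As a hand-rolled illustration (the snapshots and field names here are made up), the payload such a changelog lets you send could be computed like this:

    def field_diff(old: dict, new: dict) -> dict:
        """Return only the fields of `new` that differ from `old`."""
        return {k: v for k, v in new.items() if old.get(k) != v}

    old = {"id": 7, "title": "Hello", "body": "First draft", "likes": 3}
    new = {"id": 7, "title": "Hello", "body": "Final draft", "likes": 5}

    payload = field_diff(old, new)
    payload["id"] = new["id"]  # always ship the id so the client can merge
    # payload == {"body": "Final draft", "likes": 5, "id": 7}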