Technique to sync medium/large amounts of data

synchronization

I implemented sync for my app (with a REST backend), it works well but there are some problems:

  1. Blocks – Since I don't want to overwrite possible user input during sync, I block the UI (progress overlay) while it runs. This of course negatively affects the user experience.
  2. Size – If the data is large, the request may take a long time, and processing the response may cause out-of-memory errors.

So I'm thinking about how to improve this. For point 2, I thought to simply paginate the request, i.e. send how many entries I want plus a reference date (e.g. creation date), which also keeps the database query simple (roughly the sketch below).
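Roughly what I have in mind, as a sketch; the /sync endpoint, the parameter names, and the helpers are all made up:

// Sketch of a cursor-based pull loop; every name here is invented.
type Entry = { id: string; createdAt: string };

// Placeholder persistence; the real app would use its local database.
let lastSyncedCreatedAt = "1970-01-01T00:00:00Z";
async function applyToLocalDatabase(entries: Entry[]): Promise<void> {
  // write the entries to the client store (placeholder)
}

async function pullChanges(pageSize: number): Promise<void> {
  while (true) {
    const response = await fetch(
      `/sync?limit=${pageSize}&createdAfter=${encodeURIComponent(lastSyncedCreatedAt)}`
    );
    const page: Entry[] = await response.json();
    if (page.length === 0) break;                 // nothing left to pull

    await applyToLocalDatabase(page);
    lastSyncedCreatedAt = page[page.length - 1].createdAt; // advance the cursor
  }
}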

But point 1 is a problem independently of what I do for point 2: as long as the user can manipulate data during sync, there's a possibility that this data will be overwritten by the sync result. I could, for example, lock the client database during sync so that only the sync process can write to it, and queue the client operations. But what happens when a queued operation tries to e.g. increment a value that sync just changed, and replaying it produces a different result than the user intended?
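To illustrate the kind of race I mean, here is a toy example (made-up code, not from my app):

// Toy illustration of the lost-update race.
let quantity = 3;                          // local value before sync starts
const queue: Array<() => void> = [];

// The user taps "+1" while the DB is locked; the op captures stale state.
const seen = quantity;                     // 3 at the time of the tap
queue.push(() => { quantity = seen + 1; });

quantity = 5;                              // sync writes the server value

for (const op of queue) op();              // replay the queued operation
console.log(quantity);                     // prints 4, not the 6 the user meant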

My app is about personal grocery management, so the requirements are not super strict; if an occasional update is lost, it's not tragic. I'm looking for a balanced solution that is easy to implement while keeping a fairly good UX.

Best Answer

Blocks

The problem you are describing in point 1 is consistency: you have two stores holding the same information and need them to agree.

In a distributed environment like this, optimistic concurrency usually wins: instead of locking, allow anything to change at any time, and build in a mechanism to detect overlaps.

First step: is there any data that can truly overlap, or is your data set really just a set of items? If the latter, and you are in a situation where items can be added by any client and replicate down, then you simply need each client to keep track of which items have not yet been pushed (sketched below).
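A minimal sketch of that bookkeeping, assuming an append-only set of items; every name here, the /items endpoint included, is invented:

// Sketch: track which locally created items still need to be pushed.
type Item = { id: string; name: string; pushed: boolean };

const localItems: Item[] = [];

function addItem(name: string): void {
  // crypto.randomUUID() exists in modern browsers and Node 19+
  localItems.push({ id: crypto.randomUUID(), name, pushed: false });
}

async function pushPending(): Promise<void> {
  for (const item of localItems.filter(i => !i.pushed)) {
    await fetch("/items", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(item),
    });
    item.pushed = true;  // mark only after the push (real code should check response.ok)
  }
}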

If there can be overlaps, for example because you are managing documents (sounds unlikely here), then you either have locking, or you allow concurrent updates and flag overlaps as conflicts to be resolved by the consumer. Examples of this pattern:

  1. Software version control systems like git, mercurial, subversion, etc.
  2. Evernote
  3. Google Docs

If you have ever worked with software version control, you'll recognize the idea of a merge conflict and/or overlap. That is the same problem you have. Source control systems like Vault avoid this problem by each client treating each file as read-only until a file is "checked out", which creates a mutually exclusive lock across all clients. Git, on the other hand, will identify conflicts, merge them if it can do so, and if not, force the end user (developer) to manually merge.
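In your app, the detection can be much cruder than a source-control merge: for example, a per-record version number that the server bumps on every write, plus a three-way check against the version you last synced. A rough sketch, with all field names assumed:

// Sketch: detect overlapping edits with a per-record version counter.
type Rec = { id: string; body: string; version: number };

type SyncOutcome =
  | { kind: "fastForward"; value: Rec }            // only one side changed
  | { kind: "conflict"; local: Rec; remote: Rec }; // both sides changed

// base = the record as of the last successful sync
function reconcile(base: Rec, local: Rec, remote: Rec): SyncOutcome {
  const localChanged = local.body !== base.body;
  const remoteChanged = remote.version !== base.version;

  if (localChanged && remoteChanged) {
    return { kind: "conflict", local, remote };    // let the user resolve it
  }
  return { kind: "fastForward", value: localChanged ? local : remote };
}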

Evernote allows my wife and me to each edit our grocery list. To minimize conflicts, we sync as often as possible, but there is always a chance of conflicting changes, and it has happened. Evernote simply appends the conflicting document versions on top of each other, with markers indicating there was a conflict.
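If you'd rather never lose data, that Evernote-style resolution is trivial to implement: keep both versions and mark them. A sketch, assuming plain-text content:

// Sketch of "keep both" conflict resolution for plain-text content.
function keepBoth(local: string, remote: string): string {
  return [
    ">>> CONFLICT: your version",
    local,
    ">>> CONFLICT: synced version",
    remote,
  ].join("\n");
}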

In a case where concurrent changes are rare, usually this sort of optimistic concurrency is ideal. In a highly concurrent transaction system, locking may be a better approach.

Size

As far as data size goes, you have two options I can think of:

  1. Pagination
  2. Streaming

Pagination is probably the way I'd go: it's usually simpler to code, it's easier to track your progress, and it gives you more control. With my recommendations for handling consistency above, hopefully the consistency risk of a multi-request sync becomes a non-issue.
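For completeness, the server side of this can stay a simple keyset query on the reference date from your question. A sketch, where the table, the columns, and the db client are all assumptions:

// The db client is a stand-in for whatever database driver you use.
declare const db: { query(sql: string, params: unknown[]): Promise<unknown[]> };

async function getPage(createdAfter: string, limit: number): Promise<unknown[]> {
  // Keyset pagination: cheap for the DB and stable as new rows arrive,
  // as long as created_at (plus a tiebreaker in real code) is indexed.
  return db.query(
    `SELECT id, name, created_at
       FROM grocery_items
      WHERE created_at > $1
      ORDER BY created_at
      LIMIT $2`,
    [createdAfter, limit]
  );
}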

Streaming may be an alternative for you as well, and it would reduce the consistency risk since you'd have a single round trip. Streaming basically means changing this data flow:

// buffer the entire result set in memory, then iterate over it
const results = await getResultsFromServer();
for (const result of results) {
  doSomething(result);
}

to:

// handle each result as it arrives; nothing is buffered
await getResultsFromServer((result) => {
  doSomething(result);
});

In the former, the entire result set must be buffered so that you can iterate over it, even though you only consume it in a forward-only manner. In the latter, you only ever need one item in memory at a time. I implemented this approach once for a large report that would overwhelm our production servers when a customer ran it for a very large date range.
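In case it helps, here is roughly what a streaming getResultsFromServer could look like over HTTP with newline-delimited JSON; the /sync/stream endpoint and the NDJSON format are my assumptions, not something your backend necessarily supports:

// Sketch: consume an NDJSON response one record at a time, so only one
// item (plus a partial line buffer) is ever held in memory.
async function getResultsFromServer(
  onResult: (result: unknown) => void
): Promise<void> {
  const response = await fetch("/sync/stream");    // endpoint is assumed
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    let newline: number;
    while ((newline = buffer.indexOf("\n")) >= 0) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (line) onResult(JSON.parse(line));        // one record per line
    }
  }
  if (buffer.trim()) onResult(JSON.parse(buffer)); // trailing record, if any
}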