C# – How to sync data between two detached systems using a REST API

c#, event-programming, synchronization

I own system A, and another entity owns system B. I need to sync system B based on changes that took place in system A. The data being synced would be split into different sets, so I don't have to sync the entire system at once. I have to use a REST API.

What are the options to sync data? I am aware of these two ways.

1. Full sync

The first solution that comes to mind is a full sync. Although it would resolve any discrepancy between the systems, it would obviously cause high network usage and a lengthy sync if the data on system A gets too big.

2. Event based sync

The other solution would be based on events. If list 1 is updated, the related event would trigger an update of list 2. If I go with this solution I would use event-based messaging, but what happens if a message gets lost in the process?

Would a combination of event based sync and full sync (less frequent) do the job?
Are there any other methods to complete this task?

Best Answer

1. Full sync

I would recommend always developing a full sync. Even if you have a partial sync, it's not wrong to do a full sync once in a while to ensure that everything is still up to snuff.

Alternatively, you could skip scheduling the full sync and keep it manually triggerable by whoever supports the application.
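As a rough illustration of what a full sync amounts to, here is a minimal in-memory sketch. The `Item` record and the `FullSync.Apply` helper are hypothetical names, and the REST calls that would fetch the snapshot and write to the other system are left out, since the question doesn't describe the actual API.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical item shape; the real data model is not described in the question.
public sealed record Item(string Id, string Payload);

public static class FullSync
{
    // Makes the target identical to the source snapshot and reports what changed.
    public static (int Upserted, int Removed) Apply(
        IReadOnlyCollection<Item> sourceSnapshot,
        IDictionary<string, Item> target)
    {
        var sourceIds = sourceSnapshot.Select(i => i.Id).ToHashSet();

        // Remove everything the source no longer has.
        var stale = target.Keys.Where(id => !sourceIds.Contains(id)).ToList();
        foreach (var id in stale) target.Remove(id);

        // Add missing items and overwrite the ones that differ.
        var upserted = 0;
        foreach (var item in sourceSnapshot)
        {
            if (!target.TryGetValue(item.Id, out var existing) || existing != item)
            {
                target[item.Id] = item;
                upserted++;
            }
        }
        return (upserted, stale.Count);
    }
}
```

Because the whole snapshot is compared, this fixes any drift regardless of how it happened, which is exactly why it is worth keeping around even when a cheaper sync exists.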

2. Event based sync - First interpretation

I was a bit confused about your explanation here. Initially, I understood this as an event that was fired by B, so that you'd get an immediate update when a given item was updated in B.

The benefit of doing so is a very quick replication time, because you are immediately aware of any changes. It also minimizes the data transfer size.
As a downside, event-driven systems are often harder to debug.

You also run into the issue of what happens when the network connection between A and B is severed (or when A is offline) but B is still processing updates. The only way to work around this is if B were to have a message queue where missed events would be stored until you confirmed having received them, but I surmise that you have no control over B's development.
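To make the "stored until you confirmed having received them" idea concrete, here is a small sketch of such a queue on the sending side. `ChangeEvent` and `Outbox` are hypothetical names, the sequence numbering is an assumption, and the class is not thread-safe; a real system would also persist the pending events.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical event shape: a sequence number plus the id of the changed item.
public sealed record ChangeEvent(long Sequence, string ItemId);

// Sender-side buffer: events stay stored until the receiver acknowledges them.
public sealed class Outbox
{
    private readonly SortedDictionary<long, ChangeEvent> _pending = new();
    private long _nextSequence = 1;

    // Records a change and keeps it until it is acknowledged.
    public ChangeEvent Publish(string itemId)
    {
        var evt = new ChangeEvent(_nextSequence++, itemId);
        _pending[evt.Sequence] = evt;
        return evt;
    }

    // Everything not yet acknowledged, oldest first.
    // Safe to call again after an outage: unacknowledged events are still there.
    public IReadOnlyList<ChangeEvent> Pending() => _pending.Values.ToList();

    // The receiver confirms it has processed every event up to and including 'sequence'.
    public void Acknowledge(long sequence)
    {
        var done = _pending.Keys.TakeWhile(s => s <= sequence).ToList();
        foreach (var s in done) _pending.Remove(s);
    }
}
```

After a network outage the receiver would simply ask for `Pending()` again and acknowledge what it managed to process, so a lost message gets redelivered instead of silently disappearing.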

This is why you want to have a full sync once in a while. It fixes any problems that may have occurred at one point or another.

Would a combination of event based sync and full sync (less frequent) do the job?

So if this interpretation is correct, then you're right that an event-driven partial sync would be better off with an infrequent full sync to fix the unavoidable issues that may occur once in a while.

2. Event based sync - Second interpretation

But on a re-read, it started seeming more like B wasn't involved with the events. You're planning to do several sync operations, and want to chain these one after the other using events internally in your application.

There are many ways to orchestrate sync operations: synchronous, asynchronous, scheduled, triggered, event-driven, and so on.

But the focus of your question seems to be on maximizing data correctness while keeping network usage to a minimum. How you handle things internally in your application has no influence on that (barring any bugs in your code which again seems out of scope for the question at hand).


There is a third option here, but it would require B to implement this as well.

3. Differential sync.

The key difference between a differential sync and a full sync is that a differential sync only gives you the items that have been updated since the last time you synced.

This gives you the benefits of a full sync (i.e. you control when you receive the updates), but lowers network usage by removing items that need no update (since they haven't changed anyway).

Differential syncs can come in many different flavors:

  • B might track when you last did a sync. Alternatively, A might be required to pass a timestamp, so that A can choose to redo a previous sync operation by using an older timestamp, or do a full sync by omitting the timestamp. (I prefer the latter option.)
  • B might send you the full data of the updated items, or it might only send you the updated fields of the updated items. The latter is much more efficient in terms of network usage, but it's notably more complex to implement on both A's and B's side.
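The timestamp flavor could be sketched roughly like this. Everything here is an assumption for illustration: the `TrackedItem` shape, the `UpdatedAt` field, and the `since` parameter name are hypothetical, and the HTTP layer is omitted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical item shape with a "last updated" field.
public sealed record TrackedItem(string Id, string Payload, DateTimeOffset UpdatedAt);

public static class DifferentialSync
{
    // What the serving side might run for a request carrying an optional "since" timestamp.
    // A null 'since' means no timestamp was supplied, so everything is returned (a full sync).
    public static List<TrackedItem> ChangesSince(
        IEnumerable<TrackedItem> store, DateTimeOffset? since) =>
        store.Where(i => since is null || i.UpdatedAt > since.Value)
             .OrderBy(i => i.UpdatedAt)
             .ToList();

    // What the caller might run: apply the changes and return the timestamp to pass next time.
    public static DateTimeOffset? Apply(
        IEnumerable<TrackedItem> changes,
        IDictionary<string, TrackedItem> local,
        DateTimeOffset? previousSince)
    {
        var newest = previousSince;
        foreach (var item in changes)
        {
            local[item.Id] = item;
            if (newest is null || item.UpdatedAt > newest.Value) newest = item.UpdatedAt;
        }
        return newest;
    }
}
```

Since the caller owns the timestamp, replaying an older sync is just a matter of passing an older value. If the timestamps are coarse, comparing with `>=` and tolerating a few duplicate items would be the safer choice.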

Version control tools like Git or SVN rely on a similar principle when they transfer changes. It dramatically lowers data usage (because an update usually only changes a small part of the whole), but requires more data handling logic compared to a full sync (which is a simple overwriting operation).
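The extra data handling logic for sending only the updated fields might look something like the following. This is a sketch under the assumption that an item can be treated as a flat set of string fields; `FieldPatch` is a hypothetical helper, and removed fields are deliberately not handled.

```csharp
using System.Collections.Generic;

public static class FieldPatch
{
    // Sender side: returns only the fields that are new or whose value changed.
    // Limitation of this sketch: fields removed in 'after' are not reported.
    public static Dictionary<string, string> Diff(
        IReadOnlyDictionary<string, string> before,
        IReadOnlyDictionary<string, string> after)
    {
        var changed = new Dictionary<string, string>();
        foreach (var pair in after)
        {
            if (!before.TryGetValue(pair.Key, out var old) || old != pair.Value)
                changed[pair.Key] = pair.Value;
        }
        return changed;
    }

    // Receiver side: overwrites only the fields present in the patch.
    public static void Apply(
        IDictionary<string, string> target,
        IReadOnlyDictionary<string, string> patch)
    {
        foreach (var pair in patch) target[pair.Key] = pair.Value;
    }
}
```

Compared to overwriting the whole item, both sides now need to agree on what a "field" is and how a patch is applied, which is where the added complexity comes from.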

However, this requires B to develop the differential sync feature. You can't do that by yourself (unless you are currently already able to query on "last updated" fields).

Note

If items can be deleted on B's side, a differential sync can become an issue. If item X is not part of the differential sync data, that could mean either that X has been removed, or that X simply hasn't been updated since the last sync.

However, I expect most systems to implement a soft delete, which solves the issue because a soft delete is an update to the item.
If B does allow for hard deletes, then you can't use a differential sync (unless B is able to update you about deletions as well).
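A soft delete fits into a differential sync as just another change record. The sketch below assumes a hypothetical `Change` shape with an `IsDeleted` flag; how B would actually expose deletions isn't known from the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical change record: IsDeleted marks a soft delete, which is itself an update
// and therefore carries its own UpdatedAt timestamp.
public sealed record Change(string Id, string Payload, bool IsDeleted, DateTimeOffset UpdatedAt);

public static class SoftDeleteSync
{
    // Applies a differential batch: soft-deleted items are removed locally,
    // everything else is added or overwritten.
    public static void Apply(IEnumerable<Change> changes, IDictionary<string, string> local)
    {
        foreach (var change in changes)
        {
            if (change.IsDeleted)
                local.Remove(change.Id);
            else
                local[change.Id] = change.Payload;
        }
    }
}
```

An item that is absent from the batch is then unambiguous: it simply hasn't changed, because a deletion would have shown up as a record of its own.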
