Data Synchronization – Best Practices to Keep Different Data Sources in Sync

repository, synchronization

I'm having doubts about how to implement and maintain synchronization between two data sources in a distributed system. In my case, I have a service that checks for expired jobs in a repository. If a job has expired, it is removed from the repository and enqueued in a distributed queue (the example is in Python but should be easy to understand):

def check_expired_jobs(self):
    jobs = self._job_repository.all()

    # Enqueue each expired job's task, then delete the job.
    for job in (job for job in jobs if job.has_expired()):
        self._job_queue.enqueue(job.crawl_task)
        self._job_repository.delete(job)

My concern is that a lot can go wrong here, since both the queue and the repository are remote data sources. If the enqueue succeeds but the repository deletion fails for whatever reason, I end up with an inconsistency. This isn't the first time I've run into this kind of problem, and I want to tackle it in the best way possible.

What would be the best practice to keep several data sources/repositories in sync?

Best Answer

If the enqueue succeeds but the repository deletion fails for whatever reason, I end up with an inconsistency.

The simplest solution in this case is simply to make it a non-issue. You're designing this thing, right?

You just need to write every program that reads the repository so that it treats expired records as if they don't exist. This is actually key to any sort of fault tolerance: you need to allow the queue to grow and shrink (even a little bit) in any reasonably fault-tolerant design, so you ought to make this a rule anyway, if you haven't already. You can do it with a database view if you don't want to pollute your application code.
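
For instance, a minimal sketch of that read path in Python, assuming a repository interface like the one in your example (the active_jobs name is made up for illustration):

def active_jobs(self):
    # Hypothetical single read path: every consumer goes through
    # this, so expired records are invisible whether or not a
    # cleanup step has physically deleted them yet.
    return [job for job in self._job_repository.all()
            if not job.has_expired()]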

Then, in a separate process, occasionally copy (not move) any expired records into the distributed queue. If the copy fails, heck, just try copying them again a few seconds or a few minutes later. I say "copy, not move," because the repository shouldn't care whether an expired record exists or not.
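
A sketch of that sweep, again with names assumed from your example; the broad except is a stand-in for whatever failure your queue client can raise:

def sweep_expired(self):
    # Copy, don't move: the record stays in the repository even
    # after a successful enqueue. If the enqueue fails, do nothing;
    # the next sweep, a few seconds or minutes later, retries it.
    for job in self._job_repository.all():
        if not job.has_expired():
            continue
        try:
            self._job_queue.enqueue(job.crawl_task)
        except Exception:
            pass  # retried automatically on the next sweep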

Of course you will eventually run out of disk space if expired records pile up. So you also need a simple job, running maybe once every 24 hours, that physically deletes the expired records, if and only if they already exist somewhere in the distributed queue. You can shorten that interval if you need high performance. You can even do the deletion immediately every time you add to the distributed queue.
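
A sketch of that cleanup job; the contains call and the job.id attribute are assumptions about your queue and job APIs:

def purge_expired(self):
    # Run on a slow cadence, e.g. once a day. A record is only
    # deleted once its task is confirmed to be in the queue, so a
    # failed copy never loses work.
    for job in self._job_repository.all():
        if job.has_expired() and self._job_queue.contains(job.id):
            self._job_repository.delete(job)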

The only real difficulty is ensuring that copying the expired records doesn't produce duplicates in the distributed queue. You can accomplish this very simply by tagging each job with a GUID and enforcing a uniqueness constraint on it. Very straightforward for a database working in isolation.
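
To make the idea concrete, here is a toy in-memory stand-in for an idempotent enqueue keyed on a GUID; a real distributed queue would enforce the same uniqueness constraint in its backing store:

import uuid

def new_job_id():
    # Assign a GUID when the job is created; this is the key the
    # queue deduplicates on.
    return str(uuid.uuid4())

class DedupQueue:
    def __init__(self):
        self._seen_ids = set()
        self._items = []

    def enqueue(self, job):
        # Re-copying the same expired record is a harmless no-op.
        if job.id in self._seen_ids:
            return
        self._seen_ids.add(job.id)
        self._items.append(job)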

Don't monkey with 2PC (two-phase commit) unless you are doing this for self-education. Given your requirements it is way overkill.

Best practice to keep different data sources in sync?

Best practice is KISS: keep it simple.