Centralized distribution/syncing of sets of large files over a local network

client-server, distribution, synchronization

Even though I am fully aware that versions of this question have been asked a googol times, I'll try not to repeat them.

I have many sets of many files (some files are small, but some are large, like ~10-20 GB). I have multiple servers, each of which can host one or more of those sets of files. Of course, one server might host 50% of the total number of sets, while another server hosts a different subset.

You can think of a set as a collection of large media files, really big image libraries, complete applications, whatever; it doesn't really matter, as long as the set contains large files.

A server can update its copy of a set at any point in time, either by replacing files in the set with completely new files, or by applying patches to some of the files, which results in almost the same files with only slight differences.

On the other side, I have many clients that should be able to obtain any given set (or multiple sets) from the servers, and keep their copies of the sets up to date (synchronized) with the sets on the server whenever they want to use a set.

The tools I have considered so far are the following:

  • rsync — It's great for syncing many small-to-medium-sized files, but less ideal for large files, since its delta algorithm reads the entire file on both sides in order to work out which parts need to be transferred (a simplified sketch of that signature/delta idea follows this list). That's fine when a file is copied for the first time, or when it has changed completely, but not so okay when, say, only 1% of a 10 GB file has changed.
  • SVN — It's great when it comes to finding differences and transferring only those deltas around, but I'm not so sure how well it does on disk usage (will the entire set take up twice the space on both client and server once the set is stored in a repository?).
  • Torrent — This one could be feasible, distribution-wise. For instance, create a torrent for each set on the server, start seeding it there, and have the clients that receive those sets continue seeding them to other clients, thus distributing the load across every computer that holds a copy of the set. However, I'm not sure it could somehow distribute differences once a set on the server gets changed… Would it require creating a new torrent for each change? Also, I don't know how torrent would behave on a local network, speed-wise (could it transfer files between one server and one client at the maximum, network-limited speed, or does it add serious protocol overhead? What about network congestion?).
  • Custom solution — Not much to add here, except that it would most likely mean reinventing the wheel, and that some existing solution would probably fit my needs if only I were aware of it.
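
To make the rsync concern above concrete, here is a minimal Python sketch of the signature/delta idea that rsync-like tools are built on. It is deliberately simplified: it only matches whole blocks at fixed offsets (rsync's rolling weak checksum can match at any byte offset), and the block size and file names are made-up placeholders. The point is just that both ends have to read the entire file even when only a tiny fraction of it changed, while only the changed blocks actually cross the wire.

```python
# Simplified signature/delta sketch (not rsync's actual algorithm).
import hashlib

BLOCK = 64 * 1024  # 64 KiB blocks (illustrative; real tools pick their own size)

def signatures(path):
    """Receiver side: read the whole old file and hash every block."""
    sigs = {}
    with open(path, "rb") as f:
        index = 0
        while True:
            block = f.read(BLOCK)
            if not block:
                break
            sigs[hashlib.md5(block).hexdigest()] = index
            index += 1
    return sigs

def delta(path, sigs):
    """Sender side: read the whole new file; emit block references or literal data."""
    ops = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK)
            if not block:
                break
            digest = hashlib.md5(block).hexdigest()
            if digest in sigs:
                ops.append(("copy", sigs[digest]))   # receiver already has this block
            else:
                ops.append(("data", block))          # only this data must be sent
    return ops

if __name__ == "__main__":
    # Hypothetical file names, purely for illustration.
    sigs = signatures("old_copy.bin")
    ops = delta("new_copy.bin", sigs)
    literal = sum(len(b) for op, b in ops if op == "data")
    print(f"{len(ops)} blocks scanned, {literal} bytes of literal data to send")
```

Even in this toy version, a 1% change to a 10 GB file means hashing all 10 GB on both machines, which is exactly the overhead described in the first bullet.
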

So, the question is: what method of distribution/synchronization (utilities, approach) would be best suited for my situation?

Best Answer

If you can safely assume that all the clients will have consistent versions, you could use an off-the-shelf binary patching tool and roll your own solution to push the diffs out to clients and apply them. If the clients have inconsistent versions, though, you'll have to read the file on the client in order to determine which diffs need to be sent (basically the rsync problem). If the clients are consistent, you can compute the diffs once and ship them out to everyone.
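
For the consistent-clients case, the flow can be as simple as the sketch below: generate one patch on the server, distribute that (small) patch however you like, and apply it on each client. I'm using xdelta3 here purely as one example of an off-the-shelf binary patcher (bsdiff/bspatch would work the same way); the file names and the distribution step are placeholders, and the whole thing assumes every client already has exactly the same old version of the file.

```python
# Sketch of "compute the diff once, ship it to every client" with an
# off-the-shelf binary patcher (xdelta3 used as an example).
import subprocess

def make_patch(old_path, new_path, patch_path):
    # Server side, run once per updated file:
    # xdelta3 -e -s OLD NEW PATCH encodes the difference between OLD and NEW.
    subprocess.run(
        ["xdelta3", "-e", "-f", "-s", old_path, new_path, patch_path],
        check=True,
    )

def apply_patch(old_path, patch_path, new_path):
    # Client side, run on every machine after the patch file arrives:
    # xdelta3 -d -s OLD PATCH NEW reconstructs NEW from OLD plus the patch.
    subprocess.run(
        ["xdelta3", "-d", "-f", "-s", old_path, patch_path, new_path],
        check=True,
    )

if __name__ == "__main__":
    # Hypothetical paths; the transport in the middle (HTTP, torrent,
    # multicast, ...) is whatever suits your network.
    make_patch("set/video.bin", "set/video_v2.bin", "video.bin.patch")
    # ...distribute video.bin.patch to clients...
    apply_patch("set/video.bin", "video.bin.patch", "set/video_v2.bin")
```

The reason this only works with consistent clients is that the patch is defined relative to one specific old version; a client whose copy has drifted would need a different diff, which is where you're back to an rsync-style comparison.
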

It sounds like you're looking for something like a multicast rsync implementation. I've never used this tool myself, but it would be worth looking into. It looks like they're only targeting Linux and Unix OSes right now.