Options to efficiently synchronize 1 million files with remote servers

optimizationrsyncsynchronization

At a company I work for we have such a thing called "playlists" which are small files ~100-300 bytes each. There's about a million of them. About 100,000 of them get changed every hour. These playlists need to be uploaded to 10 other remote servers on different continents every hour and it needs to happen quick in under 2 mins ideally. It's very important that files that are deleted on the master are also deleted on all the replicas. We currently use Linux for our infrastructure.

I was thinking about trying rsync with the -W option to copy whole files without comparing contents. I haven't tried it yet but maybe people who have more experience with rsync could tell me if it's a viable option?

What other options are worth considering?

Update: I have chosen the lsyncd option as the answer but only because it was the most popular. Other suggested alternatives are also valid in their own way.

Best Answer

Since instant updates are also acceptable, you could use lsyncd.
It watches directories (inotify) and will rsync changes to slaves.
At startup it will do a full rsync, so that will take some time, but after that only changes are transmitted.
Recursive watching of directories is possible, if a slave server is down the sync will be retried until it comes back.

If this is all in a single directory (or a static list of directories) you could also use incron.
The drawback there is that it does not allow recursive watching of folders and you need to implement the sync functionality yourself.

Related Solutions

Rsync – Difference Between –checksum and –ignore-times Options

Normally, rsync skips files when the files have identical sizes and times on the source and destination sides. This is a heuristic which is usually a good idea, as it prevents rsync from having to examine the contents of files that are very likely identical on the source and destination sides.

--ignore-times tells rsync to turn off the file-times-and-sizes heuristic, and thus unconditionally transfer ALL files from source to destination. rsync will then proceed to read every file on the source side, since it will need to either use its delta-transfer algorithm, or simply send every file in its entirety, depending on whether the --whole-file option was specified.

--checksum also modifies the file-times-and-sizes heuristic, but here it ignores times and examines only sizes. Files on the source and destination sides that differ in size are transferred, since they are obviously different. Files with the same size are checksummed (with MD5 in rsync version 3.0.0+, or with MD4 in earlier versions), and those found to have differing sums are also transferred.

In cases where the source and destination sides are mostly the same, --checksum will result in most files being checksummed on both sides. This could take long time, but the upshot is that the barest minimum of data will actually be transferred over the wire, especially if the delta-transfer algorithm is used. Of course, this is only a win if you have very slow networks, and/or very fast CPU.

--ignore-times, on the other hand, will send more data over the network, and it will cause all source files to be read, but at least it will not impose the additional burden of computing many cryptographically-strong hashsums on the source and destination CPUs. I would expect this option to perform better than --checksum when your networks are fast, and/or your CPU relatively slow.

I think I would only ever use --checksum or --ignore-times if I were transferring files to a destination where it was suspected that the contents of some files were corrupted, but whose modification times were not changed. I can't really think of any other good reason to use either option, although there are probably other use-cases.

Syncing files with rsync by file names only, ignore directories

Probably the simplest way to solve moving all files from on directory tree to a single directly would be using find with the -type and -exec options. The -type option limits the output to a specific type of directory entry (f for file, d for directory, etc.). The -exec option passes the name found (as {}) to a command line with options.

A couple examples follow:

find /directory/top/ -type f -exec rsync {} desthost:/destdir 

find /directory/top/ -type f -exec scp {} desthost:/destdir

Best Answer

Related Solutions

Rsync – Difference Between –checksum and –ignore-times Options

Syncing files with rsync by file names only, ignore directories

Related Topic