Synchronizing very large folder structures

Tags: backup, rsync, synchronization, unison

We have a folder structure on our intranet which contains around 800,000 files divvied up into around 4,000 folders. We need to synchronize this to a small cluster of machines in our DMZs. The structure is very shallow (it never goes more than two levels deep).

Most of the files never change; each day there are a few thousand updated files and 1,000-2,000 new files. The data is historical reporting data that we keep after the source data has been purged (i.e. these are finalized reports whose source data is old enough that we archive and delete it). Synchronizing once per day is sufficient, given that it can happen in a reasonable time frame. Reports are generated overnight, and we sync first thing in the morning as a scheduled task.

Obviously, since so few of the files change on a regular basis, we can benefit greatly from incremental copy. We have tried rsync, but it can take eight to twelve hours just to complete the "building file list" operation. It's clear that we are rapidly outgrowing what rsync is capable of (a 12-hour window is much too long).
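For context, the job we run today is essentially a plain mirror of the whole tree, something along these lines (the paths, host, and exact options here are only illustrative):

# Full-tree mirror: rsync has to examine all ~800,000 files on every run,
# which is where the "building file list" hours go.
rsync -a --delete /intranet/reports/ dmz-host:/reports/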

We had been using another tool called RepliWeb to synchronize the structures, and it can do an incremental transfer in around 45 minutes. However, it seems we've exceeded its limits: it has started reporting files as deleted when they are not (perhaps some internal memory structure has been exhausted; we're not sure).

Has anyone else run into a large scale synchronization project of this sort? Is there something designed to handle massive file structures like this for synchronization?

Best Answer

If you can trust the filesystem last-modified timestamps, you can speed things up by combining Rsync with the UNIX/Linux 'find' utility. 'find' can assemble a list of all files that show last-modified times within the past day, and then pipe ONLY that shortened list of files/directories to Rsync. This is much faster than having Rsync compare the metadata of every single file on the sender against the remote server.

In short, the following command will execute Rsync ONLY on the list of files and directories that have changed in the last 24 hours (Rsync will NOT bother to check any other files/directories):

find /local/data/path/ -mindepth 1 -ctime -1 -print0 | xargs -0 -n 1 -I {} -- rsync -a {} remote.host:/remote/data/path/.
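Two caveats with that one-liner: it launches a separate rsync (and a separate connection) for every changed item, and because 'find' prints absolute paths, a changed file sitting two levels deep lands in the top of the destination directory rather than in its subfolder. If either of those matters, a variant along the following lines feeds Rsync a relative file list over a single connection and preserves the directory layout. This is only a sketch: it assumes GNU find (for '-printf %P') and an rsync new enough to support '--files-from'/'--from0', and it reuses the placeholder paths from above.

# Send only files changed in the last 24 hours, preserving their relative paths.
find /local/data/path/ -mindepth 1 -type f -ctime -1 -printf '%P\0' \
    | rsync -a --from0 --files-from=- /local/data/path/ remote.host:/remote/data/path/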

In case you're not familiar with the 'find' command, it recurses through a specific directory subtree, looking for files and/or directories that meet whatever criteria you specify. For example, this command:

find . -name '.svn' -type d -ctime -1 -print

will start in the current directory (".") and recurse through all sub-directories, looking for:

  • any directories ("-type d"),
  • named ".svn" ("-name '.svn'"),
  • with metadata modified in the last 24 hours ("-ctime -1").

It prints the full path name ("-print") of anything matching those criteria on the standard output. The options '-name', '-type', and '-ctime' are called "tests", and the option '-print' is called an "action". The man page for 'find' has a complete list of tests and actions.

If you want to be really clever, you can use the 'find' command's '-cnewer' test instead of '-ctime' to make this process more fault-tolerant and flexible. '-cnewer' tests whether each file/directory in the tree has had its metadata modified more recently than some reference file. Use 'touch' to create the NEXT run's reference file at the beginning of each run, right before the 'find... | rsync...' command executes. Here's the basic implementation:

#!/bin/sh
# Before the first run, create an initial reference file by hand, e.g.:
#   touch /var/run/last_rsync_run.0
curr_ref_file=$(ls /var/run/last_rsync_run.*)
# Create the reference file for the NEXT run before the transfer starts, so
# anything modified while the sync is running gets picked up again next time.
next_ref_file="/var/run/last_rsync_run.$$"
touch "$next_ref_file"
# Transfer only items whose metadata changed since the previous run.
find /local/data/path/ -mindepth 1 -cnewer "$curr_ref_file" -print0 \
    | xargs -0 -n 1 -I {} -- rsync -a {} remote.host:/remote/data/path/.
rm -f "$curr_ref_file"

This script automatically knows when it was last run, and it only transfers files modified since the last run. While this is more complicated, it protects you against situations where you might have missed running the job for more than 24 hours, due to downtime or some other error.
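Since the question mentions syncing first thing in the morning as a scheduled task, the script can be saved somewhere like /usr/local/bin/incremental_sync.sh (a hypothetical path), marked executable, and driven from cron; the schedule and log file below are just examples:

# Example crontab entry: run the incremental sync at 06:00 every day.
0 6 * * * /usr/local/bin/incremental_sync.sh >> /var/log/incremental_sync.log 2>&1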