Linux – How to copy a large number (> 1 million) of small files between two servers

linux migration rsync scp

I need to migrate about 1TB of data, consisting of small files (most under 100KB), to another server. I haven't completely enumerated the files yet, but estimates are between 1 and 2 million.

The initial copy using SCP took over a week. Now we have to synchronize changes. Hundreds to thousands of files are added daily.

I've attempted using rsync (v3), but it is taking too long. By the time it finishes, the data will be out of sync again.
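
For reference, a straightforward invocation for a tree like this would be something along these lines (the paths and host name are illustrative):

    # /data and destserver are placeholders.
    # -a        archive mode: recursive, preserves permissions, ownership, times
    # -H        preserve hard links
    # -W        copy whole files; the delta algorithm rarely pays off for small
    #           files when the network is not the bottleneck
    # --delete  remove files on the target that were removed on the source
    rsync -aHW --delete /data/ destserver:/data/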

I've seen similar questions here, but they are a bit older, and I wonder if there are any newer tools to help with this process.

Issues are further complicated by the source data being on a shared iSCSI system with poor read performance.

The latest strategy may be to redo the data migration and have the developers write a tool to log all of the new files that are added during the migration process. The directory structure keys off a unique identifier and is very broad and deep, so new files are scattered throughout this structure, and rewriting the app to put new files into a specific directory will not work.
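
A minimal sketch of that approach, assuming the logging tool appends one path per line, relative to the data root (file names and paths here are hypothetical):

    # new-files.txt: hypothetical log of paths relative to /data, one per line,
    # written by the logging tool while the bulk copy runs.
    # --files-from transfers only the listed files; the last two arguments are
    # the source and destination roots the relative paths resolve against.
    rsync -a --files-from=/var/tmp/new-files.txt /data/ destserver:/data/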

Any strategies appreciated.

OS is RHEL 5 going to RHEL 6.

Best Answer

I'd be tempted to answer "stop abusing the file system by treating it like a database" but I'm sure it wouldn't help you much ;)

First, you need to understand that if your limitation is the read bandwidth available on the source, there isn't anything you can do to improve performance with a simple sync command. In that case, you'll have to split the data as it's written, either by changing the way the files are created (which means, as you guessed correctly, asking the devs to change the source program) or by using a product that does geo-mirroring (for instance Double-Take; look around, as I'm sure you'll find alternatives, that's just an example).

In cases like this, the main bottleneck typically isn't the file data itself but rather metadata access. Your first strategy should therefore be to divide the load into multiple processes that act on (completely) different directories: that should help the file system keep up with providing the metadata you need.
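
A rough sketch of that split, assuming the identifier-based tree fans out into many directories at the top level of /data (the path, host name and parallelism level are placeholders):

    # Run one rsync per top-level directory, four at a time.
    # Tune -P to whatever the shared iSCSI back end can actually sustain.
    cd /data
    ls -d */ | xargs -P4 -I{} rsync -aH --delete "{}" "destserver:/data/{}"

The point is simply that several independent rsync processes, each walking a disjoint part of the tree, overlap their metadata waits instead of serializing them.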

Another strategy is to use your backup system for that: replay your last incremental backups on the target to keep the database in sync.
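
If the existing backups happen to be GNU tar incrementals (an assumption; any backup tool that can restore an incremental onto another host would do), replaying one onto the target could look roughly like this:

    # Create a level-1 incremental containing everything changed since the
    # snapshot file was last updated (paths are placeholders).
    tar --listed-incremental=/var/backup/data.snar -cf /var/backup/incr.tar -C / data

    # Ship the archive and unpack it on the target.
    scp /var/backup/incr.tar destserver:/var/tmp/
    ssh destserver 'tar --listed-incremental=/dev/null -xf /var/tmp/incr.tar -C /'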

Finally, there are more exotic strategies that can be applied in specific cases. For instance, I solved a similar problem on a Windows site by writing a program that loaded the files into a database every few minutes, thus keeping the FS clean.