Backup Performance – How to Make Rsync of 2M Files from Remote Server Efficient

Tags: backup, performance, rsync

We have a large number of files on a remote server that I'd like to back up to a local system on a regular schedule, for extra redundancy. Some details:

  • Remote system is not in my control. I only have SSH/rsync or FTP access
  • Remote system runs rsync 2.6.6 and cannot be upgraded
  • Remote system allows a max of 25 concurrent connections and 5 are reserved for production needs (so, 20 available)
  • Remote system contains 2M files – the majority of which are 100-200K in size
  • Files are stored in a hierarchy

Similar to:

0123456789/
        0123456/
            abc/
                1.fff
                2.fff
                3.fff
            xyz/
                9.fff
                8.fff
                7.fff
9877656578/
        5674563/
            abc/
                1.fff
                2.fff
                3.fff
            xyz/
                9.fff
                8.fff
                7.fff

with tens of thousands of those root folders, each containing just a few of the internal folder/file structures, but all root folder names are strictly numeric (0-9).

I ran this with a straight rsync -aP the first time and it took 3196m20.040s. That is partly because the remote server is on rsync 2.6.6, so I can't use the incremental file-list features found in 3.x.x. Compiling the file list alone takes almost 12 hours, at roughly 500 files per 10 seconds. I don't expect subsequent runs to take as long, since the initial run had to download everything anew, but even 12 hours just for the file listing is too long.
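
For reference, the first pass was essentially just the following, wrapped in time (USER, REMOTE_SERVER, and the paths are placeholders):

time rsync -aP USER@REMOTE_SERVER:ROOT/FOLDER/PATH/ /LOCAL/DESTINATION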

The root folder names break down like this:

$ ls | grep "^[^67]" | wc -l
295
$ ls | grep "^6" | wc -l
14167
$ ls | grep "^7" | wc -l
14414

I've been testing rsync -aWP --delete-during, splitting the job up with --include="/0*/" --exclude="/*/": I run 8 of these concurrently for 0* 1* 2* 3* 4* 5* 8* 9*, and for 6 and 7 I split further into 60*-69* and 70*-79*, because the bulk of the folders in the hierarchy begin with 6 or 7 (roughly 1400 per 6?* or 7?*).
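
For reference, each of those concurrent jobs looks roughly like this, with only the include pattern changing from job to job (same placeholder paths as elsewhere in this post):

rsync -aWP --delete-during --include="/60*/" --exclude="/*/" \
    USER@REMOTE_SERVER:ROOT/FOLDER/PATH/ /LOCAL/DESTINATION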

Everything that's not a 6 or 7 takes about 5 minutes total. The 6/7 directories (broken into tenths) take about 15 minutes each.

This performs well, except that running the full job means 28 concurrent rsync processes, which more than saturates the available connection count of 20, and could saturate the network as well.

Does anyone have a recommendation for another variant of rsync, or additional options I could add, that would stop this from using so many connections at once, without forcing the whole job to run sequentially, given that one end is stuck on rsync 2.6.6?
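
For instance, something that keeps the per-prefix split but caps how many jobs run at once would fit the bill. A rough sketch of the kind of thing I mean, assuming bash (for the brace expansion) and an xargs that supports -P, with the same placeholder paths as above:

# one prefix per line: 0-5, 8, 9, plus 60-69 and 70-79 (28 in total);
# xargs keeps at most 8 rsync processes (= 8 connections) running at a time.
# rsync's -P (--partial --progress) is left off so eight progress streams don't interleave.
printf '%s\n' 0 1 2 3 4 5 8 9 6{0..9} 7{0..9} \
    | xargs -P 8 -I{} \
        rsync -aW --delete-during --include="/{}*/" --exclude="/*/" \
        USER@REMOTE_SERVER:ROOT/FOLDER/PATH/ /LOCAL/DESTINATION

Each rsync invocation opens a single connection, so that stays at 8 of the 20 available connections instead of 28.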

Edit #1: We pay for bandwidth to/from this external provider, so ideally we would only send over the wire what actually needs to be sent, and nothing more.
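
For what it's worth, since rsync -a only re-sends files whose size or modification time differs, a dry run with --stats is one way to check how much would actually cross the wire before committing to a real run, e.g. (same placeholder paths):

rsync -an --stats USER@REMOTE_SERVER:ROOT/FOLDER/PATH/ /LOCAL/DESTINATION

That still walks the full file list, though, so it only makes sense as an occasional sanity check rather than something to run before every sync.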

Best Answer

After an initial sync time of 40 hours to download and sync all of the data, a subsequent scan and sync of the same data (just to pull in updates) took only 6.5 hours. The command used for the rsync was:

rsync -a --quiet USER@REMOTE_SERVER:ROOT/FOLDER/PATH/ /LOCAL/DESTINATION

I think the long initial download time came down to two things:

  1. The initial dataset is 270GB and ~2M files, which is a lot to scan and download over the internet (in our case we have a 100 Mbit symmetric connection, and this was connecting to a large CDN provider)

  2. I had the -P and -v options enabled on the initial sync, which caused a lot of local console chatter displaying every file being synced along with progress information.

So, the answer here: just use rsync with fewer verbosity options (ideally --quiet) and it's quite efficient, even for huge datasets.
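
To close the loop on the regular-backup part of the question: the quiet form drops straight into cron. An illustrative crontab entry (the schedule and the placeholder paths are just examples):

# nightly pull at 02:00; with --quiet, cron only mails output if rsync reports errors
0 2 * * * rsync -a --quiet USER@REMOTE_SERVER:ROOT/FOLDER/PATH/ /LOCAL/DESTINATION

With a follow-up scan-and-sync taking about 6.5 hours, a once-a-day schedule leaves plenty of headroom.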