Is rsync a good candidate for failover implementation (very large dataset)

failover · large-data · rsync

I have a large set of data (100+ GB) which can be stored in files. Most of the files would be in the 5k–50k range (80%), then 50k–500k (15%) and >500k (5%). The maximum expected size of a file is 50 MB. If necessary, large files can be split into smaller pieces. Files can be organized in a directory structure too.

If some data must be modified, my application makes a copy, modifies it and, if successful, flags it as the latest version. Then the old version is removed. It is crash safe (so to speak).

I need to implement a failover system to keep this data available. One solution is to use a master-slave database system, but such systems are fragile and create a dependency on the database technology.

I am no sysadmin, but I read about the rsync utility. It looks very interesting. I am wondering if setting up some failover nodes and using rsync from my master is a responsible option. Has anyone tried this before successfully?

i) If yes, should I split my large files? Is rsync smart/efficient at detecting which files to copy/delete? Should I implement a specific directory structure to make this system efficient?

ii) If the master crashes and a slave takes over for an hour (for example), is making the master up-to-date again as simple as running rsync the other way round (slave to master)?

iii) Bonus question: Is there any possibility of implementing multi-master systems with rsync? Or is only master slave possible?

I am looking for advice, tips, experience, etc. Thanks!

Best Answer

Is rsync smart/efficient at detecting which files to copy/delete?

Rsync is extremely efficient at detecting and updating files. Depending on how your files change, you might find a smaller number of large files far easier to sync than lots of small files. Depending on what options you choose, on each run it is going to stat() every file on both sides, and then transfer the changes if the files are different. If only a small fraction of your files are changing, this scan for changed files can be quite expensive. A lot of factors come into play in how long rsync takes. If you are serious about trying this, you should do a lot of testing on real data to see how things work.

If the master crashes and a slave takes over for an hour (for example), is making the master up-to-date again as simple as running rsync the other way round (slave to master)?

Should be.

Is there any possibility of implementing multi-master systems with rsync?

Unison, which uses the rsync libraries, allows a bi-directional sync. It should permit updates on either side. With the correct options it can identify conflicts and save backups of any files where a change was made on both ends.
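A hypothetical Unison profile for a two-node setup might look like the following; the roots, host name, and file pattern are placeholders, not a verified production configuration:

```
# ~/.unison/data.prf (hypothetical example)
root = /var/data
root = ssh://nodeB//var/data

batch = true          # run without interactive prompting
backup = Name *       # keep a backup of any file Unison overwrites
```

Files changed on both sides between runs are reported as conflicts and, in batch mode, skipped rather than silently merged, which is why the backup preference above matters.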

Without knowing more about the specifics I can't tell you with any confidence this is the way to go. You may need to look at DRBD, or some other clustered device/filesystem approach which will sync things at a lower level.