Reliable file copy (move) process – mostly Unix/Linux

Tags: files, perl, rsync

Short story: We need a rock-solid, reliable file-mover process. We have source directories, often being actively written to, that we need to move files from. The files come in pairs – a big binary and a small XML index – and we get a CTL file that defines these file bundles. A separate process operates on the files once they are in the destination directory and gets rid of them when it's done. Would rsync do the best job, or do we need to get more complex? Long story follows:

We have multiple sources to pull from: one set of directories is on a Windows machine (which does have Cygwin and an SSH daemon), and a whole pile of directories are on a set of SFTP servers (most of these are also Windows). Our destinations are a list of directories on AIX servers.

We used to use a very reliable Perl script on the Windows/Cygwin machine when it was our only source. However, we're working on getting rid of that machine, and there are now other sources – the SFTP servers – that we cannot presently run our own scripts on.

For security reasons, we can't run the copy jobs on our AIX servers – they have no access to the source servers. We currently have a homegrown Java program on a Linux machine that uses SFTP to pull from the various new SFTP source directories, copies everything to a local tmp directory, verifies that everything is present, copies that to the AIX machines, and then deletes the files from the source. However, we keep finding bugs and poorly handled error cases. None of us are Java experts, so fixing or improving this may be difficult.

Concerns for us are:

  • With a remote source (SFTP), will rsync leave alone any file still being written? Some of these files are large.
  • From reading the docs, it seems like rsync will be very good about not removing the source until the destination is reliably written. Does anyone have experience confirming or disproving this?
  • Additional info: we're also concerned about the ingestion process that operates on the files once they are in the destination directory. We don't want it operating on files while we are still copying them; it waits until the small XML index file is present, so our current copy jobs are supposed to copy the XML file last (see the sketch after this list).
  • Sometimes the network has problems, sometimes the SFTP source servers crap out on us. Sometimes we typo the config files and a destination directory doesn't exist. We never want to lose a file due to this sort of error.
  • We need good logs.
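
For reference, this is the kind of two-pass ordering we have in mind so that the XML index always lands last. It's a minimal sketch in Perl (since that's what we know); the paths, the flat staging directory, and the .xml suffix are illustrative assumptions, not our real layout:

    #!/usr/bin/perl
    # Two-pass copy so the XML index lands last: pass 1 sends everything
    # except the index files; pass 2 sends only the *.xml files.
    # Assumes a flat staging directory; paths are hypothetical.
    use strict;
    use warnings;

    my $src = 'stage/';                      # local staging directory
    my $dst = 'aixhost:/ingest/incoming/';   # destination on the AIX box

    # Pass 1: the big binaries (and anything else), but no XML indexes yet.
    system('rsync', '-a', '--exclude=*.xml', $src, $dst) == 0
        or die 'pass 1 failed: exit ' . ($? >> 8) . "\n";

    # Pass 2: the XML index files, which signal the ingester to start.
    system('rsync', '-a', '--include=*.xml', '--exclude=*', $src, $dst) == 0
        or die 'pass 2 failed: exit ' . ($? >> 8) . "\n";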

If you were presented with this, would you just script up some rsync? Or would you build or buy a tool – and if so, what would it be (or what technologies would it use)? My team and I are decent with Perl.

Best Answer

Edit: Rsync does an end-to-end check: after a file is transferred, it calculates the checksum of that file on the destination and compares it to the checksum on the source. Only when the checksums match does it declare the transfer successful. This is reflected in the final exit status – if ALL transferred files passed the test, the exit code will be 0 (success).

In a similar setup I scripted up my own solution based on rsync. It was for nightly backups, and we did not delete files automatically.

To address some of your concerns:

  • Rsync never modifies anything on the source side (unless you use the --remove-source-files option).
  • If the network goes down for long enough, rsync will give up and return an appropriate exit status. I check this in my script, and for specific exit codes (which I observed in practice by logging) I have the script retry the rsync command up to 3 times (see the sketch after this list).
  • Yes, your script should log as much as possible: timestamp, total running time, rsync exit status, and rsync's --stats output (amount transmitted). I also run find at the end of the transfer to count the number of files, and du * to get the sizes of the directories, and log those as well.
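
For what it's worth, here is a minimal sketch in Perl of that retry-and-log wrapper. The paths are hypothetical, and the set of exit codes treated as transient (10, 12, 23, 30) is an assumption on my part – check rsync(1) and your own logs for the codes that actually show up in your environment:

    #!/usr/bin/perl
    # Retry-and-log wrapper around rsync (a sketch, not production code).
    # The "transient" exit codes below are an assumption; see rsync(1).
    use strict;
    use warnings;
    use POSIX qw(strftime);

    my @cmd = ('rsync', '-a', '--stats', 'stage/', 'aixhost:/ingest/incoming/');
    my %transient = map { $_ => 1 } (10, 12, 23, 30);  # socket, protocol, partial, timeout

    open my $log, '>>', 'transfer.log' or die "can't open log: $!";

    my $rc;
    for my $attempt (1 .. 3) {
        my $start = time;
        my $out   = qx(@cmd 2>&1);          # capture --stats plus any error text
        $rc = $? >> 8;
        printf {$log} "[%s] attempt %d: exit=%d elapsed=%ds\n%s\n",
            strftime('%Y-%m-%d %H:%M:%S', localtime), $attempt,
            $rc, time - $start, $out;
        last if $rc == 0 or not $transient{$rc};  # success, or non-transient failure
        sleep 30;                                 # back off before retrying
    }
    die "rsync failed with exit code $rc\n" if $rc != 0;

Capturing rsync's combined output means the --stats block and any error text land in the same log entry, which makes post-mortems much easier.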

Basically you need to take care of a few things in the script, mainly gathering the exit status, collecting some statistics, and removing the source files only after a successful transfer.

You can trust rsync's exit status to mean that all the requested files were transferred, but you should think about how much you trust your script to hand rsync the right files (source directory) before you delete them on the source machine. Counting the files with find on the source and then on the destination, and checking that the numbers match, would be a good final check before your script deletes anything automatically – something like the sketch below.
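
A sketch of that final check; the paths and the ssh hop to the AIX host are assumptions about your layout:

    #!/usr/bin/perl
    # Final sanity check before removing source files: compare file counts
    # on both sides. Paths and the ssh hop to the AIX host are assumptions.
    use strict;
    use warnings;

    my ($src_count) = qx(find stage/ -type f | wc -l) =~ /(\d+)/;
    my ($dst_count) = qx(ssh aixhost "find /ingest/incoming/ -type f | wc -l") =~ /(\d+)/;

    if ($src_count == $dst_count) {
        # Counts match; only now is it safe to delete on the source, e.g.
        # by re-running the transfer with --remove-source-files.
        print "counts match ($src_count files); OK to remove source\n";
    } else {
        die "count mismatch: src=$src_count dst=$dst_count - NOT deleting\n";
    }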

Expect it to take 10 to 20 tries to develop and test your script. You will need to install Cygwin with the rsync and ssh clients on the Windows machines.

It's good to feel confident about an application like that by knowing exactly how it works. I've never used commercial backup software, but if you can find a rock-solid one and trust it, then go for that – it could save you a lot of time.