Keep rsync from removing unfinished source files

rsync storage web-crawler

I have two machines, speed and mass. speed has a fast Internet connection and is running a crawler which downloads a lot of files to disk. mass has a lot of disk space. I want to move the files from speed to mass after they're done downloading. Ideally, I'd just run:

$ rsync -a --remove-source-files speed:/var/crawldir .

but I worry that rsync will unlink a source file that hasn't finished downloading yet. (I looked at the source code and I didn't see anything protecting against this.) Any suggestions?

Best Answer

It seems to me the problem is transferring a file before it's complete, not that you're deleting it.

If this is Linux, process B can unlink a file while process A still has it open. Neither process gets an error, but A is wasting its time: it keeps writing to a file that no longer has a name. So the fact that rsync deletes the source file will not, by itself, break anything.
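To see the unlink behaviour for yourself, here is a minimal sketch. It assumes a POSIX shell on a local Linux filesystem, and it plays both roles in one shell: file descriptor 3 stands in for process A (the crawler) and `rm` stands in for process B (rsync). The variable names are only for illustration.

```shell
# Stand-in for a file the crawler is still downloading.
tmpfile=$(mktemp)

# "Process A": open the file for writing on fd 3 and write part of it.
exec 3>"$tmpfile"
echo "first chunk" >&3

# "Process B": unlink the file while fd 3 is still open.
rm "$tmpfile"

# The writer can keep going: fd 3 refers to the inode, not the name.
if echo "second chunk" >&3; then
    write_status="ok"
else
    write_status="failed"
fi

# Closing the last descriptor lets the kernel free the data.
exec 3>&-

# The name is gone from the directory even though the write succeeded.
if [ -e "$tmpfile" ]; then
    name_status="present"
else
    name_status="gone"
fi

echo "write after unlink: $write_status, name: $name_status"
```

The writer never notices the unlink, which is why nothing on the crawler side would warn you that its output had been thrown away.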

The problem is that rsync deletes the source file only after it has copied it, so if the crawler is still writing the file when the copy is made, you end up with a partial file on mass.
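The partial-file effect can be reproduced locally without rsync. In this rough sketch, `cp` stands in for the transfer and the two `printf` calls stand in for a download that arrives in two chunks; the file and variable names are made up for the demonstration.

```shell
# Stand-ins for the file on speed and its copy on mass.
src=$(mktemp)
dst=$(mktemp)

# The download is in progress: only the first chunk is on disk.
printf 'chunk1\n' >> "$src"

# The copy happens now, in the middle of the download.
cp "$src" "$dst"
copy_size=$(wc -c < "$dst")

# The download finishes after the copy was taken.
printf 'chunk2\n' >> "$src"
final_size=$(wc -c < "$src")

# The copy is shorter than the finished file.
echo "copied $copy_size bytes, finished file has $final_size bytes"

rm -f "$src" "$dst"
```

With `--remove-source-files` the source name would also be unlinked at that point, so the remaining chunks would never reach mass.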

How about this: mount mass as a remote file system on speed (NFS would work), then have the crawler write its files directly to that mount.
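A rough sketch of what that setup could look like, assuming a Linux NFS server on mass and root access on both machines. The export path `/srv/crawl` and the export options are assumptions for illustration, not something from the question; adjust them to your own layout.

```shell
# On mass: add a line like this to /etc/exports (hypothetical path and options):
#   /srv/crawl  speed(rw,sync,no_subtree_check)
# then reload the export table:
sudo exportfs -ra

# On speed: mount the export at the directory the crawler writes to.
# Note that mounting over /var/crawldir hides any files already in it,
# so move those across first.
sudo mount -t nfs mass:/srv/crawl /var/crawldir
```

With this in place the crawler's downloads land on mass's disk as they are written, so there is no separate transfer step to race against.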