Tar File Checksum – Creating a Tar File with Checksums Included

archive, checksum, linux

Here's my problem: I need to archive a large amount of data (up to 60 TB) of big files (usually 30 to 40 GB each) to tar files. I would like to make checksums (md5, sha1, whatever) of these files before archiving; however, reading every file only once (rather than once for checksumming and again for tar'ing) is more or less a necessity to achieve very high archiving performance (LTO-4 wants 120 MB/s sustained, and the backup window is limited).

So I'd need some way to read a file, feeding a checksumming tool on one side and building a tar to tape on the other, something along the lines of:

tar cf - files | tee tarfile.tar | md5sum -

Except that I don't want the checksum of the whole archive (which is what this sample shell code computes), but a checksum for each individual file in the archive.

I've studied the GNU tar, Pax, and Star options. I've looked at the source of Archive::Tar. I see no obvious way to achieve this. It looks like I'll have to hand-build something in C or similar. Perl/Python/etc. simply won't cut it performance-wise, and the various tar programs lack the necessary "plugin architecture". Does anyone know of an existing solution before I start code-churning?
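For the record, the one-pass design described above can be sketched in a few lines of Python using the standard tarfile and hashlib modules: wrap each file in a reader that updates a hash on every chunk tarfile pulls, so the single read that feeds the archive also produces the per-file checksum. This is only a sketch of the technique (the names `HashingReader` and `tar_with_checksums` are made up for illustration), not a claim that it will reach LTO-4 speeds:

```python
import hashlib
import tarfile

class HashingReader:
    """File-like wrapper that updates a hash with every chunk read,
    so tarfile's single pass over the file also yields the checksum."""
    def __init__(self, fileobj, hasher):
        self._f = fileobj
        self._h = hasher

    def read(self, size=-1):
        chunk = self._f.read(size)
        self._h.update(chunk)
        return chunk

def tar_with_checksums(archive_path, paths, checksum_path):
    """Create a tar archive and an md5sum-style checksum file,
    reading each input file exactly once."""
    with tarfile.open(archive_path, "w") as tar, \
         open(checksum_path, "w") as sums:
        for path in paths:
            md5 = hashlib.md5()
            info = tar.gettarinfo(path)
            with open(path, "rb") as f:
                # tarfile reads info.size bytes through the wrapper,
                # hashing them on the way into the archive.
                tar.addfile(info, HashingReader(f, md5))
            sums.write(f"{md5.hexdigest()}  {path}\n")
```

The resulting checksum file is in the usual `md5sum` format, so it can later be verified with `md5sum -c` against restored files.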

Best Answer

Before going ahead and rewriting tar, you may want to profile the quick-and-easy method of reading the data twice, as it may not be much slower than doing it in one pass.

The two-pass method is implemented here:

http://www.g-loaded.eu/2007/12/01/veritar-verify-checksums-of-files-within-a-tar-archive/

with the one-liner:

  tar -cvpf mybackup.tar myfiles/ | \
    xargs -I '{}' sh -c "test -f '{}' && md5sum '{}'" | tee mybackup.md5

While it's true that md5sum reads each file from disk in parallel with tar, instead of getting the data streamed through the pipe, the Linux disk cache should turn this second read into a simple read from a memory buffer, which shouldn't really be slower than reading from stdin. You just need to make sure the disk cache is large enough to hold each file for long enough that the second reader is always reading from the cache, and never falls far enough behind to have to go back to disk.
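The cache argument can be pushed further by interleaving the two reads per file, so the checksum read happens immediately after the file has been archived, while its pages are still resident. A minimal sketch (the helper name `two_pass` is made up; Python's tarfile/hashlib stand in for tar and md5sum):

```python
import hashlib
import tarfile

def two_pass(archive_path, paths, checksum_path):
    """Per-file two-pass approach: archive each file, then re-read it
    for the checksum right away, while it is likely still in cache."""
    with tarfile.open(archive_path, "w") as tar, \
         open(checksum_path, "w") as sums:
        for path in paths:
            tar.add(path)                  # first read: into the archive
            md5 = hashlib.md5()
            with open(path, "rb") as f:    # second read: ideally from cache
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    md5.update(chunk)
            sums.write(f"{md5.hexdigest()}  {path}\n")
```

With 30–40 GB files this only helps if the cache can hold a whole file; otherwise the tail of the file may already be evicted by the time the second read starts.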
