Tar File Checksum – Creating a Tar File with Checksums Included

archive, checksum, linux

Here's my problem: I need to archive a large amount of data (up to 60 TB) of big files (usually 30 to 40 GB each) to tar files. I would like to make checksums (md5, sha1, whatever) of these files before archiving; however, reading every file only once (rather than once for checksumming and again for tar'ing) is more or less a necessity to achieve very high archiving performance (LTO-4 wants 120 MB/s sustained, and the backup window is limited).

So I'd need some way to read a file, feeding a checksumming tool on one side and building a tar to tape on the other, something along the lines of:

tar cf - files | tee tarfile.tar | md5sum -

Except that I don't want the checksum of the whole archive (which is what this sample shell code computes), but a checksum for each individual file in the archive.

I've studied the GNU tar, Pax, and Star options. I've looked at the source of Archive::Tar. I see no obvious way to achieve this. It looks like I'll have to hand-build something in C or similar. Perl/Python/etc. simply won't cut it performance-wise, and the various tar programs lack the necessary "plugin architecture". Does anyone know of an existing solution before I start code-churning?
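For the record, the one-pass design described above can be sketched in a few lines of Python using the standard tarfile and hashlib modules: wrap each file in a reader that updates a hash on every chunk tarfile pulls, so the single read that feeds the archive also produces the per-file checksum. This is only a sketch of the technique (the names `HashingReader` and `tar_with_checksums` are made up for illustration), not a claim that it will reach LTO-4 speeds:

```python
import hashlib
import tarfile

class HashingReader:
    """File-like wrapper that updates a hash with every chunk read,
    so tarfile's single pass over the file also yields the checksum."""
    def __init__(self, fileobj, hasher):
        self._f = fileobj
        self._h = hasher

    def read(self, size=-1):
        chunk = self._f.read(size)
        self._h.update(chunk)
        return chunk

def tar_with_checksums(archive_path, paths, checksum_path):
    """Create a tar archive and an md5sum-style checksum file,
    reading each input file exactly once."""
    with tarfile.open(archive_path, "w") as tar, \
         open(checksum_path, "w") as sums:
        for path in paths:
            md5 = hashlib.md5()
            info = tar.gettarinfo(path)
            with open(path, "rb") as f:
                # tarfile reads info.size bytes through the wrapper,
                # hashing them on the way into the archive.
                tar.addfile(info, HashingReader(f, md5))
            sums.write(f"{md5.hexdigest()}  {path}\n")
```

The resulting checksum file is in the usual `md5sum` format, so it can later be verified with `md5sum -c` against restored files.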

Best Answer

Before going ahead and rewriting tar, you may want to profile the quick-and-easy method of reading the data twice, as it may not be much slower than doing it in one pass.

The two-pass method is implemented here:

http://www.g-loaded.eu/2007/12/01/veritar-verify-checksums-of-files-within-a-tar-archive/

with the one-liner:

  tar -cvpf mybackup.tar myfiles/ | \
    xargs -I '{}' sh -c "test -f '{}' && md5sum '{}'" | tee mybackup.md5

While it's true that md5sum reads each file from disk in parallel with tar, instead of getting the data streamed through the pipe, the Linux disk cache should turn this second read into a simple read from a memory buffer, which shouldn't really be slower than reading from stdin. You just need to make sure the disk cache is large enough to hold each file for long enough that the second reader is always reading from the cache, and never falls far enough behind to have to go back to disk.
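The cache argument can be pushed further by interleaving the two reads per file, so the checksum read happens immediately after the file has been archived, while its pages are still resident. A minimal sketch (the helper name `two_pass` is made up; Python's tarfile/hashlib stand in for tar and md5sum):

```python
import hashlib
import tarfile

def two_pass(archive_path, paths, checksum_path):
    """Per-file two-pass approach: archive each file, then re-read it
    for the checksum right away, while it is likely still in cache."""
    with tarfile.open(archive_path, "w") as tar, \
         open(checksum_path, "w") as sums:
        for path in paths:
            tar.add(path)                  # first read: into the archive
            md5 = hashlib.md5()
            with open(path, "rb") as f:    # second read: ideally from cache
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    md5.update(chunk)
            sums.write(f"{md5.hexdigest()}  {path}\n")
```

With 30–40 GB files this only helps if the cache can hold a whole file; otherwise the tail of the file may already be evicted by the time the second read starts.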
