So this was a combination of a bad stick of RAM and a Linux kernel bug affecting SATA. I'd put Ubuntu 10.04 on the machine, and eventually left memtest86+ running all night (running it for only 1.5 passes earlier hadn't flushed out the problem).
After I removed the bad RAM, I started seeing SATA errors in /var/log/syslog, similar to this:
Dec 8 14:56:17 george kernel: [ 36.442340] ata4.00: exception Emask 0x10 SAct 0x4 SErr 0x4010000 action 0xe frozen
Dec 8 14:56:17 george kernel: [ 36.442355] ata4.00: irq_stat 0x00400040, connection status changed
Dec 8 14:56:17 george kernel: [ 36.442366] ata4: SError: { PHYRdyChg DevExch }
Dec 8 14:56:17 george kernel: [ 36.442375] ata4.00: failed command: READ FPDMA QUEUED
Dec 8 14:56:17 george kernel: [ 36.442388] ata4.00: cmd 60/08:10:88:a9:87/00:00:1b:00:00/40 tag 2 ncq 4096 in
Dec 8 14:56:17 george kernel: [ 36.442389] res 40/00:64:30:aa:8b/00:00:12:00:00/40 Emask 0x10 (ATA bus error)
Dec 8 14:56:17 george kernel: [ 36.442408] ata4.00: status: { DRDY }
Dec 8 14:56:17 george kernel: [ 36.442418] ata4: hard resetting link
Dec 8 14:56:23 george kernel: [ 41.724689] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec 8 14:56:24 george kernel: [ 42.445422] ata4.00: configured for UDMA/133
Dec 8 14:56:24 george kernel: [ 42.445432] ata4: EH complete
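If you want to check whether another machine is hitting the same thing, you can count these link resets in the kernel log. The log file and sample lines below are just for illustration; on a real system you'd point the grep at /var/log/syslog (or use journalctl -k on newer distros):

```shell
# Write a couple of sample kernel-log lines to a scratch file
# (on a real machine, grep /var/log/syslog or `journalctl -k` instead)
cat > /tmp/sata.log <<'EOF'
Dec 8 14:56:17 george kernel: [   36.442418] ata4: hard resetting link
Dec 8 14:56:23 george kernel: [   41.724689] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
EOF

# Count the resets; more than a handful per boot suggests this bug or flaky hardware
grep -c 'hard resetting link' /tmp/sata.log    # prints: 1
```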
I finally discovered this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/285892?comments=all which led me to try an earlier Linux kernel (the one that ships with Ubuntu 8.04). The machine's been working great ever since.
I think you are misunderstanding both the checksum and hard link options.
The --checksum option is described in the man page as "skip based on checksum, not mod-time & size". It means that mod-time and size are essentially ignored when deciding which files to skip, but it also means that every file is read on both sides (because the checksum can't be computed without reading the whole file).
It's important to realize that rsync computes checksums anyway when the mod-time or size differ. So --checksum causes much more work (every file is read in full) than running without it; without it, checksums are computed only for files whose mod-time or size has changed. Either way, this option only influences which files are skipped.
--checksum is typically used in backup scripts for the equivalent of a "full backup", say once a month. This ensures that any file which has changed, but in such a way that its mod-time and size remain the same, still gets correctly backed up.
The --hard-links option (from the man page): "This tells rsync to look for hard-linked files in the transfer". Note that this applies only within the transfer, so it won't detect that an existing copy of the data already lives elsewhere on the rsync server and hard-link to it. It only links files that are part of the transfer to other files in that same transfer.
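A quick way to see the "only in the transfer" behaviour (toy paths, invented for this example):

```shell
# Two names for the same data inside the source tree
rm -rf /tmp/hl-demo
mkdir -p /tmp/hl-demo/src
echo data > /tmp/hl-demo/src/a
ln /tmp/hl-demo/src/a /tmp/hl-demo/src/b

# -H preserves the link because BOTH names are part of this transfer;
# a copy of the same data elsewhere on the destination would never be linked to
rsync -aH /tmp/hl-demo/src/ /tmp/hl-demo/dst/

# Identical inode numbers on the destination show the link survived
stat -c %i /tmp/hl-demo/dst/a /tmp/hl-demo/dst/b
```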
So, if you want the new laptop's backup directory to be hard-linked to the old laptop's backup directory, you will need to remove the new laptop's backup directory and re-create it using hard links (say, via cp -al). However, if all of your file dates have changed, you're likely to run into issues with rsync re-transferring those files and breaking the hard links. You'd probably first need to rsync one laptop to the other, being careful not to rsync over data that genuinely needs to differ between them. That way the files should end up with the same dates, which will make your rsync backups much happier.
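The cp -al step looks something like this (directory names invented for the example):

```shell
# Pretend /tmp/bk/old-laptop is the existing backup tree
rm -rf /tmp/bk
mkdir -p /tmp/bk/old-laptop
echo config > /tmp/bk/old-laptop/bashrc

# Re-create the new laptop's tree as hard links into the old one:
# -a preserves attributes, -l links instead of copying the data
cp -al /tmp/bk/old-laptop /tmp/bk/new-laptop

# Link count 2 shows both names share one inode -- no extra disk space used
stat -c %h /tmp/bk/new-laptop/bashrc    # prints: 2
```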
I know you've said you read the man page, but I'd encourage you to look at it again, specifically the detailed descriptions of the --checksum and --hard-links options. You should probably also read about the --in-place option, as it may interact badly with preserving hard links.
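To see why in-place writes are dangerous with hard-linked snapshots, here's a small demonstration using plain shell redirection, which rewrites the file through the same inode, much as --in-place would (paths invented):

```shell
# Two snapshot directories sharing one inode via cp -al
rm -rf /tmp/ip
mkdir -p /tmp/ip/snap1
echo v1 > /tmp/ip/snap1/f
cp -al /tmp/ip/snap1 /tmp/ip/snap2

# Writing through one name updates the shared inode...
echo v2 > /tmp/ip/snap2/f

# ...so the "old" snapshot silently changes too
cat /tmp/ip/snap1/f    # prints: v2
```

rsync's default behaviour (write to a temporary file, then rename over the destination) breaks the link instead, which is what you want for snapshot-style backups.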
Before going ahead and rewriting tar, you may want to profile the quick-and-easy method of reading the data twice; it may not be much slower than doing it in one pass.
The two-pass method is implemented here:
http://www.g-loaded.eu/2007/12/01/veritar-verify-checksums-of-files-within-a-tar-archive/
with the one-liner:
While it's true that md5sum reads each file from disk in parallel with tar, instead of getting the data streamed through the pipe, Linux disk caching should make this second read a simple read from a memory buffer, which shouldn't really be slower than a stdin read. You just need to make sure you have enough space in your disk cache that the second reader is always reading from the cache, rather than falling far enough behind that it has to go back to disk.
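I haven't reproduced the linked one-liner here, but the two-pass idea is roughly the following (file names invented): tar does the first read and warms the page cache, then a second pass re-reads the same files through md5sum, with the reads served from memory:

```shell
# Some files to archive
rm -rf /tmp/tar-demo
mkdir -p /tmp/tar-demo && cd /tmp/tar-demo
echo "some data" > file1
echo "more data" > file2

# Pass 1: write the archive (first read, populates the page cache)
tar -cf backup.tar file1 file2

# Pass 2: re-read and checksum the same files; with enough free memory
# these reads come from the cache, not the disk
tar -tf backup.tar | xargs md5sum > backup.md5

# Later, verify the archive contents against the recorded checksums
mkdir -p extract && tar -xf backup.tar -C extract
(cd extract && md5sum -c ../backup.md5)    # reports OK for each file
```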