File corruption (bad checksums) in large files copied to VMware guest

In setting up a development lab, I've got a desktop system running ESXi 4.1.0 (free license) on SATA RAID 0 (already purchased and configured when I started this job; I'm open to hardware input as it pertains to my problem). Its guests so far include two Windows Server 2008 R2 64-bit VMs and one Ubuntu 10.04 64-bit VM. I'm installing onto the Windows servers.

We've been copying some fairly large files (over a gigabyte each) to the guests for an installation, hoping to install more quickly from a (virtual) hard drive than from the network or from BD-ROM. The problem is that the copies keep coming up with different checksums from the originals. The file sizes are the same, but md5sum reports different values, and the installer refuses to continue because its own checksum test fails too.
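The comparison described above can be sketched in Python for machines where md5sum is not handy. This is only an illustration, not what was actually run: the function names `file_md5` and `first_mismatch` and the 1 MiB chunk size are assumptions made for the example. Reading in chunks keeps memory use flat for multi-gigabyte files, and reporting where the first difference occurs shows whether repeated copies break in the same place or in different places.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_mismatch(path_a, path_b, chunk_size=1024 * 1024):
    """Return the starting offset of the first differing chunk, or None."""
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(chunk_size)
            block_b = b.read(chunk_size)
            if block_a != block_b:
                return offset
            if not block_a:
                # Both files ended at the same point with no difference found.
                return None
            offset += len(block_a)
```

Calling `first_mismatch(original, copy)` after each of several copy attempts gives an offset per attempt. If the offset moves around between attempts, that may point away from a fixed bad spot on the destination disk, though it does not identify the cause by itself.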

I've tried copying directly from the BD-ROM (attaching the guest OS's drive to the host system's physical drive). I've also tried copying the large files onto a co-worker's Windows machine from his Blu-ray drive; when I do that, the checksums match. But when I copy from his machine to the VM guest over a network share, the checksums no longer match.

Thinking this meant a corrupt destination drive, I deleted it in vSphere and added another freshly created drive. The problem persists. I'm not sure what to try next.

Best Answer

This turned out to be a combination of a bad stick of RAM and a Linux kernel bug affecting SATA. I'd put Ubuntu 10.04 on there, and eventually left memtest86+ running all night (running it for 1.5 passes earlier hadn't flushed out the problem).

After I removed the bad RAM, I started seeing SATA errors in /var/log/syslog, similar to this:

Dec 8 14:56:17 george kernel: [ 36.442340] ata4.00: exception Emask 0x10 SAct 0x4 SErr 0x4010000 action 0xe frozen
Dec 8 14:56:17 george kernel: [ 36.442355] ata4.00: irq_stat 0x00400040, connection status changed
Dec 8 14:56:17 george kernel: [ 36.442366] ata4: SError: { PHYRdyChg DevExch }
Dec 8 14:56:17 george kernel: [ 36.442375] ata4.00: failed command: READ FPDMA QUEUED
Dec 8 14:56:17 george kernel: [ 36.442388] ata4.00: cmd 60/08:10:88:a9:87/00:00:1b:00:00/40 tag 2 ncq 4096 in
Dec 8 14:56:17 george kernel: [ 36.442389] res 40/00:64:30:aa:8b/00:00:12:00:00/40 Emask 0x10 (ATA bus error)
Dec 8 14:56:17 george kernel: [ 36.442408] ata4.00: status: { DRDY }
Dec 8 14:56:17 george kernel: [ 36.442418] ata4: hard resetting link
Dec 8 14:56:23 george kernel: [ 41.724689] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec 8 14:56:24 george kernel: [ 42.445422] ata4.00: configured for UDMA/133
Dec 8 14:56:24 george kernel: [ 42.445432] ata4: EH complete

I finally discovered this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/285892?comments=all which led me to try an earlier Linux kernel (the one that ships with Ubuntu 8.04). The machine's been working great ever since.
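To check whether SATA errors of this kind keep appearing after a kernel change, the syslog can be scanned for them. The following is a rough sketch, not something from my original troubleshooting: the function name `count_ata_events` and the regular expression are assumptions based on the usual kernel log line shape (`kernel: [ timestamp] ataN...: message`), and the log path in the usage note is the Ubuntu default.

```python
import re
from collections import Counter

# Matches e.g. "kernel: [ 36.442340] ata4.00: exception Emask 0x10 ..."
# and "kernel: [ 36.442418] ata4: hard resetting link".
ATA_EVENT = re.compile(r"kernel: \[\s*\d+\.\d+\] (ata\d+)(?:\.\d+)?: (.*)")


def count_ata_events(lines):
    """Count exceptions and hard link resets per ATA port."""
    counts = Counter()
    for line in lines:
        match = ATA_EVENT.search(line)
        if not match:
            continue
        port, message = match.groups()
        if message.startswith("exception"):
            counts[(port, "exception")] += 1
        elif message.startswith("hard resetting link"):
            counts[(port, "hard reset")] += 1
    return counts
```

A typical call would be `count_ata_events(open("/var/log/syslog", errors="replace"))`. Comparing the counts before and after switching kernels gives a quick indication of whether the resets have stopped.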
