File corruption (bad checksums) in large files copied to VMware guest

checksum vmware-esxi windows-server-2008-r2

In setting up a development lab, I've got a desktop system running ESXi 4.1.0 (free license) on SATA RAID 0. (It was already purchased and configured when I started this job; I'm open to hardware input as it pertains to my problem.) Its guests so far include two Windows Server 2008 R2 64-bit VMs and one Ubuntu 10.04 64-bit VM. I'm installing onto the Windows servers.

We've been copying off some fairly large files (over a gigabyte each) for an installation, hoping to install more quickly from a (virtual) hard drive than from the network or from BD-ROM. The problem is that the copies keep coming up with different checksums from the originals. The file sizes are the same, but md5sum reports different values (and so does the installer, which refuses to continue when the checksums don't match).
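For anyone wanting to repeat the comparison without md5sum on both ends, here is a minimal Python sketch of the same check. The function names `file_md5` and `copies_match` and the 1 MiB chunk size are my own choices for illustration, not anything the installer uses.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks
    so that multi-gigabyte files never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original_path, copy_path):
    """True when both files hash to the same MD5 digest."""
    return file_md5(original_path) == file_md5(copy_path)
```

Running `file_md5` on the same copy several times is also worth a try: if the digest changes between runs on an unchanged file, the read path itself is suspect rather than the copy.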

I've tried copying directly from the BD-ROM (attaching the guest OS's drive to the host system's physical drive). I've tried copying the large files onto a co-worker's Windows machine from his Blu-ray drive; when I do that, the checksums match. But when I copy from his machine to the VM guest over a network share, the checksums no longer match.
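Since the sizes match and only the contents differ, it can help to see where the two copies diverge. The sketch below is one way to do that, assuming both the good copy and the bad copy are reachable from the same machine; `differing_blocks` and the 4096-byte block size are hypothetical names and values chosen for illustration.

```python
def differing_blocks(path_a, path_b, block_size=4096):
    """Yield the starting byte offset of every block that differs
    between two files, comparing them block by block."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        offset = 0
        while True:
            block_a = a.read(block_size)
            block_b = b.read(block_size)
            if not block_a and not block_b:
                break  # both files exhausted
            if block_a != block_b:
                yield offset
            offset += block_size
```

If the offsets reported change from one copy attempt to the next, that would suggest the damage is introduced somewhere along the copy path rather than by the source file.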

Thinking this meant a corrupt destination drive, I deleted it in vSphere and added another freshly created drive. The problem persists. I'm not sure what to try next.

Best Answer

So this was a combination of a bad stick of RAM and a Linux kernel bug affecting SATA. I'd put Ubuntu 10.04 on there, and eventually left memtest86+ running all night (as running it for 1.5 passes before hadn't flushed out the problem).

After I removed the bad RAM, I started seeing SATA errors in /var/log/syslog, similar to this:

Dec  8 14:56:17 george kernel: [   36.442340] ata4.00: exception Emask 0x10 SAct 0x4 SErr 0x4010000 action 0xe frozen 
Dec  8 14:56:17 george kernel: [   36.442355] ata4.00: irq_stat 0x00400040, connection status changed 
Dec  8 14:56:17 george kernel: [   36.442366] ata4: SError: { PHYRdyChg DevExch } 
Dec  8 14:56:17 george kernel: [   36.442375] ata4.00: failed command: READ FPDMA QUEUED 
Dec  8 14:56:17 george kernel: [   36.442388] ata4.00: cmd 60/08:10:88:a9:87/00:00:1b:00:00/40 tag 2 ncq 4096 in 
Dec  8 14:56:17 george kernel: [   36.442389]          res 40/00:64:30:aa:8b/00:00:12:00:00/40 Emask 0x10 (ATA bus error) 
Dec  8 14:56:17 george kernel: [   36.442408] ata4.00: status: { DRDY } 
Dec  8 14:56:17 george kernel: [   36.442418] ata4: hard resetting link 
Dec  8 14:56:23 george kernel: [   41.724689] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300) 
Dec  8 14:56:24 george kernel: [   42.445422] ata4.00: configured for UDMA/133 
Dec  8 14:56:24 george kernel: [   42.445432] ata4: EH complete
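To get a feel for how often this happens, the log can be tallied per port. This is a rough Python sketch that only recognizes the three message types visible in the excerpt above; the regular expression and the name `count_ata_events` are my own, and other kernels may word these messages differently.

```python
import re
from collections import Counter

# Matches syslog kernel lines such as
#   "... kernel: [   36.442340] ata4.00: exception Emask ..."
#   "... kernel: [   36.442418] ata4: hard resetting link"
ATA_EVENT = re.compile(
    r"kernel: \[\s*[\d.]+\] (ata\d+)(?:\.\d+)?: "
    r"(exception|failed command|hard resetting link)"
)


def count_ata_events(lines):
    """Count ATA error events per (port, event type) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = ATA_EVENT.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

It could be fed an open log file, for example `count_ata_events(open("/var/log/syslog"))`, where the path is whatever your distribution uses for kernel messages.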

I finally discovered this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/285892?comments=all which led me to try an earlier Linux kernel (the one that ships with Ubuntu 8.04). The machine's been working great ever since.