Linux – How to make a Linux software RAID1 detect disc corruption

corruption, linux, raid1, software-raid

This is one of those nightmare days: a virtualized server running on a Linux SW-RAID1 hosts a VM that exhibits random segfaults in seemingly random chunks of code.

While debugging I find that a file returns a different md5sum on every run. Digging deeper, I find this: the raw disc partitions that make up the RAID1 mirror differ in 2 bits, and about 9 sectors are completely empty on one disc but filled with data on the other.

Obviously Linux returns each sector from a non-deterministically chosen disc of the mirror set, so sometimes the same sector comes back OK and sometimes the corrupted copy is returned.
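In case anyone wants to reproduce the comparison: something like the following is enough, with /dev/sda3 and /dev/sdb3 standing in for the actual mirror components. A raw cmp of the components only lines up like this if the array uses old 0.90 metadata, where the data area starts at offset 0 of each component.

# compare the two RAID1 components byte by byte; any output means the mirror halves diverge
# (/dev/sda3 and /dev/sdb3 are placeholders for the real component partitions)
cmp -l /dev/sda3 /dev/sdb3 | head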

The docs say:

RAID cannot and is not supposed to guard against data corruption on the media. Therefore, it doesn't make any sense either, to purposely corrupt data (using dd for example) on a disk to see how the RAID system will handle that. It is most likely (unless you corrupt the RAID superblock) that the RAID layer will never find out about the corruption, but your filesystem on the RAID device will be corrupted.

Thanks. That will help me sleep. :-/

Is there a way to have Linux at least detect this corruption, for example by sector checksumming or something like that? Would this be detected in a RAID5 setup? Is this the moment I wish I had used ZFS or btrfs (once it becomes usable without uber-admin capabilities)?

Edit: I am not alone.

Best Answer

You can force a check of (eg) md0 with

echo "check" > /sys/block/md0/md/sync_action

You can check the state of the test with

cat /sys/block/md0/md/sync_action

While it returns check, the check is still running; once it returns idle, you can do a

cat /sys/block/md0/md/mismatch_cnt

to see whether the mismatch count is zero or not. Many distros automate this check to run, e.g., weekly, just as most industrial hardware RAID controllers continually run it in the background (they often call it "RAID scrubbing") while the array is otherwise idle. Note that, according to the comments in Fedora's automated check script, RAID1 writes in the kernel are unbuffered, so mismatch counts can be non-zero even for a healthy array if the array is mounted.
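If your distro doesn't ship such a job, a minimal sketch of one, as an /etc/cron.d entry, might look like this (array name md0 and the schedule are just examples):

# kick off a scrub of md0 early on Sunday mornings (example schedule, example array)
30 1 * * 0  root  [ -w /sys/block/md0/md/sync_action ] && echo check > /sys/block/md0/md/sync_action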

So, if at all possible, quiescing the array by running this check while the VM is down is probably a good idea.
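Putting the steps together, a minimal one-shot scrub-and-report sketch (md0 assumed; run as root, ideally with the VM shut down) could look like:

#!/bin/sh
# Sketch: scrub md0, wait for the check to finish, then print the mismatch count.
MD=md0
echo check > /sys/block/$MD/md/sync_action
# poll until the array returns to "idle"
while [ "$(cat /sys/block/$MD/md/sync_action)" != "idle" ]; do
    sleep 60
done
echo "$MD mismatch_cnt: $(cat /sys/block/$MD/md/mismatch_cnt)"

A non-zero count tells you the mirror halves disagree somewhere; it doesn't tell you which half is right.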

I'd add that I agree with the docs when they say that

RAID cannot and is not supposed to guard against data corruption on the media

RAID is supposed to guard against complete failure of a device; guarding against incremental random failures in elements of a storage device is a job for error-checking and block-remapping, which is probably best done in the controller itself. I'm happy that the docs warn people of the limitations of RAID, especially if it's implemented on top of flaky devices. I find that frequent smartctl health checks of my drives help me to stay on top of drives which are starting to show the sort of errors that lead to out-of-sync mirrors.
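Something as simple as this is enough for those checks (device names are examples for the two mirror members):

# quick overall health verdict for each member disc
smartctl -H /dev/sda
smartctl -H /dev/sdb
# optionally start a short self-test (results later via smartctl -l selftest)
smartctl -t short /dev/sda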