How does RAID detect a faulty HD?

raid raid1 raid5

I have been looking over RAID levels for the past 3 days and weighing up the pros and cons of hardware and software RAID controllers. I understand that RAID is not a backup solution, and I'm perfectly fine with that, though one question still remains.

How does a RAID controller, at any level from RAID 1 to RAID 6, actually detect that a hard disk drive is failing? The research I have done shows that most common hard disk drive manufacturers use ECC in their drive designs, which is supposed to protect against 1-bit failures and, to an extent, 3-bit failures.

Thinking about this, though: let's say you have RAID 1 with two identical hard disk drives. Data is read from drive 0 and, at the same time, from drive 1, but drive 1 reports an ECC read failure to the RAID controller.

Now this is the big question: with hardware RAID, what would the RAID controller do? It has received a signal from the hard disk that the read failed. It can report the hard disk drive as faulty and in need of replacement.

Or does the RAID controller seek to a different hard disk drive for the data until it gets a successful read? (Yes, a drive can report a correct read while the data is still corrupted, and RAID does not check parity or ECC on read.)

Best Answer

I asked a NetApp engineer who was giving us a talk this very question. His answer, more or less, was:

Nobody reads the checksums on reads. There's no point. Reading a checksum means you have to read the entire slice plus checksum, then compute the checksum to verify you have the correct data. Plus the orthogonal checksum if you are running RAID-6 or whatever. It is a total performance killer because it breaks the ability to randomly seek to totally different sectors on different disks at the same time.

Similarly, almost nobody reads both sides of a mirror in RAID-1, because if you only read one side you can alternate which side of the mirror you read from and get faster throughput. And if you suddenly have a mismatch, which disk do you take as correct and which do you take as broken?

All modern RAID systems depend on the on-disk controllers to signal the RAID controller that they are in distress (through SMART or the like), at which point that disk is almost always kicked out of the array. Checksums are used for rebuilding arrays, not for read-verification.
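To make the behaviour described above concrete, here is a minimal Python sketch of it. It is a toy simulation, not the logic of any real controller or of NetApp's products: the `Drive`, `Mirror`, `DriveReadError`, and `xor_blocks` names are all hypothetical, and "kicking a disk out" is reduced to adding its name to a set. It assumes the drive itself signals an unrecoverable read, which is the only thing the mirror reacts to, and it uses simple XOR parity (as in RAID 5) purely to show reconstruction of a missing block.

```python
from functools import reduce


class DriveReadError(Exception):
    """Raised by a simulated drive when its own ECC cannot recover a sector."""


class Drive:
    """Hypothetical drive: a list of sectors plus a set of unreadable LBAs."""

    def __init__(self, name, sectors, bad_sectors=()):
        self.name = name
        self.sectors = list(sectors)
        self.bad_sectors = set(bad_sectors)

    def read(self, lba):
        # The drive, not the RAID layer, decides whether the read succeeded.
        if lba in self.bad_sectors:
            raise DriveReadError(f"{self.name}: unrecoverable read at LBA {lba}")
        return self.sectors[lba]


class Mirror:
    """Toy RAID-1: reads alternate between members and are never compared."""

    def __init__(self, drives):
        self.drives = list(drives)
        self.failed = set()  # names of members kicked out of the array
        self._next = 0       # round-robin cursor

    def read(self, lba):
        for _ in range(len(self.drives)):
            drive = self.drives[self._next % len(self.drives)]
            self._next += 1
            if drive.name in self.failed:
                continue
            try:
                # Whatever a healthy-looking drive returns is trusted as-is.
                return drive.read(lba)
            except DriveReadError:
                # The drive reported distress: drop it and try the next one.
                self.failed.add(drive.name)
        raise IOError("no healthy mirror member could serve the read")


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))
```

A short usage example: the mirror only notices drive 1 when that drive raises an error, and a lost block is recovered from the surviving blocks plus parity.

```python
d0 = Drive("d0", [b"aa", b"bb"])
d1 = Drive("d1", [b"aa", b"bb"], bad_sectors={1})
mirror = Mirror([d0, d1])

mirror.read(0)         # served by d0
mirror.read(1)         # d1 reports a failure, so d0 serves the read instead
print(mirror.failed)   # d1 is now out of the array

data = [b"\x01\x02", b"\x0f\x00", b"\xf0\xff"]
parity = xor_blocks(data)
rebuilt = xor_blocks([data[0], data[2], parity])  # reconstructs data[1]
```

Note that if a drive returned wrong data without raising an error, this mirror would hand it back unchanged, which matches the point that nothing is verified on read.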