Linux – Detection and Handling of Errors in Hard Disks/SSDs

hard-drive, linux, mdadm, raid, ssd

When an error occurs on a drive, is it correct to assume that it will always be detected and reported to the OS (if software RAID such as mdadm) or to the RAID controller (if hardware RAID) as a failed read, i.e. that it won't silently return corrupted data? And will the RAID software/controller then use the other drive(s) in the array to read the data instead (assuming it's a RAID level with redundancy)?

From what I understand, modern enterprise-grade drives have error detection schemes in place, so I'm assuming this is the case, but I had trouble finding anything conclusive online. I imagine the answer hinges to some degree on the quality of the error detection built into the drive, so if it matters, I'm most interested in the Intel DC S3500 series SSDs.

EDIT 5-Jun-2015 – clarification:

Specifically, I'm wondering whether the algorithms used today for error detection are bulletproof. As a simple example, if error detection were just an XOR over all the bits in the sector, then two flipped bits would go undetected. I imagine the real schemes are far more advanced than that, but I wonder what the odds are of an error going undetected, whether they're so low that we need not worry about it, and whether there's an authoritative source or trustworthy article on this that could be cited.
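
To make that concrete, here is a toy Python sketch (purely illustrative, not what drives actually implement) showing that a single XOR/parity bit misses a two-bit flip, while even a generic CRC-32 catches that particular error:

    # Toy illustration only: a single parity bit misses any even number of
    # bit flips, while a stronger code such as CRC-32 catches this two-bit error.
    import zlib

    def parity_bit(data: bytes) -> int:
        """XOR of all bits: 0 if the number of set bits is even, else 1."""
        p = 0
        for byte in data:
            p ^= bin(byte).count("1") & 1
        return p

    sector = bytearray(512)          # pretend 512-byte sector
    good_parity = parity_bit(sector)
    good_crc = zlib.crc32(sector)

    sector[100] ^= 0b00000011        # flip two bits in the same byte

    print(parity_bit(sector) == good_parity)   # True  -> parity misses the error
    print(zlib.crc32(sector) == good_crc)      # False -> CRC-32 detects it

No fixed-length checksum is literally bulletproof (there is always some residual probability of an undetected error), but the codes drives actually use make that probability vastly smaller than a single parity bit would.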

EDIT 10-Jun-2015

Updated the question title and body to make them about disk errors in general (not centered on mdadm as the question originally was).

Best Answer

Hard drives do have a multitude of error correction methods in place to prevent data corruption. Drives are divided into sectors, some of which may become completely unwritable/unreadable, while others may return wrong data - let's call the first case bad sector corruption and the second silent data corruption.

Bad Sector Corruption

The first kind is already handled by the drive itself in a number of ways. At the factory, every manufactured drive is tested for bad sectors, which are recorded in the Primary Defect List (p-list). During normal use, the drive's internal systems may find more bad sectors through ordinary wear and tear; these are added to the Grown Defect List (g-list). Some drives have even more lists, but these two are the most common ones.

The drive itself counters these errors by remapping access from bad sectors to spare sectors without notifying the operating system. However, every time a remap happens, the corresponding values in the drive's SMART data increase, indicating growing wear of the drive. The indicator to watch is SMART 5 (Reallocated Sector Count); other important ones are 187 (Reported Uncorrectable Errors), 197 (Current Pending Sector Count) and 198 (Offline Uncorrectable).
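
On Linux these counters can be read with smartmontools; below is a minimal sketch that assumes smartctl is installed, the device exposes classic ATA SMART attributes, and /dev/sda is the drive of interest (NVMe and SAS devices report health differently, so treat this as illustrative only):

    # Pull the reallocation/pending counters discussed above from "smartctl -A".
    import subprocess

    WATCHED = {"5", "187", "197", "198"}   # attribute IDs mentioned above

    def watched_attributes(device):
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True).stdout
        result = {}
        for line in out.splitlines():
            fields = line.split()
            # Attribute rows start with the numeric ID, e.g.
            #   5 Reallocated_Sector_Ct ... RAW_VALUE
            if fields and fields[0] in WATCHED:
                result[fields[1]] = fields[-1]   # name -> raw value (last column)
        return result

    if __name__ == "__main__":
        print(watched_attributes("/dev/sda"))

A steadily growing Reallocated Sector Count or a non-zero Current Pending Sector Count is the usual early warning that a drive should be replaced.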

To find bad sectors, hard drives use internal error-correcting codes (ECC), which let the drive verify the integrity of the data in a specific sector. That way it can detect write and read errors in a sector and update the g-list if necessary.
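
As a rough illustration of how an ECC can both detect and repair an error, here is a toy Hamming(7,4) code in Python; real drives use far stronger codes (e.g. Reed-Solomon or LDPC) over entire sectors, so this only sketches the principle:

    # Hamming(7,4): 4 data bits + 3 parity bits; corrects any single-bit error.

    def encode(d):                       # d = [d1, d2, d3, d4]
        p1 = d[0] ^ d[1] ^ d[3]          # covers codeword positions 1,3,5,7
        p2 = d[0] ^ d[2] ^ d[3]          # covers positions 2,3,6,7
        p4 = d[1] ^ d[2] ^ d[3]          # covers positions 4,5,6,7
        return [p1, p2, d[0], p4, d[1], d[2], d[3]]   # positions 1..7

    def decode(c):
        # The recomputed parities form a syndrome whose binary value is the
        # 1-based position of a single flipped bit (0 means no error seen).
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 + 2 * s2 + 4 * s4
        if syndrome:
            c = c.copy()
            c[syndrome - 1] ^= 1         # repair the flipped bit
        return [c[2], c[4], c[5], c[6]], syndrome

    codeword = encode([1, 0, 1, 1])
    codeword[5] ^= 1                     # simulate one flipped bit on the medium
    data, syndrome = decode(codeword)
    print(data, "error at position", syndrome)   # [1, 0, 1, 1] error at position 6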


Silent Data Corruption

Since there is quite a lot of internal data integrity checking, silent data corruption should be very uncommon - after all, a drive's one job is to persist data reliably, and it should do that job correctly.

To keep silent data corruption outside of user-requested reads and writes to a minimum, RAID systems periodically read the drives in full, check the sector ECCs and update the g-list (data scrubbing). If an error is found, the data is reconstructed from the other RAID member(s) after checking their sectors' ECC.
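
With Linux md (mdadm) arrays, such a scrub can be started and inspected through sysfs; a minimal sketch, assuming an array named md0 and root privileges (many distributions run exactly this "check" action from a monthly cron or systemd job):

    # Trigger an md scrub and report its status via the md driver's sysfs files.
    from pathlib import Path

    MD = Path("/sys/block/md0/md")       # assumed array name

    def start_check():
        # "check" makes md read every stripe and compare the redundancy;
        # "repair" would additionally rewrite inconsistent stripes.
        (MD / "sync_action").write_text("check\n")

    def scrub_status():
        action = (MD / "sync_action").read_text().strip()     # e.g. "check" or "idle"
        mismatches = int((MD / "mismatch_cnt").read_text())    # sectors found inconsistent
        return action, mismatches

    if __name__ == "__main__":
        start_check()
        print(scrub_status())

A non-zero mismatch count after a scrub completes is the signal to investigate further.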

However, all this error correction and integrity checking has to be performed somewhere - in the firmware. Bugs in these low-level programs may still lead to problems, as may mechanical faults and false positives in the ECC checks. An example would be an unchecked (lost) write, where the firmware erroneously reports a successful write while the actual write to the medium did not occur or was faulty (an identity discrepancy).
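
One way an upper layer can notice such a lost or misdirected write is an end-to-end checksum, which is the same idea ZFS and Btrfs apply per block. A hypothetical application-level sketch (the file names are made up, and a read served from the page cache may still hide the problem until the data really comes from disk):

    # Store a checksum next to the data and verify it on every read, so a lost
    # or misdirected write surfaces as a checksum mismatch rather than as
    # silently wrong data.
    import hashlib, json, os

    def write_with_checksum(path, data):
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        with open(path + ".sum", "w") as f:
            json.dump({"sha256": hashlib.sha256(data).hexdigest()}, f)

    def read_verified(path):
        with open(path, "rb") as f:
            data = f.read()
        with open(path + ".sum") as f:
            expected = json.load(f)["sha256"]
        if hashlib.sha256(data).hexdigest() != expected:
            raise IOError("checksum mismatch on %s: possible silent corruption" % path)
        return data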

There are studies on the statistical occurrence of these failures, in which a file system's own data integrity check reported a failure without the underlying drive reporting any problem - in other words, silent data corruption.

TLDR: over a 17-month period covering 1.5 million disks, fewer than 0.3% of consumer disks and fewer than 0.02% of enterprise disks on average contained such identity discrepancies (365 disks in total) - see Table 10 and Section 5 in this publication.
