Failing disks in RAID array – strategy suggestions required

drive-failure, hard-drive, raid5, software-raid

I have a Linux-based software RAID 5 array. SMART has just started sending me emails reporting that one of the 5 disks has a Current Pending Sector Count of 9 and an Offline Uncorrectable Count of 9. I have done a lot of googling, and the consensus seems to be that if I overwrite the affected sectors with zeros, the disk will remap them and all will be well.

I wanted to track down which files were affected, but I had difficulty mapping sectors to files because the stack is 5 disks in RAID 5, with LUKS encryption on top of that and LVM on top of LUKS. None of my research helped me get through that tangle.

In the end, my plan was simply to fail the drive and re-add it so that the array rebuilds.

Before doing that, I ran 'long' SMART self-tests on the other disks in the array. All were perfect apart from one, which had a Reallocated Sector Count of 82,82,36,764.

So 2 out of 5 drives have issues.
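To compare the same counters across every member of an array, a minimal Python sketch along these lines can pull the three attributes mentioned above out of `smartctl -A` output. The sample text below is illustrative only (just the two counts of 9 come from the question), and the ten-column layout is an assumption that holds for typical ATA drives but varies between vendors.

```python
import re

# Attribute labels as smartctl usually prints them for ATA drives.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)

# Illustrative sample, not real output from the disks in the question.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       9
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       9
"""


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for the watched attributes."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumed row layout:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            match = re.match(r"\d+", fields[9])
            if match:
                found[fields[1]] = int(match.group())
    return found


def suspect_attributes(text):
    """Keep only the watched attributes whose raw value is non-zero."""
    return {k: v for k, v in parse_smart_attributes(text).items() if v > 0}


if __name__ == "__main__":
    print(suspect_attributes(SAMPLE))
```

In practice the text would come from running `smartctl -A` against each member disk; any disk for which `suspect_attributes` returns a non-empty result deserves a closer look.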

At this point I am a little confused as to the best approach to trying to flush these errors out, if it is at all possible.

Does anyone have any advice? I am happy to replace failing drives where necessary, but would like to try to get the data straight first.

Best Answer

This is the general process. See the mdadm man page and your own local configuration for the exact commands to use, if you don't already know them.

  1. Pray.

  2. Verify that your backup is current. Run it manually if necessary. If you don't have backups, make one now.

  3. Fail the drive with the pending and offline uncorrectable sectors. The other drive, the one with reallocated sectors, will live a little longer, hopefully long enough to complete this process, but this drive is at the point where it could kill your entire array.

  4. Replace the drive. In hardware. Partition the new drive and add it to the mdraid array.

  5. Rebuild the array and wait for the rebuild to complete. In newer versions of mdraid, the rebuild will start automatically.

  6. Repeat the process with the second drive.
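As a rough illustration of steps 3 to 5, the Python sketch below only builds the `mdadm` command lines and parses rebuild progress from `/proc/mdstat`-style text; it executes nothing. The device names (`/dev/md0`, `/dev/sdb1`) and the sample mdstat text are hypothetical, and the helper names are made up for this example. Check everything against your own array before running any command.

```python
import re

# Illustrative /proc/mdstat content during a rebuild (not real output).
SAMPLE_MDSTAT = """\
md0 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1]
      7813523456 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [_UUUU]
      [=>...................]  recovery =  8.5% (166218112/1953380864) finish=212.4min speed=140232K/sec
"""


def fail_and_remove(array, member):
    """Step 3: mark the member faulty, then remove it from the array."""
    return [
        ["mdadm", "--manage", array, "--fail", member],
        ["mdadm", "--manage", array, "--remove", member],
    ]


def add_member(array, member):
    """Step 4, after the new drive is installed and partitioned by hand."""
    return ["mdadm", "--manage", array, "--add", member]


def rebuild_progress(mdstat_text):
    """Step 5: return the recovery/resync percentage, or None if idle."""
    match = re.search(r"(?:recovery|resync)\s*=\s*([\d.]+)%", mdstat_text)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    # Hypothetical device names; substitute your own.
    for cmd in fail_and_remove("/dev/md0", "/dev/sdb1"):
        print(" ".join(cmd))
    print(" ".join(add_member("/dev/md0", "/dev/sdb1")))
    print("rebuild at", rebuild_progress(SAMPLE_MDSTAT), "percent")
```

Keeping the commands as plain lists makes it easy to review them before passing each one to `subprocess.run` as root. Once `rebuild_progress` returns `None` and the array shows all members up, the same sequence can be repeated for the second drive.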