RAID Controller – How to Diagnose Spurious Failures

dell-perc raid

I have a Dell T7500 with a PERC H710P connected to four 3 TB drives in a RAID 5 array. Also connected to the controller are two 256 GB SSDs, not configured in an array. A Linux server is installed on one of the SSDs, and the RAID 5 array is where all my user data is stored.

The other day, upon boot, the RAID BIOS reported these errors:

Drives 01 and 03 missing
Foreign config available

I loaded the foreign config, and the drives reappeared. On the next boot, I got:

Drive 01 offline

Thinking the drive was bad, I replaced it with a new drive and rebuilt drive 01. When I next booted, the system came up OK, but a few reboots later I got:

Drive 00 offline
Foreign config available

So I read in the foreign config and forced drive 00 online.

After several more reboots, I then got:

Drive 03 offline
Foreign config available

Again I read in the foreign config and forced drive 03 online.

Now the system comes up OK. I have rebooted it many times.

Should I assume that my controller is bad?

Or, put another way, is there any possibility that this kind of behavior is caused by something other than the controller? For example, can the kernel driver muck up the drive configuration somehow?

Best Answer

Yes, I believe either your controller or the RAID backplane is bad, and I think the controller is the more likely culprit. Can you look up the firmware version of the RAID controller (not to be confused with the system BIOS, which you should also check) and compare it to what is available on Dell's site? You may find that your version is quite old and that critical issues have been resolved in newer versions. Alternatively, you could try calling Dell support, which you should certainly do if support is available! You can easily check what service contract is in force by looking up the Service Tag at support.dell.com.
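To make the reasoning behind that diagnosis concrete, here is a minimal Python sketch that tallies the failures reported in the question by drive slot. The idea is that a single bad disk tends to fail in the same slot every time, whereas failures that hop between slots suggest something the drives share (controller, backplane, cabling, or power). The event list is transcribed by hand from the question; the function names and the `min_slots` threshold are my own illustrative assumptions, not anything the controller provides.

```python
from collections import Counter

# Events transcribed from the question, in boot order.
# Each entry is (reported status, list of affected drive slots).
events = [
    ("missing", ["01", "03"]),
    ("offline", ["01"]),
    ("offline", ["00"]),
    ("offline", ["03"]),
]


def slot_failure_counts(events):
    """Count how many times each drive slot was reported as failed."""
    counts = Counter()
    for _status, slots in events:
        counts.update(slots)
    return counts


def points_to_shared_component(events, min_slots=2):
    """Rough heuristic (assumed threshold): failures spread over several
    slots hint at a shared component rather than one bad disk."""
    return len(slot_failure_counts(events)) >= min_slots


counts = slot_failure_counts(events)
print(dict(counts))
print(points_to_shared_component(events))
```

Here three of the four slots have dropped out at one time or another, which is why a shared component looks more suspicious than any single drive. This is only a rule of thumb and does not replace the controller's own logs or Dell's diagnostics.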

Two notes of caution, because you are in dangerous territory: 1) Upgrading the RAID controller firmware can sometimes result in data loss, so make sure the new version has been out for a while, and read the release notes carefully. 2) RAID 5 doesn't give you a lot of wiggle room. Either way, back up your critical data before you let any more time pass on this issue or take any substantial corrective action!
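To put a number on the "wiggle room" remark, here is a small Python sketch of the standard RAID 5 arithmetic applied to the array in the question (four 3 TB drives). The function name and return format are hypothetical; the formula itself (one drive's worth of capacity goes to parity, and the array survives exactly one drive failure) is standard for RAID 5.

```python
def raid5_summary(drive_count, drive_size_tb):
    """Return usable capacity and fault tolerance for a RAID 5 array."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return {
        # One drive's worth of space is consumed by parity.
        "usable_tb": (drive_count - 1) * drive_size_tb,
        # RAID 5 survives the loss of exactly one member drive.
        "failures_tolerated": 1,
    }


summary = raid5_summary(drive_count=4, drive_size_tb=3)
print(summary)
```

With only one failure tolerated, a second drive dropping out while the array is degraded or rebuilding takes the whole array with it, which is why the backup should come first.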