MegaRAID storage manager update and now tons of media errors

megaraid raid10

I was just poking around a 5-year-old server and noticed that MegaRAID Storage Manager (14.08.01) appeared to be unresponsive. The server has been running for something like 400 days without a reboot.

I didn't want to reboot it, so I installed the new version (17.05.00), and it seemed to go in fine. As soon as I launched MSM, it started reporting "Unexpected sense unrecovered read error" on disk 0.

I ordered an express RMA drive from WD and then launched a consistency check. Now I am seeing the same error (but far less frequently) on another drive as well. I have four drives in RAID 10 plus one hot spare. One of the drives has 156 media errors and the other has 10. Am I screwed?

Should I fail the drive that has the most media errors and try to rebuild?

Best Answer

Check your filesystems after repairing your array, in case there was silent data corruption.

You can lose two entire drives in a four-drive RAID 10, as long as they are not both in the same mirror. Depending on which of those drives are failing, you may not be screwed one bit. Make sure the two drives are members of different RAID 1 pairs. If they are, you're almost certainly fine. You also have a hot spare, which most controllers can use as "spillover" space, though I'm not certain yours will because I don't know which controller it is.
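To make the "which drives" point concrete, here is a minimal sketch that enumerates every two-drive failure in a four-drive RAID 10. The layout (drives 0+1 mirrored, drives 2+3 mirrored) is an assumption for illustration; your controller's span layout may pair the drives differently, so check it in MSM before relying on this.

```python
from itertools import combinations

# Hypothetical layout: drives 0+1 form one mirror, drives 2+3 the other.
MIRROR_PAIRS = [(0, 1), (2, 3)]


def array_survives(failed, pairs=MIRROR_PAIRS):
    """Return True if every mirror pair still has at least one healthy member."""
    failed = set(failed)
    return all(any(d not in failed for d in pair) for pair in pairs)


def two_drive_outcomes(pairs=MIRROR_PAIRS):
    """Map each possible two-drive failure to whether the array survives it."""
    drives = sorted(d for pair in pairs for d in pair)
    return {combo: array_survives(combo, pairs) for combo in combinations(drives, 2)}


if __name__ == "__main__":
    for combo, ok in two_drive_outcomes().items():
        print(combo, "survives" if ok else "array lost")
```

With this layout, four of the six possible two-drive failures are survivable. The two that are not are the ones that take out both halves of the same mirror.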

Even if your controller does not use a hot spare as scratch or emergency space, it should still have been doing patrol reads regularly, which may have detected these issues and relocated the affected data areas. Your controller log is a good place to check whether that happened during at least the last few patrol reads. I've no idea how old these media errors are, though.

Regarding your adapter, if you're not running manufacturer "certified" drives in your controller, your controller won't necessarily be so intelligent about ejecting members when they begin to fail - typically only being able to eject them when they drop out or report a serious SMART failure. However, a drive can have been going bad for quite some time before triggering its overall SMART health report.
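Since a drive can degrade for a while before SMART trips, it can help to watch the per-drive media error counters yourself. The sketch below parses text shaped like a controller CLI's physical-drive listing. The field names (`Slot Number:`, `Media Error Count:`) follow what MegaCli's `-PDList` output typically prints, but treat them as an assumption and compare against what your tool actually emits. The sample text is made up for illustration: the counts 156 and 10 come from the question, while the slot numbers are guesses.

```python
import re

# Illustrative sample only, not real controller output.
SAMPLE = """\
Slot Number: 0
Media Error Count: 156
Other Error Count: 0
Slot Number: 1
Media Error Count: 0
Other Error Count: 0
Slot Number: 2
Media Error Count: 10
Other Error Count: 0
"""

SLOT_RE = re.compile(r"\s*Slot Number:\s*(\d+)")
MEDIA_RE = re.compile(r"\s*Media Error Count:\s*(\d+)")


def media_errors_by_slot(text):
    """Return {slot: media_error_count} from a physical-drive listing."""
    counts = {}
    slot = None
    for line in text.splitlines():
        m = SLOT_RE.match(line)
        if m:
            slot = int(m.group(1))
            continue
        m = MEDIA_RE.match(line)
        if m and slot is not None:
            counts[slot] = int(m.group(1))
    return counts


def drives_over_threshold(counts, threshold=0):
    """Return the slots whose media error count exceeds the threshold."""
    return sorted(s for s, n in counts.items() if n > threshold)


if __name__ == "__main__":
    counts = media_errors_by_slot(SAMPLE)
    print(counts)
    print("needs attention:", drives_over_threshold(counts))
```

Run something like this on a schedule and compare against the previous result. A counter that keeps climbing is a stronger signal than a single nonzero value.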

Even if it's not fine, perform the rebuild, then run a consistency check and a filesystem check. You'll also see filesystem I/O errors in dmesg if you've actually been hitting filesystem-level corruption. Worst case, you'll need to restore some files or the whole array from backup. Rebuild one disk at a time, never both at once. Start by replacing the disk in the worst shape (the one with the most media errors).
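The ordering advice above boils down to "worst drive first, and never start a second rebuild while one is running". A small sketch of that decision follows. The function names and the `rebuild_in_progress` flag are hypothetical, not part of any controller tool.

```python
def replacement_order(media_errors):
    """Return slots with media errors, worst first.

    media_errors maps slot number -> media error count.
    """
    failing = [(slot, n) for slot, n in media_errors.items() if n > 0]
    failing.sort(key=lambda item: item[1], reverse=True)
    return [slot for slot, _ in failing]


def next_step(media_errors, rebuild_in_progress):
    """Suggest the next action; only one rebuild is ever in flight."""
    if rebuild_in_progress:
        return "wait for the current rebuild to finish"
    order = replacement_order(media_errors)
    if not order:
        return "no drives reporting media errors"
    return f"replace drive in slot {order[0]}"


if __name__ == "__main__":
    # Counts from the question; the slot numbers are assumed.
    errors = {0: 156, 1: 0, 2: 10, 3: 0}
    print(replacement_order(errors))
    print(next_step(errors, rebuild_in_progress=False))
```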
