Can you verify whether the affected blocks and underlying bad sectors on the disk have been reallocated to the "spare sectors" area? A bad sector should be reallocated when a write operation to it fails. Verify it with smartctl:
smartctl -a /dev/sdb | grep -i reallocated
The last column contains the total number of reallocated sectors. If it is zero, try to read the bad sector:
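If you want to pick that number out in a script, the raw count is the last field of the matching smartctl line. The sample line below is hypothetical, but follows the usual SMART attribute layout:

```shell
# Hypothetical smartctl output line (usual SMART attribute layout):
line='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0'
# The raw value is the last whitespace-separated field:
count=$(echo "$line" | awk '{print $NF}')
echo "$count"    # prints 0 for this sample line
```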
hdparm --read-sector XXXXXXXX /dev/sdb
It should return an I/O error; if it does not, I would recommend skipping the next section.
The error means the sector has not been reallocated yet, so you can try to force reallocation by writing to it. Remember that any data stored in this sector will be lost after this step!
hdparm --write-sector XXXXXXXX --yes-i-know-what-i-am-doing /dev/sdb
By the way, you should be able to obtain the sector number XXXXXXXX from the kernel messages (the dmesg command, or /var/log/messages). Since you had bad blocks during resynchronisation, there should be related messages similar to:
... end_request: I/O error, dev sdb, sector 1261071601
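If you want to script this step, the sector number can be pulled out of such a line with sed; the log line below is just the example quoted above:

```shell
# Kernel log line as in the example above:
msg='end_request: I/O error, dev sdb, sector 1261071601'
# Extract the sector number (for use with hdparm --read-sector):
sector=$(echo "$msg" | sed -n 's/.*sector \([0-9]*\).*/\1/p')
echo "$sector"    # prints 1261071601
```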
Then verify it with smartctl again. Has the counter increased? If so, try to read the sector with hdparm again. This time it should read without any error, as it should now be reallocated. Done.
Finally, you can continue with mdadm and add the disk back to your degraded mirror.
Blazer, it looks like in the process of improving your question (which is now a good one, by the way), you've found your own answer. Well done, you! But there is a little more that could usefully be said.
As far as I know, that mdadm.conf will suffice for you to get automated notifications. Certainly, mine looks very little different from that, and I know from a recent failout test that I get notifications. (I'm a little curious about the second slash in /dev/md/0, but if that's what your system wrote, it's very likely right.)
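For reference, a minimal mdadm.conf along these lines might look like the fragment below; the UUID and e-mail address are placeholders, not values from your system:

```
# /etc/mdadm/mdadm.conf (illustrative fragment)
ARRAY /dev/md/0 metadata=1.2 UUID=...
MAILADDR you@gmail.com
```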
But it's axiomatic in professional sysadmin that, unless you've tested something, you can't really know that it works.
At the very least, you will want to check that you can send mail from that system, as root, to the specified gmail.com address, and receive it.
If I were you, I'd at least perform a soft failure test. You can do that with mdadm /dev/md0 -f /dev/sdb1. That will fail the second drive's partition out of the array, and should generate a formal notification to you (check your system's mail logs to see that it has gone out). Check the output of cat /proc/mdstat so you know what a half-bad array looks like. You can resync the array later with mdadm /dev/md0 -a /dev/sdb1, and check that it has synced back with another cat /proc/mdstat.
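To show what to look for, here is a sketch using a hypothetical /proc/mdstat excerpt for a degraded two-disk mirror (the exact device names and block counts are assumptions, not output from your system):

```shell
# Hypothetical /proc/mdstat excerpt after failing /dev/sdb1 out of md0:
mdstat='md0 : active raid1 sda1[0]
      976630336 blocks super 1.2 [2/1] [U_]'
# "[2/1]" means 1 of 2 members is active; the "_" in "[U_]" marks the
# missing/failed member, so its presence indicates a degraded array:
if echo "$mdstat" | grep -q '\[U_\]'; then
  echo "array is degraded"
fi
```

After a successful resync, the field should read [2/2] [UU] again.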
If you want to go the whole hog, schedule some downtime, try pulling one of the drives, and check that the system can still boot. Where the metadevice in question is the boot partition, people sometimes forget to have a GRUB boot block on both drives, so when the second one fails, their system becomes unbootable. Replace and resync the drive later.
Whatever tests you decide to do, document them, so that when there's a real failure, you know what to expect, and you can minimise the chance of pilot error trashing the second drive.
Best Answer
I would highly recommend having a good backup strategy instead of planning for data recovery. RAID is perfect for improving the uptime of a server, but all it takes is one little software bug and all your data is gone.