How to Detect Hard Disk Failure

centos6.2, hardware, raid1, software-raid

I have a software RAID 1 setup on CentOS 6.2, configured so that the system can boot from either HDD if the other one fails.

Questions:

  1. How can I tell when one of the HDDs has failed, or spot early signs of failure, on CentOS? (preventive maintenance)
  2. If one of the disks fails, what needs to be done? For example, are there any data recovery methods, or how do I copy the data from the remaining HDD to the new HDD? (corrective maintenance)

I would appreciate any references you could give me.

Update:

I tried booting from each disk on its own. I removed sdb first and the system booted successfully from sda. Then I removed sda and the system booted successfully from sdb. But when I put both disks back and ran cat /proc/mdstat and mdadm -D /dev/md0, the output showed one of the disks as still removed.

Best Answer

  1. If you are lucky (and have enabled the SMART daemon, smartd) you will get SMART warnings in the logs before the disk fails. This is not guaranteed, however. In my experience, SMART errors show up before a disk fails in fewer than 50% of cases. Make sure you have something monitoring the logs.
  2. After a disk failure, you replace the disk and rebuild the array. The RAID system should recover from this. Just hope that a second disk doesn't fail while the rebuild is running...

I would strongly recommend having a good backup strategy rather than planning for data recovery. RAID is great for improving the uptime of a server, but all it takes is one little software bug and all your data is gone.
