Linux – Has this raid1 software array failed? (mdadm)


Long version:
I am running a Red Hat Enterprise Linux 5 (RHEL5) machine with software raid1 (mdadm).

A few days ago I went to back up some MySQL data and all of a sudden I could no longer log into the machine. I typed in a username to log in and then it would just sit there. If I pressed control sequences they would appear on the screen, but it would never log in. It also did not respond to ctrl+alt+delete, so I did a hard power down.

I booted it back up and monitored the raid1 array via:

mdadm --detail /dev/md1

This array holds the root mount point.

It began to do a resync of the array. I am not sure if this happened because of the crash or just because I did a hard power down. Either way I let it finish:

[f@mysqldatanode ~]# mdadm --detail /dev/md1
        Version : 00.90.03
  Creation Time : Thu Apr 19 15:28:52 2007
     Raid Level : raid1
     Array Size : 479893568 (457.66 GiB 491.41 GB)
    Device Size : 479893568 (457.66 GiB 491.41 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Fri Dec 25 10:03:50 2009
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : ab4849de:1f4f41c4:defd01e8:a4979ca6
         Events : 0.78

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
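For repeated checks, the fields that matter in that output can be reduced to one line. The following is only a sketch: `summarize_md_detail` is a hypothetical helper name, and it assumes the `Label : value` layout shown above.

```shell
# Hypothetical helper: condense `mdadm --detail` output (read from stdin)
# into a single status line. Assumes the "Label : value" layout shown above.
summarize_md_detail() {
    awk -F' : ' '
        function trim(s) { sub(/[ \t]+$/, "", s); return s }
        /^ *State :/          { state  = trim($2) }
        /^ *Raid Devices :/   { raid   = trim($2) }
        /^ *Active Devices :/ { active = trim($2) }
        /^ *Failed Devices :/ { failed = trim($2) }
        END { printf "state=%s active=%s/%s failed=%s\n", state, active, raid, failed }'
}

# Usage (as root):
#   mdadm --detail /dev/md1 | summarize_md_detail
```

A clean state with no failed devices only says that md considers both members in sync. It does not by itself say the underlying disks are healthy.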

I looked through some logs (/var/log/messages*) and found several messages like the one below indicating hard-drive trouble:

Dec 21 11:39:47 localhost kernel: sd 0:0:1:0: SCSI error: return code = 0x08000002
Dec 21 11:39:47 localhost kernel: sdb: Current: sense key: Medium Error
Dec 21 11:39:47 localhost kernel:     Additional sense: Unrecovered read error
Dec 21 11:39:47 localhost kernel: Info fld=0x3348912
Dec 21 11:39:47 localhost kernel: end_request: I/O error, dev sdb, sector 53774610
Dec 21 11:39:47 localhost kernel: raid1:md1: read error corrected (8 sectors at 53565760 on sdb2)
Dec 21 11:39:48 localhost kernel: raid1: sdb2: redirecting sector 53565648 to another mirror
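To see whether the errors all point at one disk, the `end_request: I/O error, dev ...` lines can be tallied per device. This is a rough sketch rather than a diagnostic tool: `count_io_errors` is a made-up name, and it assumes the log lines have the same shape as the excerpt above.

```shell
# Hypothetical helper: count kernel "I/O error" lines per device in
# syslog-style text read from stdin. Prints "<device> <count>" per line.
count_io_errors() {
    awk '/end_request: I\/O error, dev / {
             for (i = 1; i <= NF; i++)
                 if ($i == "dev") {
                     d = $(i + 1)
                     sub(/,$/, "", d)      # "sdb," -> "sdb"
                     n[d]++
                 }
         }
         END { for (d in n) print d, n[d] }' | sort
}

# Usage (log path as in the question):
#   cat /var/log/messages* | count_io_errors
```

If every count lands on the same device, that disk is the obvious suspect.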

So then I tried to scan for bad blocks with badblocks, and the machine locked up again in the same fashion.

[f@mysqldatanode ~]# badblocks -s /dev/md1
Checking for bad blocks (read-only test):               0/      479893568

So how should I go about evaluating the health of the two drives? Since the array in question holds the root mount point, do I need to move them to another machine to analyze them?

Best Answer

You can fail the /dev/sdb device through mdadm (make sure you fail the entire device, i.e. its member partition in every md array that runs off it) and then check it for errors, but from what you are describing you are most likely better off just replacing the device.
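As a sketch of what failing the entire device could look like, the function below reads `/proc/mdstat`-style text and prints the `mdadm --fail` and `--remove` commands for every array member that lives on a given disk. It only prints the commands so they can be reviewed first. `print_fail_cmds` is a hypothetical name, and the parsing assumes the usual `md1 : active raid1 sdb2[1] sda2[0]` line format.

```shell
# Hypothetical helper: given a disk name (e.g. sdb) and /proc/mdstat text on
# stdin, print the mdadm commands that would fail and remove each member
# partition of that disk. Nothing is executed here.
print_fail_cmds() {
    disk=$1
    awk -v disk="$disk" '
        $2 == ":" && $1 ~ /^md/ {
            for (i = 3; i <= NF; i++) {
                if ($i !~ /\[[0-9]+\]/) continue   # skip "active", "raid1", ...
                part = $i
                sub(/\[.*$/, "", part)             # "sdb2[1]" -> "sdb2"
                if (part ~ ("^" disk "p?[0-9]*$")) {
                    print "mdadm /dev/" $1 " --fail /dev/" part
                    print "mdadm /dev/" $1 " --remove /dev/" part
                }
            }
        }'
}

# Usage: review the output, then run the printed commands as root.
#   print_fail_cmds sdb < /proc/mdstat
```

Once the partitions are out of their arrays, the raw disk can be examined without md in the way, for example with `smartctl -a /dev/sdb` (assuming smartmontools is installed) or a read-only `badblocks -sv /dev/sdb`. The array keeps running degraded on /dev/sda in the meantime, so there is no redundancy until a member is added back.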

I have had IDE devices that failed on a regular basis. I kept re-adding the rejected device until finally the computer started hanging like you describe. Replacing the failing device solved the problem.

In either case you should make a backup as soon as possible.
