RAID – mdadm Did Not Notice a Failed Disk in RAID0

I have a server with mdadm raid0:

# mdadm --version
mdadm - v3.1.4 - 31st August 2010
# uname -a
Linux orkan 2.6.32-5-amd64 #1 SMP Sun Sep 23 10:07:46 UTC 2012 x86_64 GNU/Linux

One of the disks has failed:

# grep sdf /var/log/kern.log | head
Jan 30 19:08:06 orkan kernel: [163492.873861] sd 2:0:9:0: [sdf] Unhandled error code
Jan 30 19:08:06 orkan kernel: [163492.873869] sd 2:0:9:0: [sdf] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 30 19:08:06 orkan kernel: [163492.873874] sd 2:0:9:0: [sdf] Sense Key : Hardware Error [deferred] 

Right now in dmesg I can see:

Jan 31 15:59:49 orkan kernel: [238587.307760] sd 2:0:9:0: rejecting I/O to offline device
Jan 31 15:59:49 orkan kernel: [238587.307859] sd 2:0:9:0: rejecting I/O to offline device
Jan 31 16:03:58 orkan kernel: [238836.627865] __ratelimit: 10 callbacks suppressed
Jan 31 16:03:58 orkan kernel: [238836.627872] mdadm: sending ioctl 1261 to a partition!
Jan 31 16:03:58 orkan kernel: [238836.627878] mdadm: sending ioctl 1261 to a partition!
Jan 31 16:04:09 orkan kernel: [238847.215187] mdadm: sending ioctl 1261 to a partition!
Jan 31 16:04:09 orkan kernel: [238847.215195] mdadm: sending ioctl 1261 to a partition!

But mdadm has not noticed that the drive has failed:

# mdadm -D /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Thu Jan 13 15:19:05 2011
     Raid Level : raid0
     Array Size : 71682176 (68.36 GiB 73.40 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Sep 22 14:37:24 2011
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 64K

           UUID : 7e018643:d6173e01:17ab5d05:f75b494e
         Events : 0.9

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       65        1      active sync   /dev/sde1
       2       8       81        2      active sync   /dev/sdf1
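
For reference, the kernel's own summary in /proc/mdstat can be checked as well; in this situation it presumably still shows md0 as an active raid0 with all three members:

# cat /proc/mdstat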

Also, forcing reads from /dev/md0 at a few offsets supports the theory that /dev/sdf has failed, yet mdadm still does not mark the drive as failed:

# dd if=/dev/md0 of=/root/md.data bs=512 skip=255 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00367142 s, 139 kB/s

# dd if=/dev/md0 of=/root/md.data bs=512 skip=256 count=1
dd: reading `/dev/md0': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000359543 s, 0.0 kB/s

# dd if=/dev/md0 of=/root/md.data bs=512 skip=383 count=1
dd: reading `/dev/md0': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000422959 s, 0.0 kB/s

# dd if=/dev/md0 of=/root/md.data bs=512 skip=384 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000314845 s, 1.6 MB/s
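
Those offsets are consistent with the striping: with a 64 KiB chunk (128 sectors of 512 B) striped round-robin over sdb1, sde1 and sdf1 in the order shown by mdadm -D, sectors 256-383 of /dev/md0 land on /dev/sdf1, while sectors 255 and 384 land on the healthy members. A minimal sketch of that mapping (assuming plain round-robin striping and equal-sized members):

for s in 255 256 383 384; do
    chunk=$(( s / 128 ))           # 64 KiB chunk = 128 sectors of 512 B
    case $(( chunk % 3 )) in       # three members, in RaidDevice order
        0) dev=sdb1 ;;
        1) dev=sde1 ;;
        2) dev=sdf1 ;;
    esac
    echo "sector $s -> chunk $chunk -> /dev/$dev"
done

This maps sector 255 to sde1, sectors 256 and 383 to sdf1, and sector 384 back to sdb1, matching exactly which reads fail and which succeed.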

However, trying to read from /dev/sdf directly fails with:

# dd if=/dev/sdf of=/root/sdf.data bs=512 count=1
dd: opening `/dev/sdf': No such device or address
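
For completeness, the SCSI layer's view of the disk can be checked through sysfs (a sketch, assuming the device is still named sdf):

# cat /sys/block/sdf/device/state

Given the "rejecting I/O to offline device" messages above, this should report "offline".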

The data is not that important to me; I just want to understand why mdadm insists that the array is "State : clean".

Best Answer

The md(4) man page sheds some light on how the word "clean" is used (the crucial bit is where it describes when an array is marked "clean"):

Unclean Shutdown

When changes are made to a RAID1, RAID4, RAID5, RAID6, or RAID10 array there is a possibility of inconsistency for short periods of time as each update requires at least two blocks to be written to different devices, and these writes probably won't happen at exactly the same time. Thus if a system with one of these arrays is shut down in the middle of a write operation (e.g. due to power failure), the array may not be consistent.

To handle this situation, the md driver marks an array as "dirty" before writing any data to it, and marks it as "clean" when the array is being disabled, e.g. at shutdown. If the md driver finds an array to be dirty at startup, it proceeds to correct any possible inconsistency. For RAID1, this involves copying the contents of the first drive onto all other drives. For RAID4, RAID5 and RAID6 this involves recalculating the parity for each stripe and making sure that the parity block has the correct data. For RAID10 it involves copying one of the replicas of each block onto all the others. This process, known as "resynchronising" or "resync", is performed in the background. The array can still be used, though possibly with reduced performance.

If a RAID4, RAID5 or RAID6 array is degraded (missing at least one drive, two for RAID6) when it is restarted after an unclean shutdown, it cannot recalculate parity, and so it is possible that data might be undetectably corrupted. The 2.4 md driver does not alert the operator to this condition. The 2.6 md driver will fail to start an array in this condition without manual intervention, though this behaviour can be overridden by a kernel parameter.
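
The flag the man page is describing is stored in each component's superblock and is also exposed through sysfs. A sketch of how to inspect it directly (assuming the array is still assembled as /dev/md0, with the member names from the question):

# cat /sys/block/md0/md/array_state
# mdadm --examine /dev/sdb1 | grep -i state

The first should print a state such as "clean" or "active"; the second shows the "State : clean" field persisted in the on-disk superblock.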

It's plausible that a disk in the RAID failed after the RAID was safely and normally disabled by the system (at, e.g., a shutdown). In other words, the disk failure happened with the RAID in a consistent, synchronized state. The RAID would then be flagged "clean", and, when it was next enabled and one of its disks failed, the flag would remain.