RAID – Troubleshooting a Failing 2nd Drive in RAID 1

Tags: hard-drive, linux, mdadm, raid, software-raid

I have a bit of a problem here. I have an Ubuntu Linux server set up with two SAS drives in a software RAID 1 (created with mdadm). The array runs fine for about a day: cat /proc/mdstat shows both disks active and everything healthy. Then, without warning, the second disk fails and the array drops into degraded mode.
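
A minimal sketch of how the health check described above could be automated, so the moment of failure is caught rather than discovered later. It assumes the usual /proc/mdstat layout, where a status map such as [UU] marks all members up and an underscore marks a missing one. The function name and the sample text are illustrative, not output from this server.

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose status map shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status map reads [UU] when healthy; "_" marks a missing member.
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded

# Illustrative sample only (hypothetical array name and block count).
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1](F)
      35543936 blocks [2/1] [U_]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a live system the text would come from reading /proc/mdstat, for example from a cron job that sends an alert when the returned list is not empty.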

I then remove the disk from the array, reboot the server, and re-add the disk. The array rebuilds without any problems, and I have a healthy RAID 1 again with the same disks. Then, within 12-24 hours or so, the second drive fails again.
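
For reference, the remove and re-add cycle described above might look like the following sketch. The array name /dev/md0 is an assumption, since the post does not name the array. The member /dev/sdb1 is taken from the logs. The helper names are hypothetical. The commands are only printed unless dry_run is turned off, and running them for real requires root.

```python
import subprocess

def readd_commands(array, member):
    """Build the mdadm invocations for a fail/remove/add cycle."""
    return [
        # Mark the member failed (a no-op if the kernel already did so).
        ["mdadm", "--manage", array, "--fail", member],
        # Detach it from the array.
        ["mdadm", "--manage", array, "--remove", member],
        # Add it back; this starts a resync (after the reboot, in the post's procedure).
        ["mdadm", "--manage", array, "--add", member],
    ]

def run_all(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)

if __name__ == "__main__":
    # /dev/md0 is a placeholder; substitute the real array name.
    run_all(readd_commands("/dev/md0", "/dev/sdb1"))
```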

The hard drives are brand new, so I'd like to think the hardware is OK. Here's the output I was able to capture from kern.log and syslog at the time the disk failed.

Can anyone translate this, or does anyone have an idea of what might be happening?

Thanks!

kern.log:

Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.180815] sd 2:0:0:0: Attached scsi generic sg1 type 0
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.181086] sd 2:0:1:0: Attached scsi generic sg2 type 0
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.181376] sd 2:0:1:0: [sdb] 71096640 512-byte logical blocks: (36.4 GB/33.9 GiB)
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.182584] sd 2:0:1:0: [sdb] Write Protect is off
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.182591] sd 2:0:1:0: [sdb] Mode Sense: cb 00 10 08
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.182835] sd 2:0:0:0: [sda] 71096640 512-byte logical blocks: (36.4 GB/33.9 GiB)
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.183802] sd 2:0:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.185146] sd 2:0:0:0: [sda] Write Protect is off
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.185151] sd 2:0:0:0: [sda] Mode Sense: cb 00 10 08
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.188191] sd 2:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.191403] sd 2:0:1:0: [sdb] Attached SCSI disk
Feb 28 20:34:55 CSTEP-APPS20 kernel: [    9.299351] sd 2:0:0:0: [sda] Attached SCSI disk
Mar  1 09:01:22 CSTEP-APPS20 kernel: [44807.010040] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:01:32 CSTEP-APPS20 kernel: [44817.560056] sd 2:0:1:0: [sdb] CDB: Test Unit Ready: 00 00 00 00 00 00
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470035] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.720124] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:02:04 CSTEP-APPS20 kernel: [44849.512078] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380147] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380153] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380167] sd 2:0:1:0: rejecting I/O to offline device
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380285] sd 2:0:1:0: rejecting I/O to offline device
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380403] sd 2:0:1:0: [sdb] Unhandled error code
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380407] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380416] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380677] sd 2:0:1:0: [sdb] Unhandled error code
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380680] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380684] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380915] sd 2:0:1:0: rejecting I/O to offline device

syslog:

Mar  1 09:01:43 CSTEP-APPS20 kernel: [44827.860060] mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!!
Mar  1 09:01:43 CSTEP-APPS20 kernel: [44827.860070] mptbase: ioc0: Initiating recovery
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470023] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88016197b400)
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470030] mptscsih: ioc0: attempting task abort! (sc=ffff880156fa4c00)
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470035] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470050] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880156fa4c00)
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470073] scsi target2:0:0: Beginning Domain Validation
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.720120] mptscsih: ioc0: attempting target reset! (sc=ffff88016197b400)
Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.720124] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:02:04 CSTEP-APPS20 kernel: [44849.262008] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88016197b400)
Mar  1 09:02:04 CSTEP-APPS20 kernel: [44849.512073] mptscsih: ioc0: attempting bus reset! (sc=ffff88016197b400)
Mar  1 09:02:04 CSTEP-APPS20 kernel: [44849.512078] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:02:05 CSTEP-APPS20 kernel: [44850.046491] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88016197b400)
Mar  1 09:02:15 CSTEP-APPS20 kernel: [44860.553909] mptscsih: ioc0: attempting host reset! (sc=ffff88016197b400)
Mar  1 09:02:15 CSTEP-APPS20 kernel: [44860.553915] mptbase: ioc0: Initiating recovery
Mar  1 09:02:35 CSTEP-APPS20 kernel: [44879.870026] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88016197b400)
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380147] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380153] sd 2:0:1:0: Device offlined - not ready after error recovery
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380167] sd 2:0:1:0: rejecting I/O to offline device
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380285] sd 2:0:1:0: rejecting I/O to offline device
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380403] sd 2:0:1:0: [sdb] Unhandled error code
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380407] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380416] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 88 00 00 10 00
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380429] end_request: I/O error, dev sdb, sector 55297928
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380562] __ratelimit: 24 callbacks suppressed
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380566] raid1: sdb1: rescheduling sector 55295880
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380677] sd 2:0:1:0: [sdb] Unhandled error code
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380680] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380684] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 03 4b c7 c0 00 00 80 00
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380695] end_request: I/O error, dev sdb, sector 55297984
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380817] raid1: sdb1: rescheduling sector 55295936
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380915] sd 2:0:1:0: rejecting I/O to offline device
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.381019] end_request: I/O error, dev sdb, sector 63983488
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.381142] md: super_written gets error=-5, uptodate=0
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.381146] raid1: Disk failure on sdb1, disabling device.
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.381148] raid1: Operation continuing on 1 devices.
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.398144] scsi target2:0:0: Ending Domain Validation
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.398226] scsi target2:0:0: FAST-160 WIDE SCSI 320.0 MB/s DT IU RTI WRFLOW PCOMP (6.25 ns, offset 127)
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.398295] scsi target2:0:1: Beginning Domain Validation
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.648493] scsi target2:0:1: Domain Validation Initial Inquiry Failed
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.648623] scsi target2:0:1: Ending Domain Validation
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.648691] scsi target2:0:1: asynchronous
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.648760] scsi target2:0:8: Beginning Domain Validation
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.649386] scsi target2:0:8: Ending Domain Validation
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.649458] scsi target2:0:8: asynchronous
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.653384] RAID1 conf printout:
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.653390]  --- wd:1 rd:2
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.653395]  disk 0, wo:0, o:1, dev:sda1
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.653399]  disk 1, wo:1, o:0, dev:sdb1
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.693763] RAID1 conf printout:
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.693767]  --- wd:1 rd:2
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.693771]  disk 0, wo:0, o:1, dev:sda1
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.714266] raid1: sda1: redirecting sector 55295880 to another mirror
Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.719943] raid1: sda1: redirecting sector 55295936 to another mirror
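
As an aid to reading the CDB lines in the logs above, here is a small sketch that decodes a Read(10) command block. In a Read(10) CDB the first byte is the opcode 0x28, bytes 2-5 hold the logical block address and bytes 7-8 hold the transfer length, both big-endian. The function name is hypothetical.

```python
def decode_read10(cdb_hex):
    """Decode a SCSI Read(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a Read(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, blocks

if __name__ == "__main__":
    # The two CDBs that appear in the logs.
    print(decode_read10("28 00 03 4b c7 88 00 00 10 00"))  # (55297928, 16)
    print(decode_read10("28 00 03 4b c7 c0 00 00 80 00"))  # (55297984, 128)
```

The decoded addresses match the sectors in the "end_request: I/O error, dev sdb" lines, so the failing commands are ordinary reads. The raid1 "rescheduling sector" values are lower by 2048 in both cases, which is consistent with sdb1 starting at sector 2048, although that is an inference rather than something the logs state.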

Best Answer

It looks like device /dev/sdb is going offline. You might have a cabling problem, but it's just as likely to be the disk itself. A conflict between the disk firmware and the controller is certainly possible, too.
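
To make that reading of the logs concrete, here is a rough sketch that pulls the error-recovery escalation out of the syslog excerpt. The sample lines are copied from the question. The function name and regular expressions are illustrative and only cover the message formats seen in this excerpt.

```python
import re

STEP = re.compile(r"mptscsih: ioc\d+: (task abort|target reset|bus reset|host reset): (\w+)")
OFFLINE = re.compile(r"sd (\d+:\d+:\d+:\d+): Device offlined")

def recovery_summary(lines):
    """Return (escalation steps with results, devices that went offline)."""
    steps, offlined = [], []
    for line in lines:
        step = STEP.search(line)
        if step:
            steps.append((step.group(1), step.group(2)))
        gone = OFFLINE.search(line)
        if gone and gone.group(1) not in offlined:
            offlined.append(gone.group(1))
    return steps, offlined

SAMPLE_LINES = [
    "Mar  1 09:02:03 CSTEP-APPS20 kernel: [44848.470023] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88016197b400)",
    "Mar  1 09:02:04 CSTEP-APPS20 kernel: [44849.262008] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88016197b400)",
    "Mar  1 09:02:05 CSTEP-APPS20 kernel: [44850.046491] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88016197b400)",
    "Mar  1 09:02:35 CSTEP-APPS20 kernel: [44879.870026] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88016197b400)",
    "Mar  1 09:02:45 CSTEP-APPS20 kernel: [44890.380147] sd 2:0:1:0: Device offlined - not ready after error recovery",
]

if __name__ == "__main__":
    steps, offlined = recovery_summary(SAMPLE_LINES)
    print(steps)
    print(offlined)
```

Every step from task abort up to host reset reports SUCCESS, yet target 2:0:1:0 still does not answer afterwards. That pattern fits a device that stops responding on the bus, which is why the disk, the cabling and the firmware all remain suspects.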

I'd run the manufacturer's diagnostics on the disks immediately. Being brand new doesn't put them above suspicion of being defective. (In fact, because they're brand new, I'd suspect them a little more than disks that had been running for a few months.)