Why isn’t the RAID array rebuilding?

hp, hp-proliant, hp-smart-array, raid

Got a notice last night that a drive failed on a server. Got in this morning to replace it, and we're getting the following. The controller config report for the array looks fine, apart from the unusual status Ready for Rebuild.

 ~ # hpacucli controller all show config
Smart Array P400i in Slot 0 (Embedded)    (sn: XXXXXXXX     )
   array A (SAS, Unused Space: 0 MB)
   logicaldrive 1 (341.7 GB, RAID 5, Ready for Rebuild)
   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 72 GB, OK)
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 72 GB, OK)
   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 72 GB, OK)
   physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 146 GB, OK)
   physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS, 72 GB, OK)
   physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS, 72 GB, OK)

The logical drive details offer one hint: Parity Initialization Status is Initialization Failed:

~ # hpacucli controller slot=0 logicaldrive 1 show 
Smart Array P400i in Slot 0 (Embedded)
   array A
      Logical Drive: 1
         Size: 341.7 GB
         Fault Tolerance: RAID 5
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 64 KB
         Full Stripe Size: 320 KB
         Status: Ready for Rebuild
         Array Accelerator: Enabled
         Parity Initialization Status: Initialization Failed
         Unique Identifier: XXXXXXX
         Disk Name: /dev/cciss/c0d0
         Mount Points: /boot 191 MB, / 28.6 GB
         OS Status: LOCKED
         Logical Drive Label: XXXXX     6797

Controller configuration, if it helps:

 ~ # /usr/sbin/hpacucli ctrl slot=0 show
Smart Array P400i in Slot 0 (Embedded)
   Bus Interface: PCI
   Slot: 0
   Serial Number: XXXXXXXX     
   Cache Serial Number: XXXXXXXX
   RAID 6 (ADG) Status: Enabled
   Controller Status: OK
   Hardware Revision: B
   Firmware Version: 1.18
   Rebuild Priority: Low
   Expand Priority: Low
   Surface Scan Delay: 15 secs
   Surface Scan Mode: Idle
   Post Prompt Timeout: 0 secs
   Cache Board Present: True
   Cache Status: OK
   Accelerator Ratio: 50% Read / 50% Write
   Drive Write Cache: Disabled
   Total Cache Size: 256 MB
   Total Cache Memory Available: 208 MB
   No-Battery Write Cache: Disabled
   Cache Backup Power Source: Batteries
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: OK
   SATA NCQ Supported: False

How do I go about debugging this?

Edit:

All of the individual drives appear fine:

~ # hpacucli controller all show config detail | grep Status
   RAID 6 (ADG) Status: Enabled
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK
      Status: OK
         Status: Ready for Rebuild
         Parity Initialization Status: Initialization Failed
         OS Status: LOCKED
         Status: OK
         Status: OK
         Status: OK
         Status: OK
         Status: OK
         Status: OK
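
The grep above drops the headings, so it is hard to tell which Status line belongs to the array, the logical drive, or a particular physical drive. A minimal sketch of a filter that keeps that context follows. It assumes the heading lines in the detail output begin with Array:, Logical Drive:, or physicaldrive; the function name label_status is made up for illustration and is not part of hpacucli.

```shell
# Hypothetical helper: prefix each "Status:" line with the most recent heading.
# Assumption: headings start with "Array:", "Logical Drive:" or "physicaldrive ".
label_status() {
    awk '
        # Remember the latest heading line, stripped of indentation.
        /^[[:space:]]*(Array:|Logical Drive:|physicaldrive )/ {
            sub(/^[[:space:]]+/, "")
            ctx = $0
            next
        }
        # Print every status line together with the heading it falls under.
        # Lines seen before any heading are attributed to the controller.
        /Status:/ {
            sub(/^[[:space:]]+/, "")
            printf "%s | %s\n", (ctx == "" ? "controller" : ctx), $0
        }
    '
}
```

It would be used as a drop-in replacement for the grep: hpacucli controller all show config detail | label_status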

Edit 2:

I'm debugging some adverse interactions between hpacucli and grsec (also mp-SSH and Ubuntu), but we do have hpacucli diag results available, and buried in the Logical Drive Status Flags is Rebuild Aborted From Read Error. What confuses me is how a read error during a rebuild can stop the rebuild without also marking one of the drives as predictive failure, or worse.

Best Answer

Ready for Rebuild is a bad status if you're using a parity RAID level, like 5 or 6. It means that you likely have read errors on another drive in the array, i.e. another failing drive.
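
To see why a read error on a second drive stops a RAID 5 rebuild, here is a toy sketch of the parity arithmetic. It uses single numbers in place of 64 KB strips, and the function names are made up for illustration; it is not how the controller firmware is implemented, only the general XOR principle behind RAID 5.

```shell
# Parity for a stripe is the XOR of all its data strips.
stripe_parity() {
    p=0
    for strip in "$@"; do
        p=$(( p ^ strip ))
    done
    echo "$p"
}

# A missing strip is the XOR of every surviving strip plus the parity.
# If any one of those inputs cannot be read, there is nothing to compute from.
rebuild_strip() {
    stripe_parity "$@"
}
```

For example, with data strips 90, 60 and 240, the parity is stripe_parity 90 60 240. If the drive holding 60 is replaced, rebuild_strip 90 240 plus that parity value gives back 60, but only because both surviving strips and the parity were readable. One unreadable sector on any remaining drive leaves that stripe unrecoverable, which is consistent with the Rebuild Aborted From Read Error flag.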

If the system is still online, your best option is to recover the data or rebuild. There's no good fix for this, and there's definitely not much you can do to debug it.

See the following:

Force LUN in a HP Smart Array to rebuild

HP Proliant ML350 G5 SAS HDD

HP SmartArray P400: How to repair failed logical drive?

And of course: RAID-5: Two disks failed simultaneously?