HP SmartArray P400i marks a good drive as failed: what can I do about it?

hard drive, hardware-raid, hp-proliant, hp-smart-array, raid

I have an HP ProLiant DL360 G5 server with a SmartArray P400i RAID controller. The server itself is pretty old, but it still works normally. The only issue is the RAID controller, which marks good drives as failed. It happens quite often, almost every day. Here is typical output from the ssacli utility:

# ssacli ctrl all show config
...
   Array A (SATA, Unused Space: 0  MB)

      logicaldrive 1 (931.5 GB, RAID 1, Interim Recovery Mode)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA HDD, 1 TB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA HDD, 1 TB, Failed)

Detailed information:

# ssacli ctrl slot=0 pd 1I:1:2 show detail

Smart Array P400i in Slot 0 (Embedded)

   Array A

      physicaldrive 1I:1:2
         Port: 1I
         Box: 1
         Bay: 2
         Status: Failed
         Last Failure Reason: Not ready bad sense
         Drive Type: Data Drive
         Interface Type: SATA
         Size: 1 TB
         Drive exposed to OS: False
         Logical/Physical Block Size: 512/512
         Firmware Revision: SN03
         Serial Number: ...
         WWID: ...
         Model: ATA     ST91000640NS
         SATA NCQ Capable: True
         SATA NCQ Enabled: True
         PHY Count: 1
         PHY Transfer Rate: 1.5Gbps
         Sanitize Erase Supported: False
         Shingled Magnetic Recording Support: None

After the server is rebooted, the RAID controller detects the drive again, marks it as OK, and rebuilds the array. The array works well until the next failure. I have no idea why this is happening. Is there any way to solve this problem without buying a new RAID controller or HBA? Software RAID is acceptable. Currently I see these options:

  1. Make a JBOD-like setup with two RAID 0 logical drives, each containing a single physical drive, but I don't know if it will help.
  2. Tune the RAID controller so it won't exclude failed drives from the array, but I don't know how to do this.

Best Answer

I think the drive is bad. You can check this by looking at the drive's SMART attributes.

When the RAID controller finds a read, write, or verification error on a drive, it marks the drive as FAILED. Meanwhile, the drive itself detects the error and starts its sector replacement procedure: it increments the current pending sectors counter and keeps trying to read the bad sector. Once a read succeeds, the disk writes the data to a preallocated spare sector, decrements the current pending sectors counter, and increments the reallocated sectors counter. Nonzero values in these SMART counters indicate that the disk has problems.

Once the sector has been reallocated, the RAID controller can rebuild the disk array successfully.

The disk also has SMART attributes that record data-transfer errors on the interface cable. A bad cable can produce the same symptoms in the RAID controller, but disk problems occur more often than cable problems.
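
As a rough illustration of which counters to look at, here is a minimal Python sketch that parses the attribute table printed by `smartctl -A` and reports the sector and interface counters that are nonzero. The sample text is made up for illustration and is not taken from the drive in the question; the function names `parse_raw_values` and `suspicious` are hypothetical. It assumes the attribute names smartmontools normally prints. Getting smartctl to reach a drive behind a Smart Array controller may need a device type option such as `-d cciss,N`, depending on the driver in use.

```python
import re

# SMART attributes relevant here: sector reallocation (media) and
# transfer errors on the link between controller and drive.
WATCHED = {
    "Reallocated_Sector_Ct": "media",
    "Current_Pending_Sector": "media",
    "Offline_Uncorrectable": "media",
    "UDMA_CRC_Error_Count": "cable/interface",
}

# Hypothetical sample of `smartctl -A` output (illustrative values only).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""


def parse_raw_values(smartctl_output):
    """Return {attribute_name: raw_value} from a `smartctl -A` table."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header line starts with "ID#" and is skipped.
        if len(fields) >= 10 and fields[0].isdigit():
            match = re.match(r"\d+", fields[9])
            if match:
                values[fields[1]] = int(match.group())
    return values


def suspicious(values):
    """Return the watched attributes whose raw value is above zero."""
    return {
        name: (values[name], WATCHED[name])
        for name in WATCHED
        if values.get(name, 0) > 0
    }


if __name__ == "__main__":
    print(suspicious(parse_raw_values(SAMPLE)))
```

Nonzero media counters point at the drive itself, while a growing CRC error count with clean media counters points more toward the cable or backplane.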

For more details, please read the S.M.A.R.T. article on Wikipedia.