3Ware 9650SE RAID-6, two degraded drives, one ECC, rebuild stuck

Tags: 3ware, raid, raid6

This morning I came into the office to discover that two of the drives in a RAID-6 array on a 3ware 9650SE controller were marked as DEGRADED, and the controller was rebuilding the array. After getting to about 4%, it hit ECC errors on a third drive (this may have happened when I attempted to access the filesystem on this RAID and got I/O errors from the controller). Now I'm in this state:

> /c2/u1 show

Unit     UnitType  Status         %RCmpl  %V/I/M  Port  Stripe  Size(GB)
------------------------------------------------------------------------
u1       RAID-6    REBUILDING     4%(A)   -       -     64K     7450.5    
u1-0     DISK      OK             -       -       p5    -       931.312   
u1-1     DISK      OK             -       -       p2    -       931.312   
u1-2     DISK      OK             -       -       p1    -       931.312   
u1-3     DISK      OK             -       -       p4    -       931.312   
u1-4     DISK      OK             -       -       p11   -       931.312   
u1-5     DISK      DEGRADED       -       -       p6    -       931.312   
u1-6     DISK      OK             -       -       p7    -       931.312   
u1-7     DISK      DEGRADED       -       -       p3    -       931.312   
u1-8     DISK      WARNING        -       -       p9    -       931.312   
u1-9     DISK      OK             -       -       p10   -       931.312   
u1/v0    Volume    -              -       -       -     -       7450.5    

Examining the SMART data on the three drives in question, the two that are DEGRADED are in good shape (PASSED without any Current_Pending_Sector or Offline_Uncorrectable errors), but the drive listed as WARNING has 24 uncorrectable sectors.
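For anyone wanting to check the same thing, SMART data on drives behind a 9650SE can be read through the controller with smartctl's 3ware pass-through mode. A rough sketch (the device node and the disk number are assumptions; match them to your own controller and ports):

```shell
# Query SMART data for the disk behind 3ware port 9 (the WARNING drive here).
# On Linux with the 3w-9xxx driver the character device is typically /dev/twa0;
# adjust the device and the ",9" disk number for your system.
smartctl -a -d 3ware,9 /dev/twa0

# The attributes worth watching for a failing drive:
#   Current_Pending_Sector  - sectors waiting to be reallocated
#   Offline_Uncorrectable   - sectors the drive could not read
smartctl -a -d 3ware,9 /dev/twa0 | grep -E 'Pending_Sector|Uncorrectable'
```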

And, the "rebuild" has been stuck at 4% for ten hours now.

So:

How do I get it to actually start rebuilding? This particular controller doesn't appear to support /c2/u1 resume rebuild, and the only rebuild command available is one that wants to know which disk to add (/c2/u1 start rebuild disk=&lt;p:-p...&gt; [ignoreECC], according to the help). I have two hot spares in the server, and I'm happy to engage them, but I don't understand what the controller would do with that information in its current state.
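For what it's worth, the syntax from the controller's own help implies that the rebuild command takes the port of a replacement disk, so engaging a spare would look something like the following. This is only a sketch: port p8 is a guess for one of the hot spares, so check your actual port assignments first.

```shell
# List all ports on controller 2 to see which ports hold the hot spares
tw_cli /c2 show

# Per the help output, start a rebuild using a specific disk; the optional
# ignoreECC flag tells the controller to press on past unreadable sectors
tw_cli /c2/u1 start rebuild disk=p8 ignoreECC
```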

Can I pull out the drive that is demonstrably failing (the WARNING drive), when I have two DEGRADED drives in a RAID-6? It seems to me that the best scenario would be for me to pull the WARNING drive and tell it to use one of my hot spares in the rebuild. But won't I kill the thing by pulling a "good" drive in a RAID-6 with two DEGRADED drives?

Finally, I've seen references in other posts to a bad bug in this controller that causes good drives to be marked as bad, and claims that upgrading the firmware may help. Is flashing the firmware a risky operation given the situation? Is it likely to help or hurt with respect to the rebuilding-but-stuck-at-4% array? Am I seeing this bug in action?

Advice outside the spiritual would be much appreciated. Thanks.

Best Answer

I managed to get the RAID to rebuild by issuing the following command in tw_cli without pulling any drives or rebooting the system:

/c2/u1 set ignoreECC=on

The rebuild didn't proceed immediately, but at 2 AM the morning after I made this change, the rebuild started and about 6 hours later, it was complete. The drive with ECC errors had 24 bad sectors that have now been overwritten and reallocated by the drive (according to the SMART data). The filesystem seems intact, but I won't be surprised if I hit errors when I get to whatever data was on those sectors.

In any case, I'm much better off than I was before, and will likely be able to recover the majority of the data. Once I've gotten what I can, I'll pop out the drive that's failing and have it rebuild onto a hot spare.
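In case it helps anyone in the same spot, the cleanup sequence I have in mind looks roughly like this. The port number is from my unit listing above, and while remove is a standard tw_cli subcommand, double-check against your own controller's help before running any of it:

```shell
# Turn strict ECC handling back on once the forced rebuild is done
tw_cli /c2/u1 set ignoreECC=off

# Take the failing WARNING drive (port p9) offline...
tw_cli /c2/p9 remove

# ...at which point the unit degrades and the controller should start
# rebuilding onto an available hot spare automatically. Watch progress:
tw_cli /c2/u1 show
```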