Linux – 3ware: Drive power on reset when trying to rebuild

Tags: 3ware, linux

I have a 3ware Inc 9550SX SATA-II RAID PCI-X bus controller with four disks, currently in the following state:

tw_cli> /c1 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    REBUILD-PAUSED 0%      -       256K    931.303   OFF    OFF
u1    SPARE     OK             -       -       -       465.753   -      OFF

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     465.76 GB   976773168     WD-WCAS87320631
p1     OK               u0     465.76 GB   976773168     WD-WCAS87223554
p2     DEGRADED         u0     465.76 GB   976773168     WD-WCAS87159042
p3     OK               u1     465.76 GB   976773168     WD-WMAYP6812676
p4     NOT-PRESENT      -      -           -             -
p5     NOT-PRESENT      -      -           -             -
p6     NOT-PRESENT      -      -           -             -
p7     NOT-PRESENT      -      -           -             -

Rebuilding is enabled. Sometimes it starts (Status: REBUILDING), appears to work for a minute or so, then falls back to REBUILD-PAUSED; %RCmpl never gets past 0%. The log (/var/log/messages) reports roughly every five minutes:

Dec  5 23:41:57 somelinux kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x000B): Rebuild started:unit=0.
Dec  5 23:42:30 somelinux kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x003A): Drive power on reset detected:port=1.
Dec  5 23:42:30 somelinux kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:port=1.
Dec  5 23:42:30 somelinux kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:port=1.

I'm new to this hardware, and I inherited the machine along with the task of maintaining it. What could this indicate? How much trouble am I in? What should I do?
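
In case it helps: I understand SMART data can be read through the controller with smartctl's 3ware support (I'm guessing at the exact invocation; /dev/twa0 is the usual device node for the 3w-9xxx driver, and the port number is the one reporting resets):

# Query SMART data for the drive on controller port 1,
# the one logging "Drive power on reset".
smartctl -a -d 3ware,1 /dev/twa0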


New events

Dec  6 00:25:42 somelinux kernel: sd 1:0:0:0: Device not ready: <6>: Current<4>3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:port=1.
Dec  6 00:25:42 somelinux kernel: : sense key=0x2
Dec  6 00:25:42 somelinux kernel: ASC=0x4 ASCQ=0x0
Dec  6 00:25:42 somelinux kernel: end_request: I/O error, dev sdc, sector 144738143
Dec  6 00:25:42 somelinux kernel: sd 1:0:0:0: Device not ready: <6>: Current: sense key=0x2
Dec  6 00:25:42 somelinux kernel: ASC=0x4 ASCQ=0x0
Dec  6 00:25:42 somelinux kernel: end_request: I/O error, dev sdc, sector 144738143
Dec  6 00:25:43 somelinux kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x001E): Unit inoperable:unit=0.
Dec  6 00:28:02 somelinux kernel: sd 1:0:0:0: Device not ready: <6>: Current: sense key=0x2
Dec  6 00:28:02 somelinux kernel: ASC=0x4 ASCQ=0x0
Dec  6 00:28:02 somelinux kernel: end_request: I/O error, dev sdc, sector 104927621
Dec  6 00:28:02 somelinux kernel: xfs_force_shutdown(dm-0,0x2) called from line 956 of file fs/xfs/xfs_log.c.  Return address = 0xc028860d

… and …

tw_cli> /c1 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    INOPERABLE     -       -       256K    931.303   OFF    OFF
u1    SPARE     OK             -       -       -       465.753   -      OFF

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     465.76 GB   976773168     WD-WCAS87320631
p1     NOT-PRESENT      -      -           -             -
p2     OK               u0     465.76 GB   976773168     WD-WCAS87159042
p3     OK               u1     465.76 GB   976773168     WD-WMAYP6812676
p4     NOT-PRESENT      -      -           -             -
p5     NOT-PRESENT      -      -           -             -
p6     NOT-PRESENT      -      -           -             -
p7     NOT-PRESENT      -      -           -             -

It seems that p1 is in really bad shape.


Follow-up

The unit always kept working for some minutes or hours before becoming INOPERABLE again, and that was enough to make a backup of the data. I was very lucky. The lesson: I need to pay closer attention to monitoring, otherwise there is no point in having redundant storage.

I deleted the old array, removed the faulty disk, defined a new array with the three good members, recreated the file systems and restored the backups (roughly the sequence sketched below). Happy ending.
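
For the record, the recovery looked roughly like this (a sketch only; exact tw_cli syntax varies by version, and the /dev/sdb device name for the new unit is an assumption):

# Delete the dead RAID-5 unit and the spare unit to free all ports.
tw_cli /c1/u0 del
tw_cli /c1/u1 del
# (physically remove the faulty p1 drive, then re-detect)
tw_cli /c1 rescan
# Create a new RAID 5 from the three good drives (ports 0, 2 and 3).
tw_cli /c1 add type=raid5 disk=0:2:3
# Recreate the file system on the new unit, mount it and restore.
mkfs.xfs /dev/sdb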

Best Answer

Brace yourself.

Your RAID 5 is dead:

u0    RAID-5    INOPERABLE     -       -       256K    931.303   OFF    OFF

That's also the reason for the SCSI / I/O errors. Your RAID 5 isn't 4 disks; it's only 3. The fourth disk, p3, is in its own unit, u1, not the primary unit, u0.

Judging from the text you've provided, here's what probably happened:

  1. p2 became degraded, and a rebuild was attempted
  2. During the rebuild, p1 stopped being detected
  3. The RAID 5 failed, since two of its three drives were not working/detected

The fact that p2 is now showing "OK" has no bearing on the status of the RAID 5.

I hope this server has backups, because it's unlikely you'll be able to recover this. I don't believe tw_cli supports forcing an array online, either. While the following won't help you retrieve data from this failed array, here's what I recommend:

  1. Replace the failed/missing drive (p1)
  2. As the card doesn't support RAID 6 (which would be the recommendation for drives this size), go with RAID 10 instead: recreate the unit as RAID 10, create the partitions, format and mount them, and update /etc/fstab (see the sketch after this list).
  3. Restore from the backups I hope you have
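
A rough sketch of step 2 (tw_cli syntax varies by version; this assumes the replacement drive shows up on port 1 and the new unit appears as /dev/sdb):

# Remove the dead RAID-5 unit and the spare unit so all four ports are free.
tw_cli /c1/u0 del
tw_cli /c1/u1 del
# Create a four-drive RAID 10 across ports 0-3.
tw_cli /c1 add type=raid10 disk=0:1:2:3
# Partition the new unit (e.g. with fdisk), format the first partition,
# then mount it and add it to /etc/fstab.
mkfs.xfs /dev/sdb1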

Whoever set this up as a RAID 5 with a spare (which isn't set up properly, either) wasn't the brightest.