I have a RAID bus controller: 3ware Inc 9550SX SATA-II RAID PCI-X
with four disks, with the following current state:
tw_cli> /c1 show
Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-5 REBUILD-PAUSED 0% - 256K 931.303 OFF OFF
u1 SPARE OK - - - 465.753 - OFF
Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 465.76 GB 976773168 WD-WCAS87320631
p1 OK u0 465.76 GB 976773168 WD-WCAS87223554
p2 DEGRADED u0 465.76 GB 976773168 WD-WCAS87159042
p3 OK u1 465.76 GB 976773168 WD-WMAYP6812676
p4 NOT-PRESENT - - - -
p5 NOT-PRESENT - - - -
p6 NOT-PRESENT - - - -
p7 NOT-PRESENT - - - -
Rebuilding is enabled. Somethimes it starts (Status: REBUILDING
), seemingly does things for a minute or so, then falls back to REBUILD-PAUSED
. The %RCmpl
never goes over 0%. Log (/var/log/messages
) says in about every five minutes:
Dec 5 23:41:57 somelinux kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x000B): Rebuild started:unit=0.
Dec 5 23:42:30 somelinux kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x003A): Drive power on reset detected:port=1.
Dec 5 23:42:30 somelinux kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:port=1.
Dec 5 23:42:30 somelinux kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:port=1.
I'm new to this hardware, and I inherited the machine and the maintaining task. What could it indicate? How big is the trouble I have? What should I do?
New events
Dec 6 00:25:42 somelinux kernel: sd 1:0:0:0: Device not ready: <6>: Current<4>3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:port=1.
Dec 6 00:25:42 somelinux kernel: : sense key=0x2
Dec 6 00:25:42 somelinux kernel: ASC=0x4 ASCQ=0x0
Dec 6 00:25:42 somelinux kernel: end_request: I/O error, dev sdc, sector 144738143
Dec 6 00:25:42 somelinux kernel: sd 1:0:0:0: Device not ready: <6>: Current: sense key=0x2
Dec 6 00:25:42 somelinux kernel: ASC=0x4 ASCQ=0x0
Dec 6 00:25:42 somelinux kernel: end_request: I/O error, dev sdc, sector 144738143
Dec 6 00:25:43 somelinux kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x001E): Unit inoperable:unit=0.
Dec 6 00:28:02 somelinux kernel: sd 1:0:0:0: Device not ready: <6>: Current: sense key=0x2
Dec 6 00:28:02 somelinux kernel: ASC=0x4 ASCQ=0x0
Dec 6 00:28:02 somelinux kernel: end_request: I/O error, dev sdc, sector 104927621
Dec 6 00:28:02 somelinux kernel: xfs_force_shutdown(dm-0,0x2) called from line 956 of file fs/xfs/xfs_log.c. Return address = 0xc028860d
… and …
tw_cli> /c1 show
Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-5 INOPERABLE - - 256K 931.303 OFF OFF
u1 SPARE OK - - - 465.753 - OFF
Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 465.76 GB 976773168 WD-WCAS87320631
p1 NOT-PRESENT - - - -
p2 OK u0 465.76 GB 976773168 WD-WCAS87159042
p3 OK u1 465.76 GB 976773168 WD-WMAYP6812676
p4 NOT-PRESENT - - - -
p5 NOT-PRESENT - - - -
p6 NOT-PRESENT - - - -
p7 NOT-PRESENT - - - -
It seems that p1 is in really bad shape.
Folow up
It always worked for some minutes / hours before becoming INOPERABLE. That way I managed to make a backup of the data. I was very lucky. I learned that I need to pay closer attention, otherwise there is no point in having redundant storage.
Deleted the old array. Removed the faulty disk. Defined a new array with 3 good members. Recreated file-systems. Restored backups. Happy end.
Best Answer
Brace yourself.
Your RAID 5 is dead:
That's also the reason for the SCSI / I/O errors. Your RAID 5 isn't 4 disks; it's only 3. The fourth disk, p3, is in its own unit, u1, not the primary unit, u0.
Judging from the text you've provided, here's what probably happened:
The fact that p2 is now showing "OK" is irrelevant in relation to the status of the RAID 5.
I hope this server has backups, because it's unlikely you'll be able to recover this. I don't believe tw_cli supports forcing an array online, either. While the following won't help you retrieve data from this failed array, here's what I recommend:
Whoever set this up as a RAID 5 with a spare (it isn't setup properly, either) wasn't the brightest.