3Ware RAID6 array sometimes hanging. Undetected broken disk


We have a Debian server with a 3Ware 9650SE 8-port RAID controller and a 5-disk RAID6 array, acting as a virtual machine host; all guests run Linux. Problems keep occurring and I suspect an undetected broken disk.

We have had several crashes now where both host and all guests are saying that the IO system blocked for 120 seconds or more. We suspected a faulty RAID controller, but we replaced it with an identical one with identical firmware, which didn't fix it. I didn't think it would, because a second RAID1 array kept working properly.

Almost a week ago (Sunday), when this was acting up, the auto verify was at 66%. Last night (Friday morning) it was at 67%, both before and after rebooting, and both times while we were experiencing problems. When I stopped the verify with tw_cli /c0/u0 stop verify, things became responsive again.

I suspect it got stuck on a disk fault at around 66%. An auto verify starts on Saturday:

# tw_cli /c0 show verify
/c0 basic verify weekly preferred start: Saturday, 12:00AM

and would normally be long done by Friday. Seeing as how Sunday was 66% and Friday was 67%, it's unlikely to be coincidence.
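
To tell a stalled verify from a merely slow one, the progress can be sampled at intervals. This is a minimal sketch, assuming a running verify appears in the %V/I/M column (the fifth field) of tw_cli /c0 show, in the same layout as the unit table in this post. The function names unit_progress and log_progress are made up for illustration.

```shell
#!/bin/sh
# Print the %V/I/M column (5th field) for one unit, reading
# `tw_cli /c0 show` output on stdin. Prints "-" when idle.
unit_progress() {
    awk -v unit="$1" '$1 == unit { print $5; exit }'
}

# Log a timestamped progress value every N seconds (default 600).
# Needs the real controller; runs until interrupted.
log_progress() {
    while :; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" \
            "$(tw_cli /c0 show | unit_progress "$1")"
        sleep "${2:-600}"
    done
}
```

Running something like log_progress u0 600 >> verify.log for a few hours would show whether the percentage moves at all.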

'smartctl -a -d 3ware,0 /dev/twa0' and 'smartctl -t long' (long SMART self test) on all the drives didn't reveal any errors. Neither does tw_cli /c0 show alarms.
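
Checking every drive by hand is error-prone, so the port numbers can be taken from the controller listing instead. This is a rough sketch, assuming the VPort number in tw_cli /c0 show is the same number smartctl expects after "3ware,". The helper names unit_ports and smart_all are hypothetical.

```shell
#!/bin/sh
# List the port numbers (without the leading "p") that belong to a unit,
# reading `tw_cli /c0 show` output on stdin.
unit_ports() {
    awk -v unit="$1" '$1 ~ /^p[0-9]+$/ && $3 == unit { print substr($1, 2) }'
}

# Print the SMART report for every port of the given unit.
# Needs the real hardware; only ports that exist are queried.
smart_all() {
    for port in $(tw_cli /c0 show | unit_ports "$1"); do
        echo "=== port $port ==="
        smartctl -a -d 3ware,"$port" /dev/twa0
    done
}
```

For the array in question this would be called as smart_all u0.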

I suspect a disk is broken in a way that is hard to detect. To test this, I took each drive out of the array one by one, created a 'single' unit from it, and filled it with zeros using dd. No disk showed errors.

Any advice on how to track this down?

Edit:

this is the layout:

# tw_cli /c0 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    OK             -       -       256K    5587.9    RiW    OFF    
u1    SPARE     OK             -       -       -       1863.01   -      OFF    
u2    RAID-1    OK             -       -       -       1862.63   RiW    ON     

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u0   1.82 TB   SATA  0   -            ST32000542AS        
p1    OK             u0   1.82 TB   SATA  1   -            ST32000542AS        
p2    OK             u0   1.82 TB   SATA  2   -            ST32000542AS        
p3    OK             u0   1.82 TB   SATA  3   -            ST32000542AS        
p4    OK             u0   1.82 TB   SATA  4   -            ST32000542AS        
p5    OK             u1   1.82 TB   SATA  5   -            WDC WD2002FYPS-02W3 
p6    OK             u2   1.82 TB   SATA  6   -            WDC WD2002FYPS-02W3 
p7    OK             u2   1.82 TB   SATA  7   -            WDC WD2002FYPS-02W3 

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       0      xx-xxx-xxxx

The unit in question is u0.

edit2:

tw_cli /c0 show diag shows something interesting (edit3: this is harmless, I found out it's caused by calling smartctl -a -d 3ware,X /dev/twa0 where X is an invalid port):

QueueAtaPassthrough() called with invalid TargetHandle: 0x17, portHandle: 0xFF

Legacy opcode=0xB1 error=0x10E

E=010E T=14:15:51     : Invalid operation for specified port
E=010E T=14:15:51 U=0 : Return error status to host
Error, Unit 23: Invalid operation for specified port
(EC:0x10e, SK=0x05, ASC=0x24, ASCQ=0x00, SEV=01, Type=0x70)
No additional sense data
Error, Unit 23: 0x10E OVERRIDDEN due to invalid sense buffer descriptor
sense buffer: len=0, address=0x414ca2c7c
Send AEN (code, time): 0031h, 06/21/2013 14:26:16
Synchronize host/controller time
(EC:0x31, SK=0x00, ASC=0x00, ASCQ=0x00, SEV=04, Type=0x71)

I get tons of these. I have no idea what it means though. I can't even make out which unit or port it is. (edit3: I do know now, it's harmless).

Given my edit3, I'm back to square one. Nothing indicates a disk is broken, except that the verify hangs at 66% and causes the array to hang, which also sometimes happens randomly. I wish the verify would find the fault…

Best Answer

2 things that were not brought up so far:

1. Is this a SATA RAID controller? If so, SATA cables are prone to aging, and replacing them might fix this kind of problem easily. It is worth trying whenever disk errors, lags, or timeouts occur while the SMART values are all fine and the drive passes all self-tests. Unfortunately, finding a good SATA cable vendor is difficult.
2. 3Ware RAID controllers are old and unsupported these days. You will get neither firmware upgrades nor spare parts. If your controller dies, the RAID might be unrecoverable without a matching controller AND firmware, and an expensive data recovery would then be needed.

Related Solutions

Extending a live 3Ware RAID6 array in Linux with tw_cli

I may have figured it out. I had the opportunity to test it on a spare machine:

I first had this:

# tw_cli /c0 show
Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    OK             -       -       256K    5587.9    RiW    ON

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u0   1.82 TB   SATA  0   -            ST32000542AS
p1    OK             u0   1.82 TB   SATA  1   -            ST32000542AS
p2    OK             u0   1.82 TB   SATA  2   -            ST32000542AS
p3    OK             u0   1.82 TB   SATA  3   -            ST32000542AS
p4    OK             -    1.82 TB   SATA  4   -            ST32000542AS

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       0      xx-xxx-xxxx

A 4-disk RAID 5 and one extra disk.

Then I did this:

# tw_cli /c0/u0 migrate type=raid5 disk=4
Sending migration message to /c0/u0 ... Done.

Then I had this:

# tw_cli /c0/u0 show

Unit     UnitType  Status         %RCmpl  %V/I/M  Port  Stripe  Size(GB)
------------------------------------------------------------------------
u0       Migrator  MIGRATING      -       0%      -     -       -

su0      RAID-5    OK             -       -       -     256K    5587.9
su0-0    DISK      OK             -       -       p0    -       1862.63
su0-1    DISK      OK             -       -       p1    -       1862.63
su0-2    DISK      OK             -       -       p2    -       1862.63
su0-3    DISK      OK             -       -       p3    -       1862.63
su0/v0   Volume    -              -       -       -     -       50
su0/v1   Volume    -              -       -       -     -       5537.9

du0      RAID-5    OK             -       -       -     256K    7450.54
du0-0    DISK      OK             -       -       p0    -       1862.63
du0-1    DISK      OK             -       -       p1    -       1862.63
du0-2    DISK      OK             -       -       p2    -       1862.63
du0-3    DISK      OK             -       -       p3    -       1862.63
du0-4    DISK      OK             -       -       p4    -       1862.63
du0/v0   Volume    -              -       -       -     -       N/A
du0/v1   Volume    -              -       -       -     -       N/A

su0 and du0 are probably source and destination, giving me a new and bigger u0 at the end. I would think that du0/v0 and du0/v1 will become active when the migration is done. But this is going to take a week to migrate and I don't know if I have the patience for that...
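
A rough estimate of the remaining time can be made from two readings of the %V/I/M column. This is only a sketch, and it assumes the migration progresses at a constant rate, which may not hold. The function name eta_hours is made up for illustration.

```shell
#!/bin/sh
# eta_hours PCT_THEN PCT_NOW ELAPSED_SECONDS
# Linear extrapolation of the remaining hours; prints "unknown"
# when the percentage has not advanced between the two readings.
eta_hours() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN {
        if ((b + 0) <= (a + 0)) { print "unknown"; exit }
        printf "%.1f\n", (100 - b) * s / (b - a) / 3600
    }'
}
```

For example, eta_hours 0 1 6048 extrapolates from a 1% advance in 6048 seconds.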

FreeBSD: how to associate disk serial numbers to device names when using a 3ware SATA card

For the twaX to daX config look in dmesg.boot:

$ cat /var/run/dmesg.boot
da0 at twa0 bus 0 scbus0 target 0 lun 0
da0: <AMCC 9690SA-4I  DISK 4.10> Fixed Direct Access SCSI-5 device 
da0: 100.000MB/s transfers
da0: 476827MB (976541696 512 byte sectors: 255H 63S/T 60786C)
da1 at twa0 bus 0 scbus0 target 1 lun 0
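
On a box with many units, the attach lines can be pulled out of dmesg.boot automatically. This is a minimal sketch, assuming the lines look like the "da0 at twa0 bus 0 scbus0 target 0 lun 0" lines above. The function name da_to_twa is hypothetical.

```shell
#!/bin/sh
# Print "daN twaM target T" for every da device attached to a twa
# controller, reading dmesg output on stdin.
da_to_twa() {
    awk '$2 == "at" && $1 ~ /^da[0-9]+$/ && $3 ~ /^twa[0-9]+$/ {
        for (i = 4; i < NF; i++)
            if ($i == "target") { print $1, $3, "target", $(i + 1); break }
    }'
}
```

It would be run as da_to_twa < /var/run/dmesg.boot.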

For the 'can I get these before I build an array' part:

On my FreeBSD boxes I install 3dm2:

$ pkg_info 3dm\* 
Information for 3dm-2.09.01.004_1,1:

Comment:
3ware RAID controller monitoring daemon and web server


Description:
3DM 2 provides a web interface to remotely create, manage and monitor
your 3ware RAID arrays. In the event of a hardware failure, 3DM 2 can
automatically notify you via email.

WWW: http://www.3ware.com/support/

Once you have 3dm2 set up, you can log in to the web interface, and under Information::Drive Information it will show:

 Extra Drive Info (Controller ID 0 - VPort 0)
  Drive Type    SATA
  Serial #  9YY0XX4N
