Debian – Linux mdraid RAID 6, disks drop out randomly every few days

Tags: debian, mdadm, raid, raid6, ssd

I have some servers running Debian 8 with 8x 800GB SSDs configured as RAID 6. All disks are connected to an LSI-3008 flashed to IT mode. Each server also has a 2-disk RAID 1 pair for the OS.

Current state

# dpkg -l|grep mdad
ii  mdadm                          3.3.2-5+deb8u1              amd64        tool to administer Linux MD arrays (software RAID)

# uname -a
Linux R5U32-B 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08) x86_64 GNU/Linux

# more /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md2 : active raid6 sde1[1](F) sdg1[3] sdf1[2] sdd1[0] sdh1[7] sdb1[6] sdj1[5] sdi1[4]
      4687678464 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/7] [U_UUUUUU]
      bitmap: 3/6 pages [12KB], 65536KB chunk

md1 : active (auto-read-only) raid1 sda5[0] sdc5[1]
      62467072 blocks super 1.2 [2/2] [UU]
        resync=PENDING

md0 : active raid1 sda2[0] sdc2[1]
      1890881536 blocks super 1.2 [2/2] [UU]
      bitmap: 2/15 pages [8KB], 65536KB chunk

unused devices: <none>

# mdadm --detail /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Fri Jun 24 04:35:18 2016
     Raid Level : raid6
     Array Size : 4687678464 (4470.52 GiB 4800.18 GB)
  Used Dev Size : 781279744 (745.09 GiB 800.03 GB)
   Raid Devices : 8
  Total Devices : 8
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Jul 19 17:36:15 2016
          State : active, degraded
 Active Devices : 7
Working Devices : 7
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : R5U32-B:2  (local to host R5U32-B)
           UUID : 24299038:57327536:4db96d98:d6e914e2
         Events : 2514191

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       2       0        0        2      removed
       2       8       81        2      active sync   /dev/sdf1
       3       8       97        3      active sync   /dev/sdg1
       4       8      129        4      active sync   /dev/sdi1
       5       8      145        5      active sync   /dev/sdj1
       6       8       17        6      active sync   /dev/sdb1
       7       8      113        7      active sync   /dev/sdh1

       1       8       65        -      faulty   /dev/sde1

Problem

The RAID 6 array degrades semi-regularly, every 1-3 days or so. The reason is that one of its disks (a different one each time) shows up as faulty with the following error:

# dmesg -T
[Sat Jul 16 05:38:45 2016] sd 0:0:3:0: attempting task abort! scmd(ffff8810350cbe00)
[Sat Jul 16 05:38:45 2016] sd 0:0:3:0: [sde] CDB:
[Sat Jul 16 05:38:45 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Sat Jul 16 05:38:45 2016] scsi target0:0:3: handle(0x000d), sas_address(0x500304801707a443), phy(3)
[Sat Jul 16 05:38:45 2016] scsi target0:0:3: enclosure_logical_id(0x500304801707a47f), slot(3)
[Sat Jul 16 05:38:46 2016] sd 0:0:3:0: task abort: SUCCESS scmd(ffff8810350cbe00)
[Sat Jul 16 05:38:46 2016] end_request: I/O error, dev sde, sector 2064
[Sat Jul 16 05:38:46 2016] md: super_written gets error=-5, uptodate=0
[Sat Jul 16 05:38:46 2016] md/raid:md2: Disk failure on sde1, disabling device.md/raid:md2: Operation continuing on 7 devices.
[Sat Jul 16 05:38:46 2016] RAID conf printout:
[Sat Jul 16 05:38:46 2016]  --- level:6 rd:8 wd:7
[Sat Jul 16 05:38:46 2016]  disk 0, o:1, dev:sdd1
[Sat Jul 16 05:38:46 2016]  disk 1, o:0, dev:sde1
[Sat Jul 16 05:38:46 2016]  disk 2, o:1, dev:sdf1
[Sat Jul 16 05:38:46 2016]  disk 3, o:1, dev:sdg1
[Sat Jul 16 05:38:46 2016]  disk 4, o:1, dev:sdi1
[Sat Jul 16 05:38:46 2016]  disk 5, o:1, dev:sdj1
[Sat Jul 16 05:38:46 2016]  disk 6, o:1, dev:sdb1
[Sat Jul 16 05:38:46 2016]  disk 7, o:1, dev:sdh1
[Sat Jul 16 05:38:46 2016] RAID conf printout:
[Sat Jul 16 05:38:46 2016]  --- level:6 rd:8 wd:7
[Sat Jul 16 05:38:46 2016]  disk 0, o:1, dev:sdd1
[Sat Jul 16 05:38:46 2016]  disk 2, o:1, dev:sdf1
[Sat Jul 16 05:38:46 2016]  disk 3, o:1, dev:sdg1
[Sat Jul 16 05:38:46 2016]  disk 4, o:1, dev:sdi1
[Sat Jul 16 05:38:46 2016]  disk 5, o:1, dev:sdj1
[Sat Jul 16 05:38:46 2016]  disk 6, o:1, dev:sdb1
[Sat Jul 16 05:38:46 2016]  disk 7, o:1, dev:sdh1
[Sat Jul 16 12:40:00 2016] sd 0:0:7:0: attempting task abort! scmd(ffff88000d76eb00)

Already tried

I have already tried the following, with no improvement:

  • Increase /sys/block/md2/md/stripe_cache_size from 256 to 16384
  • Increase dev.raid.speed_limit_min from 1000 to 50000
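
For reference, the two changes were applied roughly like this (the sysfs path and sysctl key are the ones listed above; neither change persists across reboots):

# Enlarge the md2 stripe cache (number of entries, default 256):
echo 16384 > /sys/block/md2/md/stripe_cache_size

# Raise the minimum RAID resync/rebuild speed (KB/s per device):
sysctl -w dev.raid.speed_limit_min=50000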

Need your help

Are these errors caused by the mdadm configuration, the kernel, or the controller?

Update 20160802

Following the advice of ppetraki and others, I tried:

  • Use raw disks instead of partitions

    This did not solve the issue.

  • Decrease the chunk size

    The chunk size was reduced to 128KB and then to 64KB, but the RAID volume still degraded within a few days, with dmesg showing errors similar to the ones above. I have not yet tried reducing the chunk size to 32KB.

  • Reduce the array to 6 disks

    I destroyed the existing RAID, zeroed the superblock on each disk, and created a new RAID 6 from 6 raw disks with 64KB chunks. Reducing the number of disks seems to make the array live longer, around 4-7 days before it degrades.

  • Update the driver

    I updated the driver to Linux_Driver_RHEL6-7_SLES11-12_P12 (http://www.avagotech.com/products/server-storage/host-bus-adapters/sas-9300-8e). Disk errors still appear, as shown below:

[Tue Aug  2 17:57:48 2016] sd 0:0:6:0: attempting task abort! scmd(ffff880fc0dd1980)
[Tue Aug  2 17:57:48 2016] sd 0:0:6:0: [sdg] CDB:
[Tue Aug  2 17:57:48 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Tue Aug  2 17:57:48 2016] scsi target0:0:6: handle(0x0010), sas_address(0x50030480173ee946), phy(6)
[Tue Aug  2 17:57:48 2016] scsi target0:0:6: enclosure_logical_id(0x50030480173ee97f), slot(6)
[Tue Aug  2 17:57:49 2016] sd 0:0:6:0: task abort: SUCCESS scmd(ffff880fc0dd1980)
[Tue Aug  2 17:57:49 2016] end_request: I/O error, dev sdg, sector 0

Just a few moments ago, the array degraded again. This time /dev/sdf and /dev/sdg show the "attempting task abort! scmd" error:

[Tue Aug  2 21:26:02 2016]  
[Tue Aug  2 21:26:02 2016] sd 0:0:5:0: [sdf] CDB:
[Tue Aug  2 21:26:02 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Tue Aug  2 21:26:02 2016] scsi target0:0:5: handle(0x000f), sas_address(0x50030480173ee945), phy(5)
[Tue Aug  2 21:26:02 2016] scsi target0:0:5: enclosure logical id(0x50030480173ee97f), slot(5)
[Tue Aug  2 21:26:02 2016] scsi target0:0:5: enclosure level(0x0000), connector name(     ^A)
[Tue Aug  2 21:26:03 2016] sd 0:0:5:0: task abort: SUCCESS scmd(ffff88103beb5240)
[Tue Aug  2 21:26:03 2016] sd 0:0:5:0: attempting task abort! scmd(ffff88107934e080)
[Tue Aug  2 21:26:03 2016] sd 0:0:5:0: [sdf] CDB:
[Tue Aug  2 21:26:03 2016] Read(10): 28 00 04 75 3b f8 00 00 08 00
[Tue Aug  2 21:26:03 2016] scsi target0:0:5: handle(0x000f), sas_address(0x50030480173ee945), phy(5)
[Tue Aug  2 21:26:03 2016] scsi target0:0:5: enclosure logical id(0x50030480173ee97f), slot(5)
[Tue Aug  2 21:26:03 2016] scsi target0:0:5: enclosure level(0x0000), connector name(     ^A)
[Tue Aug  2 21:26:03 2016] sd 0:0:5:0: task abort: SUCCESS scmd(ffff88107934e080)
[Tue Aug  2 21:26:04 2016] sd 0:0:5:0: [sdf] CDB:
[Tue Aug  2 21:26:04 2016] Read(10): 28 00 04 75 3b f8 00 00 08 00
[Tue Aug  2 21:26:04 2016] mpt3sas_cm0:         sas_address(0x50030480173ee945), phy(5)
[Tue Aug  2 21:26:04 2016] mpt3sas_cm0:         enclosure logical id(0x50030480173ee97f), slot(5)
[Tue Aug  2 21:26:04 2016] mpt3sas_cm0:         enclosure level(0x0000), connector name(     ^A)
[Tue Aug  2 21:26:04 2016] mpt3sas_cm0:         handle(0x000f), ioc_status(success)(0x0000), smid(35)
[Tue Aug  2 21:26:04 2016] mpt3sas_cm0:         request_len(4096), underflow(4096), resid(-4096)
[Tue Aug  2 21:26:04 2016] mpt3sas_cm0:         tag(65535), transfer_count(8192), sc->result(0x00000000)
[Tue Aug  2 21:26:04 2016] mpt3sas_cm0:         scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[Tue Aug  2 21:26:04 2016] mpt3sas_cm0:         [sense_key,asc,ascq]: [0x06,0x29,0x00], count(18)
[Tue Aug  2 22:14:51 2016] sd 0:0:6:0: attempting task abort! scmd(ffff880931d8c840)
[Tue Aug  2 22:14:51 2016] sd 0:0:6:0: [sdg] CDB:
[Tue Aug  2 22:14:51 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Tue Aug  2 22:14:51 2016] scsi target0:0:6: handle(0x0010), sas_address(0x50030480173ee946), phy(6)
[Tue Aug  2 22:14:51 2016] scsi target0:0:6: enclosure logical id(0x50030480173ee97f), slot(6)
[Tue Aug  2 22:14:51 2016] scsi target0:0:6: enclosure level(0x0000), connector name(     ^A)
[Tue Aug  2 22:14:51 2016] sd 0:0:6:0: task abort: SUCCESS scmd(ffff880931d8c840)
[Tue Aug  2 22:14:52 2016] sd 0:0:6:0: [sdg] CDB:
[Tue Aug  2 22:14:52 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Tue Aug  2 22:14:52 2016] mpt3sas_cm0:         sas_address(0x50030480173ee946), phy(6)
[Tue Aug  2 22:14:52 2016] mpt3sas_cm0:         enclosure logical id(0x50030480173ee97f), slot(6)
[Tue Aug  2 22:14:52 2016] mpt3sas_cm0:         enclosure level(0x0000), connector name(     ^A)
[Tue Aug  2 22:14:52 2016] mpt3sas_cm0:         handle(0x0010), ioc_status(success)(0x0000), smid(85)
[Tue Aug  2 22:14:52 2016] mpt3sas_cm0:         request_len(0), underflow(0), resid(-8192)
[Tue Aug  2 22:14:52 2016] mpt3sas_cm0:         tag(65535), transfer_count(8192), sc->result(0x00000000)
[Tue Aug  2 22:14:52 2016] mpt3sas_cm0:         scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[Tue Aug  2 22:14:52 2016] mpt3sas_cm0:         [sense_key,asc,ascq]: [0x06,0x29,0x00], count(18)
[Tue Aug  2 22:14:52 2016] end_request: I/O error, dev sdg, sector 16
[Tue Aug  2 22:14:52 2016] md: super_written gets error=-5, uptodate=0
[Tue Aug  2 22:14:52 2016] md/raid:md2: Disk failure on sdg, disabling device. md/raid:md2: Operation continuing on 5 devices.
[Tue Aug  2 22:14:52 2016] RAID conf printout:
[Tue Aug  2 22:14:52 2016]  --- level:6 rd:6 wd:5
[Tue Aug  2 22:14:52 2016]  disk 0, o:1, dev:sdc
[Tue Aug  2 22:14:52 2016]  disk 1, o:1, dev:sdd
[Tue Aug  2 22:14:52 2016]  disk 2, o:1, dev:sde
[Tue Aug  2 22:14:52 2016]  disk 3, o:1, dev:sdf
[Tue Aug  2 22:14:52 2016]  disk 4, o:0, dev:sdg
[Tue Aug  2 22:14:52 2016]  disk 5, o:1, dev:sdh
[Tue Aug  2 22:14:52 2016] RAID conf printout:
[Tue Aug  2 22:14:52 2016]  --- level:6 rd:6 wd:5
[Tue Aug  2 22:14:52 2016]  disk 0, o:1, dev:sdc
[Tue Aug  2 22:14:52 2016]  disk 1, o:1, dev:sdd
[Tue Aug  2 22:14:52 2016]  disk 2, o:1, dev:sde
[Tue Aug  2 22:14:52 2016]  disk 3, o:1, dev:sdf
[Tue Aug  2 22:14:52 2016]  disk 5, o:1, dev:sdh

I assume that the "attempting task abort! scmd" errors lead to the array degrading, but I don't know what causes them.

Update 20160806

I set up another server with the same specs, this time without mdadm RAID; each disk is formatted with ext4 and mounted directly. After a while the kernel log showed "attempting task abort! scmd" on some disks. This led to an I/O error on /dev/sdd1, whose filesystem was then remounted read-only:

$ dmesg -T
[Sat Aug  6 05:21:09 2016] sd 0:0:3:0: [sdd] CDB:
[Sat Aug  6 05:21:09 2016] Read(10): 28 00 2d 29 21 00 00 00 20 00
[Sat Aug  6 05:21:09 2016] scsi target0:0:3: handle(0x000a), sas_address(0x4433221103000000), phy(3)
[Sat Aug  6 05:21:09 2016] scsi target0:0:3: enclosure_logical_id(0x500304801a5d3f01), slot(3)
[Sat Aug  6 05:21:09 2016] sd 0:0:3:0: task abort: SUCCESS scmd(ffff88006b206800)
[Sat Aug  6 05:21:09 2016] sd 0:0:3:0: attempting task abort! scmd(ffff88019a3a07c0)
[Sat Aug  6 05:21:09 2016] sd 0:0:3:0: [sdd] CDB:
[Sat Aug  6 05:21:09 2016] Read(10): 28 00 08 46 8f 80 00 00 20 00
[Sat Aug  6 05:21:09 2016] scsi target0:0:3: handle(0x000a), sas_address(0x4433221103000000), phy(3)
[Sat Aug  6 05:21:09 2016] scsi target0:0:3: enclosure_logical_id(0x500304801a5d3f01), slot(3)
[Sat Aug  6 05:21:09 2016] sd 0:0:3:0: task abort: SUCCESS scmd(ffff88019a3a07c0)
[Sat Aug  6 05:21:10 2016] sd 0:0:3:0: attempting device reset! scmd(ffff880f9a49ac80)
[Sat Aug  6 05:21:10 2016] sd 0:0:3:0: [sdd] CDB:
[Sat Aug  6 05:21:10 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Sat Aug  6 05:21:10 2016] scsi target0:0:3: handle(0x000a), sas_address(0x4433221103000000), phy(3)
[Sat Aug  6 05:21:10 2016] scsi target0:0:3: enclosure_logical_id(0x500304801a5d3f01), slot(3)
[Sat Aug  6 05:21:10 2016] sd 0:0:3:0: device reset: SUCCESS scmd(ffff880f9a49ac80)
[Sat Aug  6 05:21:10 2016] mpt3sas0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
[Sat Aug  6 05:21:10 2016] mpt3sas0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
[Sat Aug  6 05:21:10 2016] mpt3sas0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
[Sat Aug  6 05:21:11 2016] end_request: I/O error, dev sdd, sector 780443696
[Sat Aug  6 05:21:11 2016] Aborting journal on device sdd1-8.
[Sat Aug  6 05:21:11 2016] EXT4-fs error (device sdd1): ext4_journal_check_start:56: Detected aborted journal
[Sat Aug  6 05:21:11 2016] EXT4-fs (sdd1): Remounting filesystem read-only
[Sat Aug  6 05:40:35 2016] sd 0:0:5:0: attempting task abort! scmd(ffff88024fc08340)
[Sat Aug  6 05:40:35 2016] sd 0:0:5:0: [sdf] CDB:
[Sat Aug  6 05:40:35 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[Sat Aug  6 05:40:35 2016] scsi target0:0:5: handle(0x000c), sas_address(0x4433221105000000), phy(5)
[Sat Aug  6 05:40:35 2016] scsi target0:0:5: enclosure_logical_id(0x500304801a5d3f01), slot(5)
[Sat Aug  6 05:40:35 2016] sd 0:0:5:0: task abort: FAILED scmd(ffff88024fc08340)
[Sat Aug  6 05:40:35 2016] sd 0:0:5:0: attempting task abort! scmd(ffff88019a12ee00)
[Sat Aug  6 05:40:35 2016] sd 0:0:5:0: [sdf] CDB:
[Sat Aug  6 05:40:35 2016] Read(10): 28 00 27 c8 b4 e0 00 00 20 00
[Sat Aug  6 05:40:35 2016] scsi target0:0:5: handle(0x000c), sas_address(0x4433221105000000), phy(5)
[Sat Aug  6 05:40:35 2016] scsi target0:0:5: enclosure_logical_id(0x500304801a5d3f01), slot(5)
[Sat Aug  6 05:40:35 2016] sd 0:0:5:0: task abort: SUCCESS scmd(ffff88019a12ee00)
[Sat Aug  6 05:40:35 2016] sd 0:0:5:0: attempting task abort! scmd(ffff88203eaddac0)

Update 20160930

After the controller firmware was upgraded to the latest version (currently 12.00.02), the issue disappeared.
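
In case it helps anyone checking their own setup, the running firmware version can be confirmed without rebooting; a rough sketch, assuming the mpt3sas driver and Broadcom/Avago's sas3flash utility (neither command is from the original post, and sas3flash may not be installed by default):

# The mpt3sas driver prints the controller firmware version when it loads:
dmesg | grep -i fwversion

# Broadcom/Avago's sas3flash utility also reports it, if installed:
sas3flash -listall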

Conclusion

The issue was solved by upgrading the controller firmware.

Best Answer

That's a pretty big stripe: (8 - 2) = 6 data disks * 512K chunk = 3MiB, and not an even power of two at that. Bring your array up to 10 disks (8 data + 2 parity) or down to 4 data + 2 parity with a total stripe size of 256K, i.e. 64K per drive. It could be that the cache is mad at you for unaligned writes. You could try putting all the drives in write-through mode before you attempt to reconfigure the array (see the sketch below).
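
As an illustrative sketch of the write-through suggestion (not part of the original answer; the device names are taken from the 8-disk listing above, and the right tool depends on whether the drives present as ATA or SAS):

# Disable the on-drive volatile write cache so writes go straight to media.
# For ATA/SATA devices, hdparm works:
for d in /dev/sd{b,d,e,f,g,h,i,j}; do
    hdparm -W 0 "$d"          # -W 0 = turn drive write caching off
done

# For SAS devices, clear the Write Cache Enable bit with sdparm instead:
# sdparm --clear=WCE --save /dev/sdX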

Update 7/20/16.

At this point I'm convinced that your RAID configuration is the problem. A 3MiB stripe is just odd; even if it's a multiple of your partition offset [1] (1MiB), it's a sub-optimal stripe size for any RAID, SSD or otherwise. It's probably generating tons of unaligned writes, which forces your SSD to free up more pages than it has readily available, pushes it into the garbage collector constantly, and shortens its useful life. The drive just can't free pages fast enough for writes, so when you finally flush the cache to disk (synchronize cache), it literally fails. You do not have a crash-consistent array, i.e. your data is not safe.

That's my theory based on the available information and the time I can spend on it. You now have before you a "growth opportunity" to become a storage expert ;)

Start over. Don't use partitions. Set a system aside and build an array that has a total stripe size of 128K (a little more conservative to start). In a RAID 6 configuration of N total drives, only N-2 drives hold data at any one time and the remaining two store parity information. So if N=6, a 128K stripe requires 32K chunks. You should be able to see now why 8 is kind of an odd number of drives for a RAID 6.
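
A minimal sketch of such a rebuild (my wording, not the answerer's; the device names and the /dev/md2 target are illustrative): six whole disks with 32K chunks, giving a (6 - 2) x 32K = 128K data stripe.

# Wipe any leftover md metadata, then create a 6-disk RAID 6 on whole devices.
mdadm --zero-superblock /dev/sd[b-g]
mdadm --create /dev/md2 --level=6 --raid-devices=6 --chunk=32 /dev/sd[b-g]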

Then run fio [2] against the "raw disk" in direct mode and beat on it until you're confident it's solid. Next add the filesystem and inform it of the underlying stripe size (man mkfs.???). Run fio again, but this time against files (or you'll destroy the filesystem), and confirm the array stays up.
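
A hedged sketch of that test sequence (not from the original answer), assuming fio with the libaio engine, an ext4 filesystem, and the 6-disk/32K-chunk layout above; job sizes, runtimes, and the /mnt/md2 mount point are illustrative:

# 1. Hammer the raw md device with direct (unbuffered) random writes.
#    WARNING: this destroys any data or filesystem on /dev/md2.
fio --name=raw-write --filename=/dev/md2 --direct=1 --ioengine=libaio \
    --rw=randwrite --bs=128k --iodepth=32 --numjobs=4 \
    --runtime=600 --time_based --group_reporting

# 2. Create the filesystem and tell it the RAID geometry: with 4 data disks
#    and a 32K chunk, stride = 32K / 4K blocks = 8, stripe-width = 8 * 4 = 32.
mkfs.ext4 -E stride=8,stripe-width=32 /dev/md2
mount /dev/md2 /mnt/md2

# 3. Repeat the run against files so the filesystem survives, then check
#    /proc/mdstat and dmesg to confirm the array stays clean.
fio --name=fs-write --directory=/mnt/md2 --size=10G --direct=1 \
    --ioengine=libaio --rw=randwrite --bs=128k --iodepth=32 \
    --numjobs=4 --runtime=600 --time_based --group_reporting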

I know this is a lot of "stuff"; just start small, try to understand what it's doing, and keep at it. Tools like blktrace and iostat can help you understand how your applications are writing, which will inform the best stripe/chunk size to use.
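
A couple of starting points, again as a sketch (my examples, not the answerer's):

# Extended per-device I/O statistics, refreshed every second:
iostat -x 1

# Trace individual requests on the md device and decode them live:
blktrace -d /dev/md2 -o - | blkparse -i -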

  1. https://www.percona.com/blog/2011/06/09/aligning-io-on-a-hard-disk-raid-the-theory/
  2. https://wiki.mikejung.biz/Benchmarking#Fio_Random_Write_and_Random_Read_Command_Line_Examples (my fio cheatsheet)