Repairing RAID6 with two drive issues using mdadm

mdadm, raid, raid6, software-raid

I have had two disk failures on my RAID6 array. I have added two new disks and did the following:

  • I ran mdadm /dev/md1 --remove on the two failed disks.
  • I set up my RAID on the 1st partition of each disk (for alignment reasons). As the replacement disks are aligned the same way, I ran dd if=/dev/sdg (working disk) of=/dev/sde (new disk) bs=512 count=1 to copy over the partition layout. I am not sure this was the right thing to do, as I may have copied mdadm superblock data as well.
  • I then ran mdadm /dev/md1 --add for each of the two new partitions (rough command sequence sketched below).
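
Roughly, the sequence was as follows (the exact device names for the failed members are from memory, so treat them as placeholders):

# remove the two failed members (placeholder names)
mdadm /dev/md1 --remove /dev/sde1 /dev/sdf1

# copy the MBR (and its partition table) from a known-good disk to each new disk
dd if=/dev/sdg of=/dev/sde bs=512 count=1
dd if=/dev/sdg of=/dev/sdf bs=512 count=1

# re-add the first partition of each new disk, one command per disk
mdadm /dev/md1 --add /dev/sde1
mdadm /dev/md1 --add /dev/sdf1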

I now have this when I run mdadm --detail /dev/md1:

Number   Major   Minor   RaidDevice State
   0       8        1        0      active sync   /dev/sda1
   1       8       17        1      active sync   /dev/sdb1
   6       8       65        2      spare rebuilding   /dev/sde1
   3       0        0        3      removed
   4       8       97        4      active sync   /dev/sdg1
   5       8      113        5      active sync   /dev/sdh1

   7       8       81        -      spare   /dev/sdf1

/proc/mdstat shows one disk as rebuilding, but not the other. I don't think this is right: one slot still shows as 'removed' and doesn't appear to have been replaced properly, even though the new drives have exactly the same letters as the old ones. Here is mdstat:

root@precise:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sdc1[0] sdd1[1]
  1953379136 blocks super 1.2 [2/2] [UU]

md1 : active raid6 sdf1[7](S) sde1[6] sdb1[1] sdh1[5] sda1[0] sdg1[4]
  11720521728 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/4] [UU__UU]
  [>....................]  recovery =  2.2% (65163484/2930130432) finish=361.0min   speed=132257K/sec

unused devices: <none>

I'd like to know whether this looks right, and what I need to do to fix the Number 3 entry so that /dev/sdf1 takes its place. I assume I will then have a healthy array again. What I find odd is that adding /dev/sde1 started a resync, but /dev/sdf1 has not taken the place of Number 3 (Major 0, RaidDevice 3).

All help appreciated

Cheers

Best Answer

First, let me reassure you: if your mdadm array members are partitions (e.g. sda1), the dd was fine and did not copy any mdadm metadata, because the metadata live inside the partition itself, not in the MBR.
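
If you want to double-check that nothing stale was carried over, you can dump the superblock mdadm actually sees on the new member (device name taken from your output; adjust as needed):

# print the md superblock stored inside the partition; the array UUID and
# device role should match the live array, not leftovers from the donor disk
mdadm --examine /dev/sde1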

What you are observing is normal MD RAID behavior. You re-added the new drives using two separate mdadm -a commands, right? In that case, mdadm first rebuilds one drive (keeping the other in "spare" mode) and only then transitions the second drive to "spare rebuilding" status. If you re-add the two drives with a single command (e.g. mdadm /dev/mdX -a /dev/sdX1 /dev/sdY1), the rebuilds run concurrently.
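
Applied to your array, that single-command form would have looked something like this (device names taken from your mdadm --detail output; purely illustrative, since your drives are already added):

# add both new partitions in one call so they rebuild in parallel
mdadm /dev/md1 --add /dev/sde1 /dev/sdf1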

Let's have a look at my failed (test) RAID6 array:

[root@kvm-black test]# mdadm --detail /dev/md200
/dev/md200:
        Version : 1.2
  Creation Time : Mon Feb  9 18:40:59 2015
     Raid Level : raid6
     Array Size : 129024 (126.02 MiB 132.12 MB)
  Used Dev Size : 32256 (31.51 MiB 33.03 MB)
   Raid Devices : 6
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Mon Feb  9 18:51:03 2015
          State : clean, degraded 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : localhost:md200  (local to host localhost)
           UUID : 80ed5f2d:86e764d5:bd6979ed:01c7997e
         Events : 105

    Number   Major   Minor   RaidDevice State
       0       7        0        0      active sync   /dev/loop0
       1       7        1        1      active sync   /dev/loop1
       2       7        2        2      active sync   /dev/loop2
       3       7        3        3      active sync   /dev/loop3
       4       0        0        4      removed
       5       0        0        5      removed

Re-adding the drives using two separate commands (mdadm /dev/md200 -a /dev/loop6; sleep 1; mdadm /dev/md200 -a /dev/loop7) produced the following detail report:

[root@kvm-black test]# mdadm --detail /dev/md200
/dev/md200:
        Version : 1.2
  Creation Time : Mon Feb  9 18:40:59 2015
     Raid Level : raid6
     Array Size : 129024 (126.02 MiB 132.12 MB)
  Used Dev Size : 32256 (31.51 MiB 33.03 MB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Mon Feb  9 18:56:40 2015
          State : clean, degraded, recovering 
 Active Devices : 4
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 2

         Layout : left-symmetric
     Chunk Size : 512K

 Rebuild Status : 9% complete

           Name : localhost:md200  (local to host localhost)
           UUID : 80ed5f2d:86e764d5:bd6979ed:01c7997e
         Events : 134

    Number   Major   Minor   RaidDevice State
       0       7        0        0      active sync   /dev/loop0
       1       7        1        1      active sync   /dev/loop1
       2       7        2        2      active sync   /dev/loop2
       3       7        3        3      active sync   /dev/loop3
       6       7        6        4      spare rebuilding   /dev/loop6
       5       0        0        5      removed

       7       7        7        -      spare   /dev/loop7

After some time:

[root@kvm-black test]# mdadm --detail /dev/md200
/dev/md200:
        Version : 1.2
  Creation Time : Mon Feb  9 18:40:59 2015
     Raid Level : raid6
     Array Size : 129024 (126.02 MiB 132.12 MB)
  Used Dev Size : 32256 (31.51 MiB 33.03 MB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Mon Feb  9 18:57:43 2015
          State : clean 
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : localhost:md200  (local to host localhost)
           UUID : 80ed5f2d:86e764d5:bd6979ed:01c7997e
         Events : 168

    Number   Major   Minor   RaidDevice State
       0       7        0        0      active sync   /dev/loop0
       1       7        1        1      active sync   /dev/loop1
       2       7        2        2      active sync   /dev/loop2
       3       7        3        3      active sync   /dev/loop3
       6       7        6        4      active sync   /dev/loop6
       7       7        7        5      active sync   /dev/loop7

Adding the two drives with a single command (mdadm /dev/md200 -a /dev/loop6 /dev/loop7) leads to this report:

[root@kvm-black test]# mdadm --detail /dev/md200
/dev/md200:
        Version : 1.2
  Creation Time : Mon Feb  9 18:40:59 2015
     Raid Level : raid6
     Array Size : 129024 (126.02 MiB 132.12 MB)
  Used Dev Size : 32256 (31.51 MiB 33.03 MB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Mon Feb  9 18:55:44 2015
          State : clean, degraded, recovering 
 Active Devices : 4
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 2

         Layout : left-symmetric
     Chunk Size : 512K

 Rebuild Status : 90% complete

           Name : localhost:md200  (local to host localhost)
           UUID : 80ed5f2d:86e764d5:bd6979ed:01c7997e
         Events : 122

    Number   Major   Minor   RaidDevice State
       0       7        0        0      active sync   /dev/loop0
       1       7        1        1      active sync   /dev/loop1
       2       7        2        2      active sync   /dev/loop2
       3       7        3        3      active sync   /dev/loop3
       7       7        7        4      spare rebuilding   /dev/loop7
       6       7        6        5      spare rebuilding   /dev/loop6

So, in the end: let mdadm do its magic, then check that all drives are marked as "active sync".
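
If you want to keep an eye on the rebuild in the meantime, something like this works (a minimal sketch, using your array name /dev/md1):

# live view of the rebuild progress
watch -n 5 cat /proc/mdstat

# or query the array state and rebuild percentage directly
mdadm --detail /dev/md1 | grep -E 'State|Rebuild Status'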
