How to get rid of a stubborn ‘removed’ device in mdadm

mdadm

One of my server's drives failed and so I removed the failed drive from all three relevant arrays, had the drive swapped out, and then added the new drive to the arrays. Two of the arrays worked perfectly. The third added the drive back as a spare, and there's an odd "removed" entry in the mdadm details.

I tried both

mdadm /dev/md2 --remove failed

and

mdadm /dev/md2 --remove detached

as suggested here and here; neither command complained, but neither had any effect, either.

Does anyone know how I can get rid of that entry and get the drive added back properly? (Ideally without resyncing a third time; I've already had to do it twice and it takes hours. But if that's what it takes, that's what it takes.) The new drive is /dev/sda, and the relevant partition is /dev/sda3.
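
For reference, this is roughly the sequence I'd have expected to do the job (a sketch, not a transcript of what I actually ran; device names as in my setup, and --zero-superblock only matters if the partition still carries stale md metadata):

mdadm /dev/md2 --fail /dev/sda3        # only needed if the member is still listed as active
mdadm /dev/md2 --remove /dev/sda3      # detach it from the array
mdadm --zero-superblock /dev/sda3      # wipe any stale md superblock on the partition
mdadm /dev/md2 --add /dev/sda3         # re-add; it should resync straight into the empty slot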

Here's the detail on the array:

# mdadm --detail /dev/md2
/dev/md2:
        Version : 0.90
  Creation Time : Wed Oct 26 12:27:49 2011
     Raid Level : raid1
     Array Size : 729952192 (696.14 GiB 747.47 GB)
  Used Dev Size : 729952192 (696.14 GiB 747.47 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Tue Nov 12 17:48:53 2013
          State : clean, degraded 
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

           UUID : 2fdbf68c:d572d905:776c2c25:004bd7b2 (local to host blah)
         Events : 0.34665

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       19        1      active sync   /dev/sdb3

       2       8        3        -      spare   /dev/sda3
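
For completeness, /proc/mdstat and --examine give two more views of the same state (commands only; I haven't pasted their output here):

cat /proc/mdstat            # kernel's summary: md2 shows up degraded, with sda3 flagged (S) for spare
mdadm --examine /dev/sda3   # the superblock as recorded on the member partition itself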

If it's relevant, it's a 64-bit server. It normally runs Ubuntu, but right now I'm in the data centre's "rescue" OS, which is Debian 7 (wheezy). The "removed" entry was there the last time I was in Ubuntu (it won't, currently, boot from the disk), so I don't think it's some Ubuntu/Debian conflict (and they are, of course, closely related).


Update:

Having done extensive tests with test devices on a local machine, I'm just plain getting anomalous behavior from mdadm with this array. For instance, with /dev/sda3 removed from the array again, I did this:

mdadm /dev/md2 --grow --force --raid-devices=1

And that got rid of the "removed" device, leaving me just with /dev/sdb3. Then I nuked /dev/sda3 (wrote a file system to it, so it no longer carried the old RAID metadata), then:

mdadm /dev/md2 --grow --raid-devices=2

…which gave me an array with /dev/sdb3 in slot 0 and "removed" in slot 1 as you'd expect. Then

mdadm /dev/md2 --add /dev/sda3

…added it — as a spare again. (Another 3.5 hours down the drain.)

So with the rebuilt spare in the array, given that mdadm's man page says

RAID-DEVICES CHANGES

When the number of devices is increased, any hot spares that are present will be activated immediately.

…I grew the array to three devices, to try to activate the "spare":

mdadm /dev/md2 --grow --raid-devices=3

What did I get? Two "removed" devices, and the spare. And yet when I do this with a test array, I don't get this behavior.
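
(For anyone who wants to poke at the same thing without real disks, a loopback test array along these lines is enough; /dev/md9, the file names and the sizes are all arbitrary:)

dd if=/dev/zero of=/tmp/d0.img bs=1M count=128
dd if=/dev/zero of=/tmp/d1.img bs=1M count=128
losetup /dev/loop0 /tmp/d0.img
losetup /dev/loop1 /tmp/d1.img
mdadm --create /dev/md9 --level=1 --raid-devices=2 /dev/loop0 /dev/loop1
# then fail/remove/add members and repeat the --grow experiments above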

So I nuked /dev/sda3 again, used it to create a brand-new array, and am copying the data from the old array to the new one:

rsync -r -t -v --exclude 'lost+found' --progress /mnt/oldarray/* /mnt/newarray

This will, of course, take hours. Hopefully when I'm done, I can stop the old array entirely, nuke /dev/sdb3, and add it to the new array. Hopefully, it won't get added as a spare!
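
(The "brand-new array" step above is the standard degraded-create trick: a two-device RAID1 created with one slot deliberately missing. Roughly, with /dev/md3 as a stand-in name for the new array and ext4 purely as an example file system:)

mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sda3 missing
mkfs.ext4 /dev/md3
mkdir -p /mnt/newarray
mount /dev/md3 /mnt/newarray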

Best Answer

Well, since all of the usual options (listed in my question) failed, I had no choice but to do the following (the final cut-over steps are sketched as commands after the list):

  1. Remove /dev/sda3 from the array

  2. Nuke it

  3. Create a new degraded array containing it and an empty slot

  4. rsync the files from the old array to the new one

  5. Stop the old array

  6. Nuke /dev/sdb3

  7. Add /dev/sdb3 to the new array
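
(In command form, steps 5-7 come down to something like this; /dev/md3 is the same stand-in name for the new array as in the update above:)

mdadm --stop /dev/md2                  # stop the old, misbehaving array
mdadm --zero-superblock /dev/sdb3      # wipe the remaining member's old metadata
mdadm /dev/md3 --add /dev/sdb3         # add it to the new array; it rebuilds into the empty slot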

It started off saying "spare, rebuilding", but once it was rebuilt, it was added to the array as an active drive.
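
(The rebuild itself can be followed from /proc/mdstat or --detail; again, /dev/md3 is just my stand-in name:)

watch -n 10 cat /proc/mdstat    # recovery percentage ticks up while it rebuilds
mdadm --detail /dev/md3         # the new member goes from "spare rebuilding" to "active sync" when it's done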

Of course, this meant dealing with the knock-on effects of the array having changed (and as this was the root file system, those were a royal pain).
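
(For a root-filesystem array on Debian/Ubuntu, those knock-on chores look roughly like this; the exact steps depend on the setup:)

mdadm --detail --scan >> /etc/mdadm/mdadm.conf   # record the new array so it assembles at boot (prune stale ARRAY lines for the old md2 too)
update-initramfs -u                              # rebuild the initramfs with the new array definition
# plus fixing /etc/fstab and the bootloader config if they reference the old array's name or UUID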

As far as I can tell, something had got corrupted in the definition of the previous array, because:

A) Adding the drive should have Just Worked(tm) like it did with the other two,

and

B) If not, shrinking and growing the array should have worked.