Debian – zfs – hot spare, replace, detach: resource is busy

debian, zfs, zfsonlinux

I'm pretty new to zfsonlinux. I've just succeeded in setting up a brand-new server with a Debian root on ZFS. Everything is working fine, but I have an issue with hot spares and disk replacement.

Here is my pool:

NAME                            STATE     READ WRITE CKSUM
mpool                           ONLINE       0     0     0
  mirror-0                      ONLINE       0     0     0
    ata-ST1XXXXXXXXXXA-part1    ONLINE       0     0     0
    ata-ST1XXXXXXXXXXB-part1    ONLINE       0     0     0
  mirror-1                      ONLINE       0     0     0
    ata-ST1XXXXXXXXXXC-part1    ONLINE       0     0     0
    ata-ST1XXXXXXXXXXD-part1    ONLINE       0     0     0
spares  
  ata-ST1XXXXXXXXXXE-part1      AVAIL   
  ata-ST1XXXXXXXXXXF-part1      AVAIL  
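
For reference, this layout corresponds to a create command roughly along these lines (illustrative only: the full by-id names go in place of the shortened ones, and the actual root-on-ZFS setup of course involves more options):

# zpool create mpool \
      mirror ata-ST1XXXXXXXXXXA-part1 ata-ST1XXXXXXXXXXB-part1 \
      mirror ata-ST1XXXXXXXXXXC-part1 ata-ST1XXXXXXXXXXD-part1 \
      spare  ata-ST1XXXXXXXXXXE-part1 ata-ST1XXXXXXXXXXF-part1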

Now I can start with the real fun: disk pulling! I unplug disk C. I still have a working pool, but it is DEGRADED (as expected):

NAME                            STATE     READ WRITE CKSUM
mpool                           DEGRADED     0     0     0
  mirror-0                      ONLINE       0     0     0
    ata-ST1XXXXXXXXXXA-part1    ONLINE       0     0     0
    ata-ST1XXXXXXXXXXB-part1    ONLINE       0     0     0
  mirror-1                      DEGRADED     0     0     0
    ata-ST1XXXXXXXXXXC-part1    UNAVAIL      0     0     0
    ata-ST1XXXXXXXXXXD-part1    ONLINE       0     0     0
spares  
  ata-ST1XXXXXXXXXXE-part1      AVAIL   
  ata-ST1XXXXXXXXXXF-part1      AVAIL   

So far, so good. But when I try to replace disk C with, say, disk E, the pool stays DEGRADED anyway.

# zpool replace mpool ata-ST1XXXXXXXXXXC-part1 ata-ST1XXXXXXXXXXE-part1
cannot open '/dev/disk/by-id/ata-ST1XXXXXXXXXXE-part1': Device or resource busy
(and, after a few seconds:)
Make sure to wait until resilver is done before rebooting.
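
To follow the resilver while waiting, I just poll the pool status, e.g.:

# watch -n 5 zpool status mpool
(re-runs zpool status every 5 seconds until the resilver reports completion and the error counters can be checked)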

So I wait a few seconds to let it resilver (it finishes with 0 errors), and then I've got:

NAME                                STATE     READ WRITE CKSUM
mpool                               DEGRADED     0     0     0
  mirror-0                          ONLINE       0     0     0
    ata-ST1XXXXXXXXXXA-part1        ONLINE       0     0     0
    ata-ST1XXXXXXXXXXB-part1        ONLINE       0     0     0
  mirror-1                          DEGRADED     0     0     0
    spare-0                         UNAVAIL
        ata-ST1XXXXXXXXXXC-part1    UNAVAIL      0     0     0
        ata-ST1XXXXXXXXXXE-part1    ONLINE       0     0     0
    ata-ST1XXXXXXXXXXD-part1        ONLINE       0     0     0
spares  
  ata-ST1XXXXXXXXXXE-part1          INUSE       currently in use   
  ata-ST1XXXXXXXXXXF-part1          AVAIL   

Then, if I zpool detach the C disk (as explained here), my pool comes back ONLINE and everything works fine (but with a pool of only 5 HDDs).
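
For completeness, the detach in question is just:

# zpool detach mpool ata-ST1XXXXXXXXXXC-part1
(after this, spare E stays in mirror-1 as a regular member and the pool reports ONLINE again)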


So here are my questions:

  1. Why is replacing the C disk not enough to rebuild a full pool? As
     explained on the Oracle blog, and here too, I expected that I would not
     have to detach the disk for ZFS to rebuild the pool properly (and it is
     far better to keep traces of the unplugged disk in zpool status, for
     maintenance convenience).
  2. Why does zpool keep telling me that the spare disks are "busy" (they
     really are not)?
  3. See below: how can I automatically get my spare disk back?

EDIT: Even worse, regarding question 1: when I plug disk C back in, ZFS does not put my spare back! So I'm left with one disk fewer:

NAME                                STATE     READ WRITE CKSUM
mpool                               ONLINE       0     0     0
  mirror-0                          ONLINE       0     0     0
    ata-ST1XXXXXXXXXXA-part1        ONLINE       0     0     0
    ata-ST1XXXXXXXXXXB-part1        ONLINE       0     0     0
  mirror-1                          ONLINE       0     0     0
    ata-ST1XXXXXXXXXXE-part1        ONLINE       0     0     0
    ata-ST1XXXXXXXXXXD-part1        ONLINE       0     0     0
spares  
  ata-ST1XXXXXXXXXXF-part1          AVAIL 

Best Answer

Short version:

You have to do it the other way round: replace the failed pool disk (with a new disk or with itself), and after that detach the spare disk from the pool (so that it becomes available to all vdevs again). I assume the spare stays busy as long as the disk it stood in for has not itself been replaced. Detaching this disk or another disk only makes it worse.
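
With the disk names from the question, that order would look roughly like this (a sketch; the resilver has to finish before the detach, and the replacement can of course be a brand-new disk instead of C itself):

# zpool replace mpool ata-ST1XXXXXXXXXXC-part1
(replaces the failed pool disk with itself; append a new by-id name to replace it with a different disk)
# zpool detach mpool ata-ST1XXXXXXXXXXE-part1
(detaches the spare from the spare-0 vdev so it returns to AVAIL and again covers all vdevs)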

Also, as far as I remember, ZoL has no automatic attach/detach for spares based on events; you have to script your own handling or use something like the ZFS Event Daemon (zed).
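
On Debian, a first sanity check might look something like this (a sketch; the unit name is the one shipped with the ZoL packages as far as I know, and note that autoreplace only covers a new disk appearing in the same physical slot, it is not spare handling itself):

# systemctl status zfs-zed
(verify that the ZFS event daemon is actually running, so it can react to fault events at all)
# zpool set autoreplace=on mpool
(optional pool property: automatically start a replace when a disk shows up in the place of a removed one)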


Long version:

Regarding your follow-up comment

If disk C is FAULTED, OK, let's replace it and then detach it. But that screws up my pool, because zpool didn't remember that I used to have a C disk in mirror-1 :/

That depends on how you see it. If you detach a disk from a mirror, it is not relevant anymore. It may be defective, it may get used in another system, it may get replaced under manufacturer warranty. Whatever is done with it, your pool does not care.

If you just detach the disk, the vdev is left degraded; if you instead supply another disk (from an automatic spare, a manual spare or a fully manual replacement), this disk assumes the role of the old one (hence the term replace: the new disk fully takes over the old disk's position and duties).

If you want, you can add the detached disk back to the pool, for example as a spare (so the initial situation is reversed).
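
With the names from the question, that would simply be:

# zpool add mpool spare ata-ST1XXXXXXXXXXC-part1
(adds the previously detached C disk back to the pool, this time as a hot spare)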

How spares work on ZFS systems

Spares only really make sense with automatic activation. ZFS storage arrays as designed by Sun had many similar disks; counts of 18 to 48 disks were not uncommon. They consisted of multiple vdevs, for example 4 x RAID-Z2 for a 24-disk system. Additionally, they were managed by a dedicated administrator, but nobody can work 24/7, so they needed something for first response, and it had to work across all vdevs, because any disk might fail at any moment.

So, if a disk in your second vdev fails late at night, the system automatically takes one of the two configured spares and replaces the faulted disk, so that the pool keeps working as usual (same performance for customers using, for example, a website whose database runs on it). In the morning, the admin reads the failure report and troubleshoots the cause (the command equivalent of each case is sketched after this list):

  • If the disk has died, he might replace it with a replacement disk in the same tray, let it resilver, and the hot spare is then automatically retired back to spare duty, watching for another dead disk where it can act as first response.
  • If no replacement disk is available, he might even make the spare the new data disk, reducing the number of spares temporarily by 1 (until another replacement disk is shipped which will become the new spare).
  • If it was just a controller error dropping the disk, he might even replace it with itself, triggering the same spare renewal as in the first case.
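
Sketched as commands (placeholder names: <pool>, <failed-disk> and <new-disk> stand for whatever names apply on your system; the automatic retirement of the spare in the first and third case assumes automatic spare management as described above):

# zpool replace <pool> <failed-disk> <new-disk>
(case 1: dead disk, physically swapped for a new one; after the resilver the spare goes back to spare duty)
# zpool detach <pool> <failed-disk>
(case 2: no replacement at hand; the spare permanently becomes the data disk, reducing the spare count by one)
# zpool replace <pool> <failed-disk>
(case 3: the disk was only dropped by the controller; replace it "with itself")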

If you think about it the way the engineers designed it for the most common anticipated usage scenario, it makes much more sense. That does not mean you have to do exactly as described; it just might be the reason for the behavior.

Answers to your questions

Why is replacing the C disk not enough to rebuild a full pool? As explained on the Oracle blog, and here too, I expected that I would not have to detach the disk for ZFS to rebuild the pool properly (and it is far better to keep traces of the unplugged disk in zpool status, for maintenance convenience).

As seen above, you can either replace the pool disk (with another disk or with itself), in which case the spare is freed and continues to work as a spare, or you can detach the pool disk, in which case the spare permanently assumes the role of a pool disk and you have to add another spare by hand with zpool add poolname spare diskname (which can be the detached disk or a new one).

Why does zpool keep telling me that the spare disks are "busy" (they really are not)?

I assume it was because of outstanding I/O. That would explain why the operation took just a moment to complete.

See below: how can I automatically get my spare disk back?

  • Enable automatic spare replacement (the default on Solaris/illumos, a bit of a hassle on Linux); see the sketch after this list.
  • Replace the faulted pool disk with zpool replace (instead of detaching it). The detach step is then only needed for the spare disk, after the pool disk has been replaced, and only if you do not have automatic spare management (which makes no sense in my eyes except for specific pool layouts and admin situations).
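
Regarding the first point, my understanding is that on the ZoL releases current at the time of writing the spare zedlets are configured in /etc/zfs/zed.d/zed.rc with settings along the following lines. Treat the exact variable names as an assumption and check the comments in your local zed.rc, as they have changed between releases:

# in /etc/zfs/zed.d/zed.rc (sourced as shell), then restart zed:
ZED_SPARE_ON_IO_ERRORS=1
ZED_SPARE_ON_CHECKSUM_ERRORS=10
(kick in a hot spare after the given number of I/O or checksum errors on a device)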