When does ZFS “autoreplace” take effect

hard-drive, ubuntu-16.04, zfs

Background

autoreplace is documented as follows:

autoreplace=on | off
Controls automatic device replacement. If set to "off", device replacement must be initiated by the administrator by using the "zpool replace" command. If set to "on", any new device, found in the same physical location as a device that previously belonged to the pool, is automatically formatted and replaced. The default behavior is "off". This property can also be referred to by its shortened column name, "replace".

The following is the current status of that setting in the pool I'm interested in:

root@[...]:/# zpool get autoreplace zfs-pool
NAME      PROPERTY     VALUE    SOURCE
zfs-pool  autoreplace  on       local

So it seems to be enabled.
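For reference, a minimal sketch of how the property is checked and set (pool name zfs-pool as above); if it were still off, enabling it looks like this:

# check the current value of the property
zpool get autoreplace zfs-pool

# enable automatic replacement of devices re-inserted at the same physical location
zpool set autoreplace=on zfs-pool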

Observations

One disk was removed because of S.M.A.R.T.-related errors, and ZFS correctly recognised that the device was no longer available; the mirror containing that disk changed to DEGRADED. Because I had multiple spare disks, I used zpool replace zfs-pool FAULTY_DISK SPARE_DISK to temporarily put one spare in place (see the sketch below). That was necessary because on the Ubuntu 16.04 I'm running, automatic use of spares doesn't work properly, or even at all.
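The manual spare workflow I mean is roughly the following; this is only a sketch, FAULTY_DISK and SPARE_DISK stand for the actual by-path device names, and the zpool add line is only needed if the disk isn't already configured as a spare:

# add a disk to the pool as a hot spare (if not already configured)
zpool add zfs-pool spare /dev/disk/by-path/pci-0000:15:00.0-scsi-0:1:0:3-part3

# manually pull the spare in for the faulted device, since the spare
# did not kick in automatically on this Ubuntu 16.04 setup
zpool replace zfs-pool FAULTY_DISK SPARE_DISK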

After the mirror was in sync again and the new disk had been physically attached, I restarted the system, because otherwise the controllers in use prevent access to the new disk. During startup the controllers recognize new disks and ask whether they should be enabled; if they are, the new disk is available to the OS afterwards. The disk was initialized, partitions were created, etc., and it was fully available at the same physical slot as the faulty one before. The important point is that the OS used the same names for the disk as before: /dev/sdf and /dev/disk/by-path/pci-0000:15:00.0-scsi-0:1:0:1-part*
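To confirm that the replacement really came back under the same names, a check along these lines can be used (a sketch; the device names are the ones mentioned above):

# confirm the by-path symlink points at the newly inserted disk
ls -l /dev/disk/by-path/ | grep "pci-0000:15:00.0-scsi-0:1:0:1"

# confirm the kernel device name and partition layout
lsblk /dev/sdf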

Nevertheless, ZFS didn't automatically use the new disk to replace the former one, even though the pool's status output listed the old disk's identifier as missing along with the path it used to have, which was the same path the new disk had by then been given. I had to issue the replacement manually using zpool replace zfs-pool pci-0000:15:00.0-scsi-0:1:0:1-part3. That made ZFS put the new disk into the correct mirror (because of the identical path), and after resilvering the spare was removed automatically as well.
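For completeness, the manual steps were roughly the following (a sketch; the final zpool detach is only needed if the hot spare is not released automatically once the resilver finishes):

# replace the missing vdev with the new disk that re-appeared at the same path
zpool replace zfs-pool pci-0000:15:00.0-scsi-0:1:0:1-part3

# watch the resilver progress
zpool status zfs-pool

# if the hot spare were not released automatically, detach it manually
zpool detach zfs-pool pci-0000:15:00.0-scsi-0:1:0:3-part3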

NAME                                         STATE     READ WRITE CKSUM
zfs-pool                                     DEGRADED     0     0     0
  mirror-0                                   ONLINE       0     0     0
    pci-0000:05:00.0-scsi-0:1:0:0-part3      ONLINE       0     0     0
    pci-0000:15:00.0-scsi-0:1:0:0-part3      ONLINE       0     0     0
  mirror-1                                   DEGRADED     0     0     0
    pci-0000:05:00.0-scsi-0:1:0:1-part3      ONLINE       0     0     0
    spare-1                                  DEGRADED     0     0     0
      replacing-0                            DEGRADED     0     0     0
        11972718311040401135                 UNAVAIL      0     0     0  was /dev/disk/by-path/pci-0000:15:00.0-scsi-0:1:0:1-part3/old
        pci-0000:15:00.0-scsi-0:1:0:1-part3  ONLINE       0     0     0  (resilvering)
      pci-0000:15:00.0-scsi-0:1:0:3-part3    ONLINE       0     0     0
  mirror-2                                   ONLINE       0     0     0
    pci-0000:05:00.0-scsi-0:1:0:2-part3      ONLINE       0     0     0
    pci-0000:15:00.0-scsi-0:1:0:2-part3      ONLINE       0     0     0
spares
  pci-0000:05:00.0-scsi-0:1:0:3-part3        AVAIL
  pci-0000:15:00.0-scsi-0:1:0:3-part3        INUSE     currently in use

Questions

While the command I used is documented to work that way, I wonder why it was necessary at all with autoreplace enabled. Shouldn't autoreplace have performed that step on its own once the new disk was successfully partitioned? Or is the autoreplace property required for the issued command to work at all? It's not documented as relying on that setting:

zpool replace [-f] pool old_device [new_device]
[…]
new_device is required if the pool is not redundant. If new_device is not specified, it defaults to old_device. This form of replacement is useful after an existing disk has failed and has been physically replaced. In this case, the new disk may have the same /dev/dsk path as the old device, even though it is actually a different disk. ZFS recognizes this.

Best Answer

ZFS depends on ZED to handle auto-replacing a failing/disconnected disk, so you must make sure ZED is running. However, the latest 0.8.x ZED releases have a bug which prevents ZFS from correctly auto-partitioning the replaced disk. Note that this bug is not present in the 0.7.x ZFS/ZED releases.
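A minimal check that ZED is actually running on an Ubuntu/systemd machine (a sketch; zfs-zed is the service name used by Ubuntu's ZFS packaging, adjust if your distribution names it differently):

# check whether the ZFS event daemon is running
systemctl status zfs-zed

# enable and start it if it is not
systemctl enable --now zfs-zed

# for debugging, ZED can also be run in the foreground with verbose output
zed -F -v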

EDIT: some answers based on your comments below:

  • does ZED autoreplace "internally" somehow, or are scripts necessary, as for using hot spares and other actions? ZED handles autoreplace internally in its FMA (fault management agent). In other words, no scripts are required in the agent directory. Those scripts generally run after the FMA and are meant to start corollary actions such as starting a scrub, logging to syslog, etc.

  • where can I find details about the auto-partitioning applied in case of autoreplace? I'm passing individual partitions to ZFS instead of whole disks. Auto-partitioning only works when passing whole disks to ZFS (note that it is ZFS itself, rather than ZED, that partitions the affected disks). When passing existing partitions to ZFS (i.e. using sda1 as a vdev), the partition table is not touched at all; see the sketch after this list.
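To illustrate the difference, here is a sketch with hypothetical device and pool names (testpool, sdx, sdy): with a whole-disk vdev, ZFS labels the disk and creates its own partitions, which is what makes auto-partitioning on autoreplace possible; with a partition vdev, ZFS uses the partition as-is and never touches the partition table.

# whole-disk vdev: ZFS partitions the disk itself
# (typically a large data partition 1 and a small reserved partition 9)
zpool create testpool mirror /dev/sdx /dev/sdy

# partition vdev: ZFS uses the given partitions as-is and leaves the partition table alone
zpool create testpool mirror /dev/sdx3 /dev/sdy3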