Convert ZFS mirror with 3 active disks into 2 + 1 hot spare

zfs

After installing Proxmox VE 3.4 with ZFS and RAID1 across three disks, I get the following pool:

root@pve:~# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda3    ONLINE       0     0     0
            sdb3    ONLINE       0     0     0
            sdc3    ONLINE       0     0     0

If I understand the setup correctly, all data will be mirrored across three disks. Even though this would survive the simultaneous failure of two drives, I assume there is a performance penalty to writing everything to three disks, and I think two would be enough.

How do I convert sdc3 into a hot spare? I want it to become active automatically and replace the broken disk in case of a drive failure.

Best Answer

Note: using a hot spare in this situation is probably not the best idea. The reasoning is given at the end, after the answer to the question as asked.


The answer to the question as asked:

Before reducing the pool redundancy, I strongly suggest letting one complete scrub run to ensure that all devices are functioning and that there are no latent data errors:

# zpool scrub rpool
... wait for it to finish, check zpool status for the status of the scrub ...

ZFS mirrors allow adding and removing mirror sides (referred to as attaching and detaching), so freeing sdc3 is easy:

# zpool detach rpool sdc3

You can then add it as a spare. You might need to labelclear it first (otherwise ZFS might complain that it is part of an existing pool):

# zpool labelclear /dev/sdc3

Note that after labelclear, ZFS will have no idea how to read the device, so this effectively deletes all data from it. Hence, be careful with the above command.

Then see what adding it as a hot spare would do, without making any changes:

# zpool add -n rpool spare /dev/sdc3

The result of the above should be a configuration similar to:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sda3    ONLINE       0     0     0
        sdb3    ONLINE       0     0     0
      spares
        sdc3    AVAIL

Note that the "spares" section might not show; what's important here is that you don't add another vdev at the same level as mirror-0. In other words, the example below is wrong:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sda3    ONLINE       0     0     0
        sdb3    ONLINE       0     0     0
      sdc3      ONLINE       0     0     0

Once you are satisfied that the command will do what you intend, remove the -n to make the change. In particular, do not pass -f to zpool add unless you are absolutely certain that it will do what you want.
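
For reference, making the change is simply the same command without the dry-run flag, and you can confirm the resulting layout afterwards:

# zpool add rpool spare /dev/sdc3
# zpool status rpool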

Note that the above only deals with configuring the device as a spare; I do not know how to set up automatic replacement (making it a true hot spare) on Proxmox VE specifically.
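
As a hedged pointer only (I have not verified this on Proxmox VE): on ZFS on Linux, hot spare activation is normally handled by the ZFS Event Daemon (zed), so a reasonable first check is that the service is running. The autoreplace pool property is related, but as far as I know it governs replacing a failed device with a new one inserted into the same physical slot, not spare activation:

# systemctl status zfs-zed
... check that zed is active; the service name may differ on other distributions ...
# zpool get autoreplace rpool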


As to why this might not be the best idea:

Remember that the hot spare will require a resilver once it is needed, and it is not available to service any read requests during normal operation. By doing this you are therefore reducing your pool's resilience to failure, and potentially its read performance as well. Currently, if any one of sd[abc]3 fails, you still have two functioning drives providing redundancy. With the hot spare configuration, if either of sd[ab]3 fails, the single remaining drive has to support the full resilver onto the hot spare sdc3 without error; if any read errors are encountered on that single functional drive while resilvering onto the hot spare to bring it up to date, you lose data.

Assuming the HBA can keep up with the load, an N-way mirror has roughly the write performance of a single drive, regardless of the number of devices in the mirror, since every device must be updated before a write is considered done and the writes are issued in parallel across the physical devices. On reads, depending on the particulars, you can get anywhere from the performance of one drive up to the performance of all N drives.

If your workload is heavy on synchronous writes, which is what I would expect to be the most likely cause of write performance contention, consider instead adding a good SSD to use as a SLOG device; that should improve synchronous write performance. If most of your writes are asynchronous (which is normally the default, except for things like NFS), you won't see much difference: asynchronous writes are buffered in RAM, so the three-way mirror having the write performance of only one drive matters little until you run out of ARC RAM, and a two-way mirror wouldn't be any faster anyway. If that is the case, consider adding more RAM for ZFS to use as ARC instead.
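
If you do go the SLOG route, this is roughly what adding one looks like; the device path below is a placeholder for your actual SSD partition, and as with the spare it is worth doing a dry run with -n first:

# zpool add -n rpool log /dev/disk/by-id/your-ssd-partition
# zpool add rpool log /dev/disk/by-id/your-ssd-partition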