Zpool Recovery – Using dd to Scavenge Donor Disks

Tags: data-recovery, illumos, omnios, zfs, zpool

I am in the process of trying to recover a pool that had been degraded and neglected, then had a second mirror member fail, resulting in a faulted pool. For whatever reason, the spare was never pulled in automatically, even though the autoreplace option was set for this pool, but that's beside the point.

This is on an OmniOS server. Pool info is as follows:

  pool: dev-sata1
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-JQ
  scan: resilvered 1.53T in 21h6m with 0 errors on Sat Jun 17 13:18:04 2017
config:

        NAME                       STATE     READ WRITE CKSUM
        dev-sata1                  UNAVAIL    227   623     0  insufficient replicas
          mirror-0                 ONLINE       0     0     0
            c1t5000C5003ECEEC42d0  ONLINE       0     0     0
            c1t5000C5003ED6D008d0  ONLINE       0     0     0
          mirror-1                 ONLINE       0     0     0
            c1t5000C500930358EAd0  ONLINE       0     0     0
            c1t5000C500930318E1d0  ONLINE       0     0     0
          mirror-3                 ONLINE       0     0     0
            c1t5000C5003F362DA7d0  ONLINE       0     0     0
            c1t5000C5003F365D94d0  ONLINE       0     0     0
          mirror-4                 ONLINE       0     0     0
            c1t5000C50064D11652d0  ONLINE       0     0     0
            c1t5000C500668EC894d0  ONLINE       0     0     0
          mirror-5                 ONLINE       0     0     0
            c1t5000C5007A2DBE23d0  ONLINE       0     0     0
            c1t5000C5007A2DF29Cd0  ONLINE       0     0     0
          mirror-6                 UNAVAIL    457 1.22K     5  insufficient replicas
            15606980839703210365   UNAVAIL      0     0     0  was /dev/dsk/c1t5000C5007A2E1359d0s0
            c1t5000C5007A2E1BAEd0  FAULTED     37 1.25K     5  too many errors
          mirror-7                 ONLINE       0     0     0
            c1t5000C5007A34981Bd0  ONLINE       0     0     0
            c1t5000C5007A3929B6d0  ONLINE       0     0     0
        logs
          mirror-2                 ONLINE       0     0     0
            c1t55CD2E404B740DD3d0  ONLINE       0     0     0
            c1t55CD2E404B7591BEd0  ONLINE       0     0     0
        cache
          c1t50025388A0952EB0d0    ONLINE       0     0     0
        spares
          c1t5000C5002CD7AFB6d0    AVAIL

The disk "c1t5000C5007A2E1BAEd0" is currently at a data recovery facility, but they have exhausted the supply of replacement heads, including those from donor disks we have supplied. The disk marked as missing was eventually found, and could potentially be recovered, but it's a last result because I have no idea how out of date it is compared to the rest, and what that would mean for consistency. To be considered a donor, the first 3 letters of the serial needs to match, as well as the site code. I have 4 other disks in the pool that match that criteria and were healthy at the time the pool went down.

So, on to my question: Is it possible for me to substitute the 4 other possibly-donor-compatible disks (based on serial number) with 4 new disks, after using dd to copy each entire donor disk onto its replacement?

I am not clear on whether the pool requires the WWN or serial number to match what it has stored (if it stores anything besides the cache file) when importing a disk, or whether it scans the metadata on each disk to determine whether it can import a pool. If the latter is true, is my strategy of obtaining 4 more donor disks feasible?
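
If it helps, my understanding is that the per-disk metadata in question is the vdev label, which I believe can be dumped with zdb against one of the healthy members, something like the following (device name taken from the status output above; the label being readable via slice 0 is an assumption on my part):

  # Print the ZFS vdev label(s) stored on this mirror member
  zdb -l /dev/dsk/c1t5000C5003ECEEC42d0s0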

Best Answer

Definitely don't use dd! ZFS has a built-in command for this, which is described reasonably well in Oracle's docs. You should be able to use zpool replace tank <old device> <new device> to do the main part of the operation, but there are a couple of ancillary commands involved as well:

The following are the basic steps for replacing a disk:

  • Offline the disk, if necessary, with the zpool offline command.
  • Remove the disk to be replaced.
  • Insert the replacement disk.
  • Run the zpool replace command. For example: zpool replace tank c1t1d0
  • Bring the disk online with the zpool online command.
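
Strung together for this pool, the sequence might look roughly like the sketch below. This is only an illustration: the device name is taken from the status output above, and it assumes the replacement disk shows up under the same name as the failed one, so double-check every name before running anything.

  # Take the faulted member of mirror-6 offline (if the pool state allows it)
  zpool offline dev-sata1 c1t5000C5007A2E1BAEd0

  # Physically swap the disk, then ask ZFS to rebuild onto it.
  # With no new_device given, zpool replace reuses the same device name.
  zpool replace dev-sata1 c1t5000C5007A2E1BAEd0

  # Bring the device back online and watch the resilver progress
  zpool online dev-sata1 c1t5000C5007A2E1BAEd0
  zpool status -v dev-sata1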

The man page also has some additional information:

zpool replace [-f]  pool device [new_device]

 Replaces old_device with new_device.  This is equivalent to attaching
 new_device, waiting for it to resilver, and then detaching
 old_device.

 The size of new_device must be greater than or equal to the minimum
 size of all the devices in a mirror or raidz configuration.

 new_device is required if the pool is not redundant. If new_device is
 not specified, it defaults to old_device.  This form of replacement
 is useful after an existing disk has failed and has been physically
 replaced. In this case, the new disk may have the same /dev path as
 the old device, even though it is actually a different disk.  ZFS
 recognizes this.

 -f  Forces use of new_device, even if it appears to be in use.
     Not all devices can be overridden in this manner.

Of course, it's probably best to try this first on a VM that has virtual disks in a similarly-configured zpool, rather than trying it for the first time on the pool with data you care about recovering.
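
If spinning up a VM is a hassle, another low-risk way to rehearse the procedure is a throwaway pool built on file-backed vdevs. Everything in this sketch (pool name, file paths, sizes) is invented purely for the dry run:

  # Create small sparse backing files and a mirrored scratch pool
  mkfile -n 128m /var/tmp/vdev-a /var/tmp/vdev-b /var/tmp/vdev-c
  zpool create scratchpool mirror /var/tmp/vdev-a /var/tmp/vdev-b

  # Practice the replace: swap vdev-b out for vdev-c, then watch it resilver
  zpool replace scratchpool /var/tmp/vdev-b /var/tmp/vdev-c
  zpool status scratchpool

  # Tear it all down when finished
  zpool destroy scratchpool
  rm /var/tmp/vdev-a /var/tmp/vdev-b /var/tmp/vdev-c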

By the way, this other part of the docs explains a bit more about hot spares and perhaps includes pointers to explain why yours didn't get used. It might be valuable to poke around a bit to make sure it doesn't crap out again next time :(.
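
Once the pool is healthy again, it may also be worth double-checking the relevant pool property; a quick, hedged example (the autoreplace property is standard, but consult your platform's zpool man page):

  # See whether automatic replacement is actually enabled on this pool
  zpool get autoreplace dev-sata1

  # Turn it on if it turns out to be off
  zpool set autoreplace=on dev-sata1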