I just replaced a hard drive that was part of two different redundant pools, and now both pools are unavailable…
Details:
- There are four drives: 2x4TB (da0 and ada1) and 2x3TB (da1 and da2).
- One pool is a RAIDZ1 consisting of both of the 3TB drives in their entireties and the 3TB-parts of the 4TB drives (see the sketch after this list).
- The other pool is a mirror consisting of the remaining space of the two bigger drives.
- I replaced one of the 4TB drives (da0) with another of the same size…
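For concreteness, each of the big drives was split roughly like this (a sketch; the sizes are approximate and the exact gpart invocations are a reconstruction, but the p1 and p2 names are the ones appearing in the outputs below):

gpart create -s gpt da0
gpart add -t freebsd-zfs -s 3T da0    # da0p1: 3TB part, goes into the RAIDZ1
gpart add -t freebsd-zfs da0          # da0p2: remaining ~1TB, goes into the mirror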
I expected both pools to go into "degraded" mode until I partitioned the replacement into the same two parts and added each part to its pool.
Instead the computer rebooted unceremoniously and, upon coming back, both pools are "unavailable":
  pool: aldan
 state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-3C
  scan: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        aldan                     UNAVAIL      0     0     0
          raidz1-0                UNAVAIL      0     0     0
            1257549909357337945   UNAVAIL      0     0     0  was /dev/ada1p1
            1562878286621391494   UNAVAIL      0     0     0  was /dev/da1
            8160797608248051182   UNAVAIL      0     0     0  was /dev/da0p1
            15368186966842930240  UNAVAIL      0     0     0  was /dev/da2
        logs
          4588208516606916331     UNAVAIL      0     0     0  was /dev/ada0e

  pool: lusterko
 state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-3C
  scan: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        lusterko                  UNAVAIL      0     0     0
          mirror-0                UNAVAIL      0     0     0
            623227817903401316    UNAVAIL      0     0     0  was /dev/ada1p2
            7610228227381804026   UNAVAIL      0     0     0  was /dev/da0p2
I have since partitioned the new drive, but attempts to zpool replace are rebuffed with "pool is unavailable". I'm pretty sure that, if I simply disconnect the new drive, both pools will become OK (if degraded). Why are they both "unavailable" now?
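The attempts looked roughly like this (the GUIDs being those of the missing devices in the status output above):

zpool replace aldan 8160797608248051182 da0p1
zpool replace lusterko 7610228227381804026 da0p2

All of the devices are online, according to camcontrol: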
<ATA TOSHIBA MG03ACA4 FL1A> at scbus0 target 0 lun 0 (pass0,da0)
<ATA Hitachi HUS72403 A5F0> at scbus0 target 1 lun 0 (pass1,da1)
<ATA TOSHIBA HDWD130 ACF0> at scbus0 target 2 lun 0 (pass2,da2)
<M4-CT128M4SSD2 0309> at scbus1 target 0 lun 0 (pass3,ada0)
<MB4000GCWDC HPGI> at scbus2 target 0 lun 0 (pass4,ada1)
The OS is FreeBSD-11.3-STABLE/amd64. What's wrong?
Update: no, I didn't explicitly offline the device(s) before unplugging the disk, and it is already on its way back to Amazon. I'm surprised such offlining is necessary: shouldn't ZFS be able to handle the sudden death of any drive? And shouldn't it, likewise, be prepared for a technician replacing the failed drive with another? Why is it throwing a fit like this?
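For the record, the offlining in question would have been something like this (using the device names from the status output above):

zpool offline aldan da0p1
zpool offline lusterko da0p2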
I have backups and can rebuild the pools from scratch, but I'd like to figure out how to avoid doing that. Or, if that's not possible, to file a proper bug-report…
I unplugged the new drive completely, but the pools' status hasn't changed… Maybe I need to reboot; whether or not that helps, it is quite a disappointment.
Update 2: multiple reboots, with and without the new disk attached, did not help. However, zpool import lists both pools just as I'd expect them: degraded (but available!). For example:
   pool: lusterko
     id: 11551312344985814621
  state: DEGRADED
 status: One or more devices are missing from the system.
 action: The pool can be imported despite missing or damaged devices.  The
         fault tolerance of the pool may be compromised if imported.
    see: http://illumos.org/msg/ZFS-8000-2Q
 config:

        lusterko                  DEGRADED
          mirror-0                DEGRADED
            ada1p2                ONLINE
            12305582129131953320  UNAVAIL  cannot open
But zpool status continues to insist that all devices are unavailable… Any hope?
Best Answer
It may also be that you did not offline the old drive prior to removing it. (It's possible that ZFS thinks the logical drives (your pools) are corrupted while the controller thinks they are fine. This can happen if there's a difference in disk cylinder size; a rare case, but it does happen.)
To get out of the situation:

1. Check zpool status.
2. Logically reseat the replaced disk with cfgadm -c unconfigure and cfgadm -c configure.
3. Bring the device back with zpool online zone (zpool status zone should then show it online).
4. Use the zpool replace zone command to replace the disk.
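Note that cfgadm is a Solaris tool and zone is evidently the example pool name; on the asker's FreeBSD system a roughly equivalent sequence, using the pool and device names from the question, might be (a sketch, not a tested recipe):

zpool export aldan      # drop the stale cached configuration (add -f if refused)
zpool import aldan      # re-import; should come up DEGRADED, per Update 2
zpool replace aldan 8160797608248051182 da0p1   # splice in the new partition
zpool status aldan      # should now show a resilver in progress

and likewise for lusterko with da0p2. The export/import pair matters because zpool status reports the cached state of the imported pools, while zpool import re-scans the devices directly, which would explain why the two commands disagree.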