I have the following zpool:
    NAME                        STATE     READ WRITE CKSUM
    zfspool                     ONLINE       0     0     0
      mirror-0                  ONLINE       0     0     0
        wwn-0x5000cca266f3d8ee  ONLINE       0     0     0
        wwn-0x5000cca266f1ae00  ONLINE       0     0     0
This morning the host experienced an event that I am still digging into: load was very high and many things weren't working, though I could still log in.
On reboot the host hung during boot waiting on services that relied on data on the above pool.
Suspecting an issue with the pool, I removed one of the drives and rebooted again. The host came online this time.
A scrub showed all the data on the remaining disk was fine. After that completed, I reinserted the drive that was removed. The drive begins resilvering, but gets about 4% through and then restarts.
smartctl shows no issues with either drive (No errors logged, WHEN_FAILED empty).
However, I can't tell which disk is resilvering, and in fact it looks like the pool is fine and doesn't need to be resilvered at all.
errors: No known data errors
root@host1:/var/log# zpool status
  pool: zfspool
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Dec 8 12:20:53 2019
        46.7G scanned at 15.6G/s, 45.8G issued at 15.3G/s, 5.11T total
        0B resilvered, 0.87% done, 0 days 00:05:40 to go
config:
        NAME                        STATE     READ WRITE CKSUM
        zfspool                     ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5000cca266f3d8ee  ONLINE       0     0     0
            wwn-0x5000cca266f1ae00  ONLINE       0     0     0
errors: No known data errors
What is the best course to get out of this resilvering loop? Other answers suggest detaching the drive that is being resilvered, but like I said, it doesn't look like either one is.
edit:
zpool events shows about 1000 repetitions of the following:
Dec 8 2019 13:22:12.493980068 sysevent.fs.zfs.resilver_start
version = 0x0
class = "sysevent.fs.zfs.resilver_start"
pool = "zfspool"
pool_guid = 0x990e3eff72d0c352
pool_state = 0x0
pool_context = 0x0
time = 0x5ded4d64 0x1d7189a4
eid = 0xf89
Dec 8 2019 13:22:12.493980068 sysevent.fs.zfs.history_event
version = 0x0
class = "sysevent.fs.zfs.history_event"
pool = "zfspool"
pool_guid = 0x990e3eff72d0c352
pool_state = 0x0
pool_context = 0x0
history_hostname = "host1"
history_internal_str = "func=2 mintxg=7381953 maxtxg=9049388"
history_internal_name = "scan setup"
history_txg = 0x8a192e
history_time = 0x5ded4d64
time = 0x5ded4d64 0x1d7189a4
eid = 0xf8a
Dec 8 2019 13:22:17.485979213 sysevent.fs.zfs.history_event
version = 0x0
class = "sysevent.fs.zfs.history_event"
pool = "zfspool"
pool_guid = 0x990e3eff72d0c352
pool_state = 0x0
pool_context = 0x0
history_hostname = "host1"
history_internal_str = "errors=0"
history_internal_name = "scan aborted, restarting"
history_txg = 0x8a192f
history_time = 0x5ded4d69
time = 0x5ded4d69 0x1cf7744d
eid = 0xf8b
Dec 8 2019 13:22:17.733979170 sysevent.fs.zfs.history_event
version = 0x0
class = "sysevent.fs.zfs.history_event"
pool = "zfspool"
pool_guid = 0x990e3eff72d0c352
pool_state = 0x0
pool_context = 0x0
history_hostname = "host1"
history_internal_str = "errors=0"
history_internal_name = "starting deferred resilver"
history_txg = 0x8a192f
history_time = 0x5ded4d69
time = 0x5ded4d69 0x2bbfa222
eid = 0xf8c
Dec 8 2019 13:22:17.733979170 sysevent.fs.zfs.resilver_start
version = 0x0
class = "sysevent.fs.zfs.resilver_start"
pool = "zfspool"
pool_guid = 0x990e3eff72d0c352
pool_state = 0x0
pool_context = 0x0
time = 0x5ded4d69 0x2bbfa222
eid = 0xf8d
...
Best Answer
This is now resolved.
The following issue on GitHub provided the answer:
https://github.com/zfsonlinux/zfs/issues/9551
The red flag in this case is probably the rapidly looping "starting deferred resilver" events, as seen in zpool events -v.
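One way to confirm the loop is to count how often that event appears in the event dump. The following is a minimal sketch, not something from the linked issue; the function name count_deferred_resilvers is made up for illustration, and it only matches the event text shown in the question.

```shell
#!/bin/sh
# Count "starting deferred resilver" history events.
# Reads `zpool events -v` text on stdin, so it also works on a saved dump.
# The function name is hypothetical; only the matched string comes from the post.
count_deferred_resilvers() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function usable under `set -e`.
    grep -c 'starting deferred resilver' || true
}

# Typical use on the affected host (not executed here):
#   zpool events -v | count_deferred_resilvers
```

A count that keeps climbing by one every few seconds, while zpool status never gets past a few percent, matches the behaviour described above.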
The first suggestion in the link was to disable the zfs-zed service. In my case, it was not enabled to begin with.
The second suggestion was to verify that the zpool had the resilver_defer feature enabled. It seems there is a potential issue when a pool ends up on a newer ZFS version without the features that correspond to that version being enabled. This pool has moved between multiple machines/operating systems in the past 2 years or so, so it makes sense that it may have been created on an older version of ZFS and is now running on a newer version on the most current host:
After seeing this, I enabled the feature. The github link seemed to suggest this was dangerous, so make sure you have backups.
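For reference, the check and the change can be sketched roughly as follows. This is not the exact command history from this host: the helper name resilver_defer_action is hypothetical, the pool name zfspool is taken from the question, and the real zpool calls are left as comments because enabling a feature flag cannot be undone.

```shell
#!/bin/sh
# Map the value of feature@resilver_defer to a suggested action.
# zpool reports a feature flag as disabled, enabled, or active.
# The helper name is hypothetical and exists only for this sketch.
resilver_defer_action() {
    case "$1" in
        disabled)       echo "enable" ;;   # feature is off; candidate for enabling
        enabled|active) echo "ok" ;;       # nothing to change
        *)              echo "unknown" ;;  # unexpected value; inspect by hand
    esac
}

# On the affected host (not executed here; have backups first):
#   state=$(zpool get -H -o value feature@resilver_defer zfspool)
#   if [ "$(resilver_defer_action "$state")" = "enable" ]; then
#       zpool set feature@resilver_defer=enabled zfspool
#   fi
```

Keeping the decision separate from the zpool calls makes it possible to review what would happen before anything on the pool is changed.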
After that, zpool status showed the resilver progressing further than it had before: