What happens to missed writes after a zpool clear

software-raid zfs

I am trying to understand ZFS' behaviour under a specific condition, but the documentation is not very explicit about this so I'm left guessing.

Suppose we have a zpool with redundancy. Take the following sequence of events:

  1. A problem arises in the connection between device D and the server. This causes a large number of failures and ZFS therefore faults the device, putting the pool in degraded state.

  2. While the pool is in degraded state, the pool is mutated (data is written and/or changed.)

  3. The connectivity issue is physically repaired such that device D is reliable again.

  4. Knowing that most data on D is valid, and not wanting to stress the pool with a resilver needlessly, the admin instead runs zpool clear pool D. This is indicated by Oracle's documentation as the appropriate action where the fault was due to a transient problem that has been corrected.

I've read that zpool clear only clears the error counter, and restores the device to online status. However, this is a bit troubling, because if that's all it does, it will leave the pool in an inconsistent state!

This is because mutations in step 2 will not have been successfully written to D. Instead, D will reflect the state of the pool prior to the connectivity failure. This is of course not the normative state for a zpool and could lead to hard data loss upon failure of another device – however, the pool status will not reflect this issue!

I would at least assume based on ZFS' robust integrity mechanisms that an attempt to read the mutated data from D would catch the mistakes and repair them. However, this raises two problems:

  1. Reads are not guaranteed to hit all mutations unless a scrub is done; and

  2. Once ZFS does hit the mutated data, it (I'm guessing) might fault the drive again because it would appear to ZFS to be corrupting data, since it doesn't remember the previous write failures.

Theoretically, ZFS could circumvent this problem by keeping track of mutations that occur during a degraded state, and writing them back to D when it's cleared. For some reason I suspect that's not what happens, though.

I'm hoping someone with intimate knowledge of ZFS can shed some light on this aspect.

Best Answer

"Theoretically, ZFS could circumvent this problem by keeping track of mutations that occur during a degraded state, and writing them back to D when it's cleared. For some reason I suspect that's not what happens, though."

Actually, this is almost exactly what it can do in this situation. Every time a disk in a ZFS pool is written to, the current global pool transaction ID is written to the disk. So say, for instance, that the scenario you describe occurs, and the total time between the connection loss and recovery is less than 127 * txg_timeout. That bound makes a lot of gross assumptions about load on the pool and a few other things, so halve it for safety's sake: if txg_timeout is 10 seconds, then roughly 600 seconds, or 10 minutes, is a reasonable window in which to expect this to still work.

At the moment before disconnection, the pool had successfully completed the writes for transaction ID 20192. Time passes, and the disk comes back. By the time the disk is available again, the pool has had a number of transaction groups go through and is at transaction ID 20209. At this point, there is still every possibility ZFS can do what is called a 'quick resilver', where it resilvers the disk, but ONLY for transaction IDs 20193 through 20209, as opposed to a full resilver of the drive. This quickly and efficiently brings the disk back in sync with the rest of the pool.
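The numbers in this example can be checked with a small sketch. It is a toy model, not ZFS code: the function names are made up for illustration, and the 127-transaction-group budget and the halving for safety are simply the rough estimates given above. A loaded pool may also commit transaction groups more often than once per txg_timeout, which would shrink the window.

```python
def quick_resilver_window(txg_timeout_s, txg_budget=127, safety_factor=0.5):
    """Rough number of seconds a device may be absent, per the estimate above.

    txg_budget and safety_factor are the answer's rule-of-thumb values,
    not constants read from ZFS.
    """
    return txg_budget * txg_timeout_s * safety_factor


def missed_txgs(last_txg_on_disk, current_pool_txg):
    """Transaction groups the returning disk never saw (inclusive range)."""
    return range(last_txg_on_disk + 1, current_pool_txg + 1)


window = quick_resilver_window(10)   # 127 * 10 * 0.5 seconds
missed = missed_txgs(20192, 20209)   # the txgs a quick resilver would cover

print(window)                        # 635.0, i.e. roughly 10 minutes
print(missed[0], missed[-1], len(missed))
```

With the example values, only 17 transaction groups (20193 through 20209) need to be replayed onto the disk, rather than its entire contents.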

However, the method to kick off that activity is not 'zpool clear'. If everything works as it should, the resilver should have kicked in automatically the moment the disk became healthy again. In fact, it may have been so fast that you never saw it. In that case, 'zpool clear' would be the proper way to clear the still-visible error count that appeared when the device disappeared in the first place. The 'proper' way to fix this varies depending on the version of ZFS you're using, what OS it is on, how the device is currently being listed by ZFS, and how long it has been in that state. It could actually be 'zpool clear' (clearing the errors, after which the next access of the drive should notice the out-of-sync txg ID and kick off the resilver), or you might need to use 'zpool online' or 'zpool replace'.

What I'm used to seeing, when all this works properly, is the disk disappearing and the drive going into a state of OFFLINE or DEGRADED or FAULTED or UNAVAIL or REMOVED. Then, when the drive becomes accessible again at an OS level, FMA and other OS mechanisms kick in and ZFS becomes aware the disk has returned, and there's a quick resilver and the device appears in zpool status as ONLINE again, but may still have an error count associated with it. The key is it is in ONLINE status, which would indicate automatic repair (resilver) success. You can test it on any drive by pulling it out, waiting a few seconds and checking 'zpool status', and then plugging the disk back in and checking 'zpool status' again and seeing what happens. ZFS isn't the only moving piece here - ZFS actually relies in large part on other OS mechanics to inform it of the disk's status, and if those mechanics fail you'll get different symptoms than if they succeed.
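As a rough aid for the pull-the-disk test described above, the sketch below picks a device's row out of 'zpool status' text and tells "ONLINE but with a leftover error count" apart from "not ONLINE yet". The sample text is an abbreviated, hand-written illustration of the usual column layout (NAME, STATE, READ, WRITE, CKSUM), and the pool name, device names, and helper functions are all hypothetical. Real output has more sections and can differ between platforms and versions.

```python
# Hand-written, abbreviated sample; not captured from a real system.
SAMPLE_STATUS = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0    12     0
"""


def device_state(status_text, device):
    """Return (state, read, write, cksum) for a device row, or None if absent.

    Counters are kept as strings because real output may abbreviate
    large values (for example with a K suffix).
    """
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0] == device:
            state, read, write, cksum = fields[1:5]
            return state, read, write, cksum
    return None


def only_stale_errors(row):
    """True when the device is ONLINE but still shows a nonzero counter."""
    state, *counters = row
    return state == "ONLINE" and any(c != "0" for c in counters)


row = device_state(SAMPLE_STATUS, "sdb")
print(row, only_stale_errors(row))
```

To try this against a live pool, the text could come from something like `subprocess.run(["zpool", "status", "tank"], capture_output=True, text=True).stdout`, with your own pool name in place of the hypothetical `tank`.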

In either event, the quick resilver either runs and succeeds, or it is not possible or fails. In the latter case, the disk will have to complete a full resilver before returning to duty. So the two problems listed at the bottom of your post shouldn't usually be possible, unless an administrative override has allowed a disk with a mismatched txg ID to re-enter a pool without any form of correction for that disparity. If that were to happen, I would suspect the next access to the drive would either kick off that quick resilver (which would succeed, or fail and push the disk to a full resilver), or it would end up kicking the disk out, or possibly panicking, due to the txg ID disparity. In any of those events, what would not happen is data loss or a return of incorrect data to a request.
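The decision described here can be summarised as a toy model. This is only a sketch of the reasoning, not how ZFS is implemented: `reintegration_path` and `oldest_replayable_txg` (the oldest transaction group whose changes the pool can still identify) are hypothetical names introduced for illustration.

```python
def reintegration_path(disk_txg, pool_txg, oldest_replayable_txg):
    """Pick the repair a returning disk would need in this simplified model."""
    if disk_txg > pool_txg:
        raise ValueError("disk claims a newer txg than the pool")
    if disk_txg == pool_txg:
        return "in sync"
    # The first txg the disk missed must still be identifiable by the pool.
    if disk_txg + 1 >= oldest_replayable_txg:
        return "quick resilver"
    return "full resilver"


print(reintegration_path(20192, 20209, 20100))  # short outage
print(reintegration_path(20192, 20500, 20400))  # outage outlasted the window
```

In this model the disk never returns to service without one of the two repairs, which mirrors the point that stale data should not be served from it.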
