Linux ZFS – How to Resolve Unrecognized Physical Disks

degraded, linux, uuid, zfs

I have a recurring problem with my ZFS pool where ZFS stops recognizing its own, apparently properly labeled, physical devices.

Ubuntu 20.04.2 LTS
5.11.0-44-generic #48~20.04.2-Ubuntu SMP Tue Dec 14 15:36:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
libzfs2linux/now 0.8.3-1ubuntu12.11 amd64 [installed,upgradable to: 0.8.3-1ubuntu12.13]
zfs-zed/now 0.8.3-1ubuntu12.11 amd64 [installed,upgradable to: 0.8.3-1ubuntu12.13]
zfsutils-linux/now 0.8.3-1ubuntu12.11 amd64 [installed,upgradable to: 0.8.3-1ubuntu12.13]

Example scenarios:

  1. I can create a pool, hook up a completely unrelated disk (e.g. a USB external drive), and upon rebooting (with the USB disk attached) ZFS reports one of the disks from its pool as missing.
  2. The same seems to happen when the controller for one (or perhaps more) of the drives is changed.
    All the physical disks are there, all the labels/UUIDs seem to be there; what changes is the device letter assignment.

It's hard to believe that ZFS assembles the pool based on the system's device assignment order while ignoring its own labels/UUIDs, but that is what it looks like.

    agatek@mmstorage:~$ zpool status
          pool: mmdata
         state: DEGRADED
        status: One or more devices could not be used because the label is missing or
            invalid.  Sufficient replicas exist for the pool to continue
            functioning in a degraded state.
        action: Replace the device using 'zpool replace'.
           see: http://zfsonlinux.org/msg/ZFS-8000-4J
          scan: scrub in progress since Sun Jan  9 13:03:23 2022
            650G scanned at 1.58G/s, 188G issued at 468M/s, 22.7T total
            0B repaired, 0.81% done, 0 days 14:00:27 to go
        config:

        NAME                                          STATE     READ WRITE CKSUM
        mmdata                                        DEGRADED     0     0     0
          raidz1-0                                    DEGRADED     0     0     0
            ata-HGST_HDN726040ALE614_K7HJG8HL         ONLINE       0     0     0
            6348126275544519230                       FAULTED      0     0     0  was /dev/sdb1
            ata-HGST_HDN726040ALE614_K3H14ZAL         ONLINE       0     0     0
            ata-HGST_HDN726040ALE614_K4K721RB         ONLINE       0     0     0
            ata-WDC_WD40EZAZ-00SF3B0_WD-WX12D514858P  ONLINE       0     0     0
            ata-ST4000DM004-2CV104_ZTT24X5R           ONLINE       0     0     0
            ata-WDC_WD40EZAZ-00SF3B0_WD-WX62D711SHF4  ONLINE       0     0     0
            sdi                                       ONLINE       0     0     0
    
    errors: No known data errors

agatek@mmstorage:~$ blkid 
/dev/sda1: UUID="E0FD-8D4F" TYPE="vfat" PARTUUID="7600a192-967b-417f-b726-7f5524be71a5"
/dev/sda2: UUID="9d8774ec-051f-4c60-aaa7-82f37dbaa4a4" TYPE="ext4" PARTUUID="425f31b2-f289-496a-911b-a2f8a9bb5c25"
/dev/sda3: UUID="e0b8852d-f781-4891-8e77-d8651f39a55b" TYPE="ext4" PARTUUID="a750bae3-c6ea-40a0-bdfa-0523e358018b"
/dev/sdb1: LABEL="mmdata" UUID="16683979255455566941" UUID_SUB="13253481390530831214" TYPE="zfs_member" PARTLABEL="zfs-5360ecc220877e69" PARTUUID="57fe2215-aa69-2f46-b626-0f2057a2e4a7"
/dev/sdd1: LABEL="mmdata" UUID="16683979255455566941" UUID_SUB="17929921080902463088" TYPE="zfs_member" PARTLABEL="zfs-f6ef14df86c7a6e1" PARTUUID="31a074a3-300d-db45-b9e2-3495f49c4bee"
/dev/sde1: LABEL="mmdata" UUID="16683979255455566941" UUID_SUB="505855664557329830" TYPE="zfs_member" PARTLABEL="zfs-6326993c142e4a03" PARTUUID="37f4954d-67fd-8945-82e6-d0db1f2af12e"
/dev/sdg1: LABEL="mmdata" UUID="16683979255455566941" UUID_SUB="1905592300789522892" TYPE="zfs_member" PARTLABEL="zfs-9d379d5bfd432a2b" PARTUUID="185eff00-196a-a642-9360-0d4532d54ec0"
/dev/sdi1: LABEL="mmdata" UUID="16683979255455566941" UUID_SUB="15862525770363300383" TYPE="zfs_member" PARTLABEL="zfs-3c99aa22a45c59bf" PARTUUID="89f1600a-b58e-c74c-8d5e-6fdd186a6db0"
/dev/sdh1: LABEL="mmdata" UUID="16683979255455566941" UUID_SUB="15292769945216849639" TYPE="zfs_member" PARTLABEL="zfs-ee9e1c9a5bde878c" PARTUUID="2e70d63b-00ba-f842-b82d-4dba33314dd5"
/dev/sdf1: LABEL="mmdata" UUID="16683979255455566941" UUID_SUB="5773484836304595337" TYPE="zfs_member" PARTLABEL="zfs-ee40cf2140012e24" PARTUUID="e5cc3e2a-f7c9-d54e-96de-e62a723a9c3f"
/dev/sdc1: LABEL="mmdata" UUID="16683979255455566941" UUID_SUB="6348126275544519230" TYPE="zfs_member" PARTLABEL="zfs-0d28f0d2715eaff8" PARTUUID="a328981a-7569-294a-bbf6-9d26660e2aad"

For the above pool, here is what happened: one of the devices failed earlier. I hooked up a replacement disk to the second controller and performed the replacement. It was successful and the pool was fine. Next, the failed device was removed from the pool and physically replaced by the replacement disk (a change of controller for that disk). After rebooting, the pool came up in the degraded state shown above, with one of the devices reported missing. The scrub in progress was triggered by running zpool clear.
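The replacement itself was done the standard way, roughly along these lines (a sketch with placeholders, not the exact commands I ran):

    # replace the failed member with the new disk attached to the second controller
    sudo zpool replace mmdata <failed-device-or-guid> /dev/disk/by-id/<new-disk-id>
    # wait for the resilver to finish
    zpool status -v mmdata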

So, as blkid shows, there are 8 disks, all partitioned and (I think) properly labeled, but one of them is not recognized as part of the pool.
What should I do in such situations? It is extremely annoying; resilvering the pool takes days.
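One way to double-check that the label really is intact on the device the pool reports as missing (blkid suggests it is now /dev/sdc1, since that partition carries the faulted GUID 6348126275544519230) would be to dump the ZFS label directly:

    # print the ZFS label(s) stored on the partition
    sudo zdb -l /dev/sdc1

If the label is there, this should show the pool name mmdata and a matching guid.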

Best Answer

If you add any device to the pool using its /dev/sdX path, that path is subject to change, because the Linux kernel does not guarantee any ordering for those device entries.
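You can see how the unstable sdX names map to persistent IDs with a simple listing (the exact ID names will of course differ per system):

    # persistent disk IDs and the sdX devices they currently point to
    ls -l /dev/disk/by-id/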

In your output, you have /dev/sdi as a member of the pool. This can change at any point.

You should try zpool export mmdata to take the pool offline, and then zpool import -d /dev/disk/by-id mmdata to import it again using the persistent IDs for the drives.
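A minimal sketch of that sequence (assuming nothing is actively using the datasets while you do it):

    # unmount the datasets and take the pool offline
    sudo zpool export mmdata
    # re-import, scanning /dev/disk/by-id so members are recorded by persistent ID
    sudo zpool import -d /dev/disk/by-id mmdata
    # check that every data vdev now shows an ata-*/wwn-* name instead of sdX
    zpool status mmdata

After that, the pool should survive reboots and controller changes without members going missing, because the by-id symlinks do not depend on the kernel's detection order.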