Linux – Why did rebooting cause one side of the ZFS mirror to become UNAVAIL

linux, mirror, udev, zfs, zfsonlinux

I just recently migrated a bulk data storage pool (ZFS On Linux 0.6.2, Debian Wheezy) from a single-device vdev configuration to a two-way mirror vdev configuration.

The previous pool configuration was:

    NAME                     STATE     READ WRITE CKSUM
    akita                    ONLINE       0     0     0
      ST4000NM0033-Z1Z1A0LQ  ONLINE       0     0     0

Everything was fine after the resilver completed (I then ran a scrub, just to have the system go over everything once more and make sure it was all good):

  pool: akita
 state: ONLINE
  scan: scrub repaired 0 in 6h26m with 0 errors on Sat May 17 06:16:06 2014
config:

        NAME                       STATE     READ WRITE CKSUM
        akita                      ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            ST4000NM0033-Z1Z1A0LQ  ONLINE       0     0     0
            ST4000NM0033-Z1Z333ZA  ONLINE       0     0     0

errors: No known data errors

However, after rebooting I got an email notifying me of the fact that the pool was not fine and dandy. I had a look and this is what I saw:

   pool: akita
  state: DEGRADED
 status: One or more devices could not be used because the label is missing or
         invalid.  Sufficient replicas exist for the pool to continue
         functioning in a degraded state.
 action: Replace the device using 'zpool replace'.
    see: http://zfsonlinux.org/msg/ZFS-8000-4J
   scan: scrub in progress since Sat May 17 14:20:15 2014
     316G scanned out of 1,80T at 77,5M/s, 5h36m to go
     0 repaired, 17,17% done
 config:

         NAME                       STATE     READ WRITE CKSUM
         akita                      DEGRADED     0     0     0
           mirror-0                 DEGRADED     0     0     0
             ST4000NM0033-Z1Z1A0LQ  ONLINE       0     0     0
             ST4000NM0033-Z1Z333ZA  UNAVAIL      0     0     0

 errors: No known data errors

The scrub is expected; there is a cron job set up to initiate a full system scrub on reboot. However, I definitely wasn't expecting the new HDD to fall out of the mirror.

I define aliases that map to the /dev/disk/by-id/wwn-* names, and for both of these disks I have given ZFS free rein to use the full disk, including handling partitioning:

# zpool history akita | grep ST4000NM0033
2013-09-12.18:03:06 zpool create -f -o ashift=12 -o autoreplace=off -m none akita ST4000NM0033-Z1Z1A0LQ
2014-05-15.15:30:59 zpool attach -o ashift=12 -f akita ST4000NM0033-Z1Z1A0LQ ST4000NM0033-Z1Z333ZA
#

These are the relevant lines from /etc/zfs/vdev_id.conf (I do notice now that the Z1Z333ZA uses a tab character for separation whereas the Z1Z1A0LQ line uses only spaces, but I honestly don't see how that could be relevant here):

alias ST4000NM0033-Z1Z1A0LQ             /dev/disk/by-id/wwn-0x5000c500645b0fec
alias ST4000NM0033-Z1Z333ZA     /dev/disk/by-id/wwn-0x5000c50065e8414a

When I looked, /dev/disk/by-id/wwn-0x5000c50065e8414a* were there as expected, but /dev/disk/by-vdev/ST4000NM0033-Z1Z333ZA* were not.
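
A check like this can be scripted. The following is a minimal sketch, not part of ZFS On Linux: the function name `missing_vdev_links` is made up for illustration, and it assumes the aliases are declared on `alias NAME PATH` lines as in the excerpt above. It prints every alias that has no symlink in the by-vdev directory.

```shell
#!/bin/sh
# Hypothetical helper: list aliases from a vdev_id.conf-style file that have
# no symlink in the by-vdev directory. Both paths are parameters so the check
# can be tried against copies; the defaults are the paths mentioned above.
missing_vdev_links() {
    conf=${1:-/etc/zfs/vdev_id.conf}
    dir=${2:-/dev/disk/by-vdev}
    # awk's default field splitting treats tabs and spaces alike.
    awk '$1 == "alias" { print $2 }' "$conf" | while read -r name; do
        # -L is true for a symlink even if its target is missing.
        [ -L "$dir/$name" ] || printf '%s\n' "$name"
    done
}
```

Run without arguments, it reads the live configuration; any name it prints is an alias whose symlink udev has not created.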

Issuing sudo udevadm trigger caused the symlinks to show up in /dev/disk/by-vdev. However, ZFS doesn't seem to just realize that they are there (Z1Z333ZA still shows as UNAVAIL). That much I suppose can be expected.

I tried replacing the relevant device, but had no real luck:

# zpool replace akita ST4000NM0033-Z1Z333ZA
invalid vdev specification
use '-f' to override the following errors:
/dev/disk/by-vdev/ST4000NM0033-Z1Z333ZA-part1 is part of active pool 'akita'
#

Both disks are detected during the boot process (dmesg log output showing the relevant drives):

[    2.936065] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.936137] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.937446] ata4.00: ATA-9: ST4000NM0033-9ZM170, SN03, max UDMA/133
[    2.937453] ata4.00: 7814037168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
[    2.938516] ata4.00: configured for UDMA/133
[    2.992080] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    3.104533] ata6.00: ATA-9: ST4000NM0033-9ZM170, SN03, max UDMA/133
[    3.104540] ata6.00: 7814037168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
[    3.105584] ata6.00: configured for UDMA/133
[    3.105792] scsi 5:0:0:0: Direct-Access     ATA      ST4000NM0033-9ZM SN03 PQ: 0 ANSI: 5
[    3.121245] sd 3:0:0:0: [sdb] 7814037168 512-byte logical blocks: (4.00 TB/3.63 TiB)
[    3.121372] sd 3:0:0:0: [sdb] Write Protect is off
[    3.121379] sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[    3.121426] sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    3.122070] sd 5:0:0:0: [sdc] 7814037168 512-byte logical blocks: (4.00 TB/3.63 TiB)
[    3.122176] sd 5:0:0:0: [sdc] Write Protect is off
[    3.122183] sd 5:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[    3.122235] sd 5:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Both drives are connected directly to the motherboard; there is no off-board controller involved.

On impulse, I did:

# zpool online akita ST4000NM0033-Z1Z333ZA

which appears to have worked; Z1Z333ZA is now at least ONLINE and resilvering. At about an hour into the resilver it's scanned 180G and resilvered 24G with 9.77% done, which points to it not doing a full resilver but rather only transferring the dataset delta.
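
For reference, a small sketch of how the affected device could be picked out of the status output before onlining it. The function name `unavail_vdevs` is hypothetical, and the sketch assumes the column layout shown in the `zpool status` output above (NAME, STATE, READ, WRITE, CKSUM).

```shell
#!/bin/sh
# Hypothetical helper: print the names of devices that `zpool status`
# reports as UNAVAIL. Reads status text on stdin so it can be tried on
# saved output. NF >= 5 restricts the match to rows of the config table
# (NAME STATE READ WRITE CKSUM) and skips the two-field "state:" line.
unavail_vdevs() {
    awk 'NF >= 5 && $2 == "UNAVAIL" { print $1 }'
}
```

Typical use would be `zpool status akita | unavail_vdevs`, with each printed name then passed to `zpool online akita NAME` once its symlink exists again.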

I'm honestly not sure whether this issue is related to ZFS On Linux or to udev (it smells a bit like udev, but then why would one drive be detected just fine and not the other?). My question is: how do I make sure the same thing doesn't happen again on the next reboot?

I'll be happy to provide more data on the setup if necessary; just let me know what's needed.

Best Answer

This is a udev issue that seems to be specific to Debian and Ubuntu variants. Most of my ZFS on Linux work is with CentOS/RHEL.

See:
scsi and ata entries for same hard drive under /dev/disk/by-id
and
ZFS on Linux/Ubuntu: Help importing a zpool after Ubuntu upgrade from 13.04 to 13.10, device IDs have changed

I'm not sure what the most deterministic pool device approach for Debian/Ubuntu systems is. For RHEL, I prefer to use device WWNs for general pool devices, although the device name/serial is sometimes useful too. Either way, udev should be able to keep all of this in check.

# zpool status
  pool: vol1
 state: ONLINE
  scan: scrub repaired 0 in 0h32m with 0 errors on Sun Feb 16 17:34:42 2014
config:

        NAME                        STATE     READ WRITE CKSUM
        vol1                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x500000e014609480  ONLINE       0     0     0
            wwn-0x500000e0146097d0  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            wwn-0x500000e0146090c0  ONLINE       0     0     0
            wwn-0x500000e01460fd60  ONLINE       0     0     0
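
To see which kernel device each WWN name currently points at, something like the following sketch can help. The function name `list_wwn_links` is made up for illustration, and it assumes GNU `readlink -f` is available, as it is on typical Linux systems.

```shell
#!/bin/sh
# Hypothetical helper: print each whole-disk wwn-* symlink in a directory
# together with the device node it currently resolves to. The directory is
# a parameter (default: /dev/disk/by-id) so the function can be tried on a
# scratch directory.
list_wwn_links() {
    dir=${1:-/dev/disk/by-id}
    for link in "$dir"/wwn-*; do
        [ -L "$link" ] || continue                      # also skips an unmatched glob
        case $link in *-part[0-9]*) continue ;; esac    # skip partition links
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}
```

Comparing its output before and after a reboot shows whether the WWN names stayed stable even if the sdX names moved around.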

Related Solutions

Missing whole disk device in OpenSolaris

Try:

devfsadm -v

Otherwise, an EFI partition may be required. It should be created with:

format -e

fdisk -E raw-device

What happens to missed writes after a zpool clear

"Theoretically, ZFS could circumvent this problem by keeping track of mutations that occur during a degraded state, and writing them back to D when it's cleared. For some reason I suspect that's not what happens, though."

Actually, this is almost exactly what it can do in this situation. Every time a disk in a ZFS pool is written to, the current global pool transaction ID is written to the disk. Suppose the scenario you describe occurs, and the total time between the connection loss and recovery is less than 127 * txg_timeout. (That makes a lot of gross assumptions about load on the pool and a few other things, so halve it for safety: if txg_timeout is 10 seconds, then roughly 600 seconds, or 10 minutes, is a reasonable time to expect this to still work.)
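
As a worked version of that estimate, here is a minimal sketch. The function name `quick_resilver_window` is made up, and both the figure of 127 transaction groups and the halving for safety are taken from the paragraph above at face value, not from ZFS documentation.

```shell
#!/bin/sh
# Hypothetical helper: rough window, in seconds, during which a quick
# resilver might still be possible for a given txg_timeout (in seconds).
# Uses the 127-txg figure and the one-half safety factor quoted above.
quick_resilver_window() {
    txg_timeout=$1
    echo $(( 127 * txg_timeout / 2 ))
}
```

With a txg_timeout of 10, `quick_resilver_window 10` gives 635, in line with the roughly 600 seconds mentioned above.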

At the moment before disconnection, the pool was able to successfully complete the writes related to transaction ID 20192. Time passes, and the disk comes back. By the time the disk is available again, the pool has had a number of transaction groups go through, and is at transaction ID 20209. At this point, there is still every possibility ZFS can do what is called a 'quick resilver', where it resilvers the disk, but ONLY for transaction IDs 20193 through 20209, as opposed to a full resilver of the drive. This quickly and efficiently gets the disk back in sync with the rest of the pool.
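
The arithmetic in that example can be sketched as follows. This is purely illustrative: `missed_txg_range` is a made-up name and does not query ZFS, it only computes the range from the two transaction IDs it is given.

```shell
#!/bin/sh
# Hypothetical helper: given the last txg written to the returning disk and
# the pool's current txg, print the range a quick resilver would need to
# cover and how many transaction groups that is.
missed_txg_range() {
    last_on_disk=$1
    pool_txg=$2
    first=$(( last_on_disk + 1 ))
    echo "$first-$pool_txg ($(( pool_txg - last_on_disk )) txgs)"
}
```

For the numbers above, `missed_txg_range 20192 20209` prints `20193-20209 (17 txgs)`, which is far less work than rewriting the whole drive.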

However, the method to kick off that activity is not 'zpool clear'. If everything works as it should, the resilver should have been kicked in automatically the moment the disk became healthy again. In fact, it may have been so fast, you never saw it. In which case, 'zpool clear' would be the proper activity to clear up the still-visible error count that would have appeared when the device disappeared in the first place. Depending on the version of zfs you're using, what OS it is on, in what manner the device is being listed by zfs at the moment and how long it has been in that state, the 'proper' way to fix this varies. It could actually be 'zpool clear' (clearing up the errors, and the next access of the drive should notice the out of sync txg id and kick in the resilver) or you might need to use 'zpool online' or 'zpool replace'.

What I'm used to seeing, when all this works properly, is the disk disappearing and the drive going into a state of OFFLINE or DEGRADED or FAULTED or UNAVAIL or REMOVED. Then, when the drive becomes accessible again at an OS level, FMA and other OS mechanisms kick in and ZFS becomes aware the disk has returned, and there's a quick resilver and the device appears in zpool status as ONLINE again, but may still have an error count associated with it. The key is it is in ONLINE status, which would indicate automatic repair (resilver) success. You can test it on any drive by pulling it out, waiting a few seconds and checking 'zpool status', and then plugging the disk back in and checking 'zpool status' again and seeing what happens. ZFS isn't the only moving piece here - ZFS actually relies in large part on other OS mechanics to inform it of the disk's status, and if those mechanics fail you'll get different symptoms than if they succeed.

Either the quick resilver can run and succeeds, or it is not possible or fails. In the latter case, the disk will have to complete a full resilver before returning to duty. The two problems listed at the bottom of your post therefore shouldn't usually be possible, unless an administrative override has allowed a disk with a mismatched txgid to re-enter a pool without any correction for that disparity. If that were to happen, I would suspect the next access to the drive would either kick off that quick resilver (which would succeed, or fail and push the disk to a full resilver), or it would end up kicking the disk out, or possibly panicking, due to the txgid disparity. In any of those events, what would not happen is data loss or a return of incorrect data to a request.
