FreeBSD – Force ZFS to ignore checksum errors without removing the offending disk

bsd, filesystems, freebsd, truenas, zfs

EDIT: (see the end of this question) after more digging, this appears to be a system USB issue, not ZFS, that's causing the drives to be kicked. I'll leave this question up for posterity, because I'm still curious if there's an answer, but in the meantime if people have advice on FreeBSD USB devices getting forcefully removed, I'm all ears!

Please approach this question with a sense of humor and don't just downvote because it's a bad idea; sometimes (very rarely!) a user is totally ok with data loss and just needs help loading their footgun! After all, ZFS provides other benefits beyond data integrity, and I'd still rather use it for my bad drives than ext4. If you're the type of sysadmin who reads this with a sly smile and remembers the time they lost data by doing exactly this, this question is for you.

I'm running a pool with some USB drives on a non-critical server with non-critical data, and I don't care if it gets corrupted. I'm trying to set it up so that ZFS does not force-remove USB drives when they experience checksum errors (just like how ext4 or FAT handle this scenario, by not noticing/caring about data loss).

Disclaimer:

To readers landing here via Google trying to fix their ZFS pool: do not attempt anything described in this question or its answers; you will lose your data!

Because the ZFS police love to yell at people who are using USB drives
or have any other non-standard setup: for the sake of this discussion,
assume it's cat videos that I have backed up in 32 other physically
remote places on 128 redundant SSDs. I fully acknowledge that I will lose 100% of
my data unrecoverably on this pool (many times over) if I try to do this.
I'm directing this question to the people who are curious about
just how bad an environment ZFS is capable of running in (the people
who like pushing systems to their breaking points and beyond, just for
fun).

So here's the setup:

  • HP EliteDesk server running FreeNAS-11.2-U5
  • 2x WD Elements 8TB drives connected via USB 3.0
  • unreliable power environment, server and drives are often force rebooted/disconnected with no warning. (yes I have a UPS, no I don't want to use it, I want to break this server, didn't you read the disclaimer 😉?)
  • one mirror pool hdd with the two drives (with failmode=continue set; see the snippet after this list)
  • one drive is stable: even after multiple reboots and force-disconnects, it never seems to report checksum errors or any other issues in ZFS
  • one drive is unreliable, with occasional checksum errors during normal operation (even when not disconnected unexpectedly). The errors seem unrelated to the bad power environment: it'll be running fine for 10+ hours and then suddenly get ejected from the pool due to checksum failures
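
For reference, the failmode bit mentioned in the list is just the standard pool-level property; a minimal sketch of checking/setting it on the pool above:

# Check and set the pool-level failure mode for the pool "hdd".
zpool get failmode hdd
zpool set failmode=continue hdd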

I've confirmed that the unreliable drive's problems are due to a software issue or a hardware issue with the USB bus on the server, and not an unreliable cable or a physical problem with the drive. I confirmed this by plugging it into my MacBook (which has known-good USB ports), zeroing the drive, then writing random data to the entire drive and verifying it (done 3 times, 100% success each time). The drive is almost new, with no SMART indicators below 100% health. However, even if the drive were failing gradually and losing a few bits here and there, I'm ok with that.
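
For what it's worth, the write/verify pass was roughly along these lines (a sketch, not the exact commands I ran; the device node is an example and will differ between macOS and FreeBSD, and this wipes the drive):

# DANGER: this destroys everything on the target drive. /dev/daX is an example node.
# Pass 1: zero the whole drive.
dd if=/dev/zero of=/dev/daX bs=1m

# Pass 2: fill the drive with a reproducible pseudo-random stream
# (a fixed passphrase with -nosalt makes the keystream repeatable).
openssl enc -aes-256-ctr -k verifyseed -nosalt -in /dev/zero | dd of=/dev/daX bs=1m

# Pass 3: regenerate the same stream and compare it byte-for-byte with the drive.
# cmp stopping at "EOF on /dev/daX" with no "differ" output means everything read back intact.
openssl enc -aes-256-ctr -k verifyseed -nosalt -in /dev/zero | cmp /dev/daX -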

Here's the problem:

Whenever the bad drive has checksum errors, ZFS removes it from the pool (Edit: this turned out to be an incorrect assumption; the system kicked it, not ZFS). Unfortunately, FreeNAS does not allow me to re-add it to the pool without physically rebooting, or without unplugging and reconnecting both the USB cable and the drive's power supply. This means I can't script the re-adding process or do it remotely without rebooting the entire server; I'd have to be physically present to unplug things, or have an internet-connected Arduino and a relay wired into both cables.

Possible solutions

I've already done quite a bit of research on whether this sort of thing is possible, and it's been difficult because every time I find a relevant thread, the data integrity police jump in and convince the asker to abandon their unreliable setup instead of ignoring the errors or working around them. I'm resorting to asking here because I haven't been able to find documentation or other answers on how to accomplish this.

  • turning off checksums entirely with zfs set checksum=off hdd (I haven't done this yet because I'd ideally like to keep checksums so I know when the drive is misbehaving; I just want to ignore the failures)
  • a flag that keeps checksumming but ignores checksum errors / attempts to repair them without removing the drive from the pool
  • a ZFS flag that raises the maximum allowable checksum error limit before the drive gets removed (currently the drive gets booted after ~13 errors)
  • a FreeBSD/FreeNAS command that allows me to force-online the device after it got removed, without having to reboot the entire server
  • a FreeBSD/FreeNAS kernel option to force this drive to never be allowed to be removed
  • a FreeBSD sysctl option that magically fixes the USB bus issue causing errors/timeouts on only this drive (unlikely)
  • a ZFS on linux option that does the same thing (I'd be willing to move these drives to my Ubuntu box if I know it's possible to do there)
  • running zpool clear hdd in a loop every 500ms to clear checksum errors before they reach the threshold (see the sketch after this list)
  • I'm also experimenting with setting hw.usb.xhci.use_polling=1 to fix the USB reconnection failure after disconnect, but don't have conclusive results yet
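
To make the last few bullets concrete, here's a rough sketch of what I mean (pool name, 500 ms interval, and the ~13-error threshold are the ones described above; whether any of this actually prevents the kick is exactly what I'm asking):

#!/bin/sh
# None of this is recommended; it deliberately trades integrity for uptime.

# Option: drop checksumming entirely (only affects data written from now on).
# zfs set checksum=off hdd

# Option: poke the USB polling sysctl to see if it helps the flaky bus.
# sysctl hw.usb.xhci.use_polling=1

# Option: clear the pool's error counters in a tight loop so the ~13-error
# threshold is (hopefully) never reached.
while true; do
    zpool clear hdd
    sleep 0.5
done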

I'm really trying to avoid having to resort to ext4 or another filesystem that doesn't force-remove drives after USB errors, because I want all the other ZFS features like snapshots, datasets, send/recv, etc. I'm just trying to ignore/repair data integrity errors without drives getting disconnected.

Relevant logs

This is the dmesg output whenever the drive misbehaves and gets removed:

Jul  7 04:10:35 freenas-lemon ZFS: vdev state changed, pool_guid=13427464797767151426 vdev_guid=11823196300981694957
Jul  7 04:10:35 freenas-lemon ugen0.8: <Western Digital Elements 25A3> at usbus0 (disconnected)
Jul  7 04:10:35 freenas-lemon umass4: at uhub2, port 20, addr 7 (disconnected)
Jul  7 04:10:35 freenas-lemon da4 at umass-sim4 bus 4 scbus7 target 0 lun 0
Jul  7 04:10:35 freenas-lemon da4: <WD Elements 25A3 1021> s/n 5641474A4D56574C detached
Jul  7 04:10:35 freenas-lemon (da4:umass-sim4:4:0:0): Periph destroyed
Jul  7 04:10:35 freenas-lemon umass4: detached
Jul  7 04:10:46 freenas-lemon usbd_req_re_enumerate: addr=9, set address failed! (USB_ERR_IOERROR, ignored)
Jul  7 04:10:52 freenas-lemon usbd_setup_device_desc: getting device descriptor at addr 9 failed, USB_ERR_TIMEOUT
Jul  7 04:10:52 freenas-lemon usbd_req_re_enumerate: addr=9, set address failed! (USB_ERR_IOERROR, ignored)
Jul  7 04:10:58 freenas-lemon usbd_setup_device_desc: getting device descriptor at addr 9 failed, USB_ERR_TIMEOUT
Jul  7 04:10:58 freenas-lemon usb_alloc_device: Failure selecting configuration index 0:USB_ERR_TIMEOUT, port 20, addr 9 (ignored)
Jul  7 04:10:58 freenas-lemon ugen0.8: <Western Digital Elements 25A3> at usbus0
Jul  7 04:10:58 freenas-lemon ugen0.8: <Western Digital Elements 25A3> at usbus0 (disconnected)

This is the zpool status hdd output after the bad drive gets kicked.

  pool: hdd
 state: DEGRADED
status: One or more devices has been removed by the administrator.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Online the device using 'zpool online' or replace the device with
    'zpool replace'.
  scan: scrub repaired 0 in 0 days 00:53:45 with 0 errors on Sun Jul  7 17:19:41 2019
config:

    NAME                                            STATE     READ WRITE CKSUM
    hdd                                             DEGRADED     0     0     0
      mirror-0                                      DEGRADED     0     0     0
        gptid/6a8016b8-a08d-11e9-8e1c-ecb1d765a86d  ONLINE       0     0     0
        11823196300981694957                        REMOVED      0     0     0  was /dev/gptid/6c3950c1-a08d-11e9-8e1c-ecb1d765a86d

errors: No known data errors

Edit:

After some more digging, it looks like other people have experienced this sort of error too. It appears to be either a kernel bug or a USB hardware/software problem with some drives, and not a problem at the ZFS level. The system is kicking the drives, which then causes the ZFS checksum errors, and not the other way around. ZFS has no problem re-importing the drives after a reboot, and it happily fixes the errors and reports no data loss. The USB issues might be related to power management features or other USB commands not being supported on the drive, but I'm skeptical because the two drives are practically identical WD Elements drives bought only a year apart. I'm not sure how to fix it, since camcontrol rescan all doesn't even find the USB device attached anymore after it gets disconnected; it really takes a full reboot, and often a full power cycle of the external drive in addition to the reboot.
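
For completeness, this is the kind of re-attach sequence you'd expect to be able to script (the ugen address and gptid are taken from the logs and zpool status above); in practice the rescan step finds nothing once the device has dropped off the bus, which is why a reboot ends up being required:

# Sketch: try to recover the dropped drive without a full reboot.

# Reset (or power-cycle) the USB device from the host side, if it's still enumerated:
usbconfig -d ugen0.8 reset
# or: usbconfig -d ugen0.8 power_off; sleep 2; usbconfig -d ugen0.8 power_on

# Rescan CAM so a re-attached umass device gets its da peripheral back:
camcontrol rescan all

# If the GEOM provider reappears, bring the vdev back online and let ZFS resilver:
zpool online hdd gptid/6c3950c1-a08d-11e9-8e1c-ecb1d765a86d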

dmesg output during the failure:

ugen0.8: <Western Digital Elements 25A3> at usbus0 (disconnected)
umass4: at uhub0, port 20, addr 7 (disconnected)
(da4:umass-sim4:4:0:0): READ(10). CDB: 28 00 42 78 cd 98 00 01 00 00
(da4:umass-sim4:4:0:0): CAM status: CCB request completed with an error
(da4:umass-sim4:4:0:0): Retrying command
(da4:umass-sim4:4:0:0): READ(10). CDB: 28 00 42 78 cd 98 00 01 00 00
(da4:umass-sim4:4:0:0): CAM status: CCB request completed with an error
(da4:umass-sim4:4:0:0): Retrying command

...(same thing repeated)...

(da4:umass-sim4:4:0:0): READ(10). CDB: 28 00 42 78 f1 98 00 01 00 00
(da4:umass-sim4:4:0:0): CAM status: CCB request completed with an error
(da4:umass-sim4:4:0:0): Error 5, Retries exhausted
da4 at umass-sim4 bus 4 scbus7 target 0 lun 0
da4: <WD Elements 25A3 1021> s/n 5641474A4D56574C detached
(da4:umass-sim4:4:0:0): Periph destroyed
umass4: detached

Best Answer

I too am willingly shooting myself in the foot by using a USB enclosure, and because FreeNAS (well, FreeBSD) is rightfully angry at said enclosure for not implementing the SCSI protocol correctly (almost none do, for some reason), I switched to Linux (with ZoL). Despite the drives not regularly being disconnected (as far as dmesg indicates), the first one is reporting read and then checksum errors, in batches large enough that it kicked that one drive (out of the four attached to the same interface) out of the pool.

I ended up adjusting zfs_checksum_events_per_second from 20 to 0 so that a scrub could complete without the drive getting kicked out again.
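
For reference, this is a ZoL module parameter, so the change looks something like this (standard module-parameter paths, run as root; 0 is simply the value that let the scrub finish here, the default being 20):

# Change at runtime:
echo 0 > /sys/module/zfs/parameters/zfs_checksum_events_per_second

# Make it persistent across reboots/module reloads:
echo 'options zfs zfs_checksum_events_per_second=0' >> /etc/modprobe.d/zfs.conf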

I was unable to determine anything on the BSD side that would kick drives off, but the OP mentioned it was the actual USB interface that was giving up.