Novice Btrfs User – Troubleshooting Checksum Failures and I/O Errors

backup btrfs checksum ubuntu-20.04

A housemate suggested to me that I ought to use btrfs instead of what I've been doing up until now, which is using mdadm with cloned drives, and adding an extra drive into the array to "clone" a backup. The system has three drives, all physically different models:

  • /dev/sda: TOSHIBA HDWQ140
  • /dev/sdb: HGST HUS724040AL
  • /dev/sdc: WDC WDS250G2B0B

Well I've installed btrfs but now it's been running for close to a year and I find out that I should have had a weekly cron job running to "scrub" it. I started trying to set up a script for this, although it seems like a stupidly DIY system that requires you to google a script (the top hit I found was from something like 2014) and install it to keep your filesystem running.

While I was doing all this admin stuff, I found some files that needed to be moved… I'll skip the gory details, but moving the files from one btrfs filesystem to another and back again generated all sorts of "input/output errors" (never seen that with ext4), and even this gem:

Jan  4 21:19:19 host kernel: [9771285.171522] attempt to access beyond end of device
Jan  4 21:19:19 host kernel: [9771285.171522] sda1: rw=1, want=70370535518208, limit=7814035087
Jan  4 21:19:19 host kernel: [9771285.171529] BTRFS error (device sda1): bdev /dev/sda1 errs: wr 1, rd 0, flush 0, corrupt 5, gen 0

I'm assuming these are related. But here's the real stupid thing. I'm getting checksum errors not just on files that have been sitting around for a year, but on files that I literally copied just hours ago to a different physical drive. Also, nearly all of them are on enormous files (things like DVD iso images) if that is any indication of anything?

So yeah, am I seeing a simultaneous triple drive failure, or does btrfs just go around corrupting my files for me?

Also, every post from the knowledgeable btrfs folks includes a cute little "well, you should restore that from backups… you do have backups, don't you". So tell me folks, what exactly do you use to back up a 4TB hard drive? Because I can't exactly, you know, write it out to a DVD, and if hard drives are this unreliable then what good are backups to hard drives?

So serious questions:

  1. Are these checksum errors really normal and expected?
  2. Why am I seeing them on files that were only copied today?
  3. Will regular scrubs be enough to protect against this?
  4. Should I buy new hard drives and throw out all the ones currently in the machine because they really are failing?
  5. How do you recommend backing up multiple-terabyte data drives?

Update 2022-01-07: I ran smartctl on all of the drives and they report no problems at all. Raw UDMA_CRC_Error_Count is 0 for all drives. I tried to restore the corrupted files… extracting the tar file I had copied to the machine failed after a few files with an I/O error. Really no idea what's going on here:

  • If the drives or the cables were bad, this would show up in SMART, right?
  • If the CPU or the memory were bad, the system wouldn't be running flawlessly, would it? (Currently up 115 days with no obvious issues.)
  • If this were an across-the-board bug with btrfs, wouldn't it be all over the internet?

So where could the problem actually be?

Best Answer

I'm answering my own question because I think this is sort of interesting and might be of use to someone.

TL;DR The root cause of the reported problems appears to have been failing DRAM, not failing hard drives.
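
In hindsight, the out-of-range `want=` value in the kernel log from the question fits this picture. The sketch below is my own check, not something btrfs or the kernel provides. It takes the `want` and `limit` numbers from that log line and asks whether flipping a single bit would bring the requested offset back inside the device. The function name `single_bit_fixes` is made up for illustration. A hit is consistent with a flipped bit somewhere in memory, but it doesn't prove it.

```python
import re

# Kernel log line quoted in the question.
LOG_LINE = "sda1: rw=1, want=70370535518208, limit=7814035087"


def single_bit_fixes(want: int, limit: int) -> list[int]:
    """Return the bit positions whose flip brings `want` below `limit`."""
    return [bit for bit in range(want.bit_length()) if (want ^ (1 << bit)) < limit]


match = re.search(r"want=(\d+), limit=(\d+)", LOG_LINE)
want, limit = int(match.group(1)), int(match.group(2))

bits = single_bit_fixes(want, limit)
for bit in bits:
    # For this log line, only bit 46 qualifies.
    print(f"flipping bit {bit} gives {want ^ (1 << bit)} (limit {limit})")
```

For the logged values this reports only bit 46: clearing it turns 70370535518208 into 1791340544, which is comfortably within the 7814035087-sector device.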

  1. No, these checksum errors are not normal or expected. Another system running the same btrfs version was working perfectly well. They indicate that something is wrong, but not necessarily with the disks. See the next item.
  2. They're showing up on newly copied data because there was a major failure of the DRAM in the system, confirmed by a memory test (MemTest86). Only one of the two sticks was bad, and it happened to be the stick mapped to higher memory, so the failures only bit when all of the low memory was in use (rare, but more likely with larger files). This is presumably also why they didn't affect the kernel.
  3. Regular scrubs might have detected the problem earlier, but they can't repair a drive (e.g. /dev/sdc) that is not part of a mirror: a scrub can see the checksum error, but with no second copy of the data it has no hope of correcting it. This is fundamentally a limitation of btrfs, whose checksum only detects corruption. They could have chosen a code with a larger Hamming distance that could also correct small errors, but instead chose one that was faster to compute (I believe).
  4. I bought new hard drives, which can serve as backups, but various SMART tests and other checks suggest the current drives are probably OK. "All drives failing at once" is a good clue that the problem isn't the hard drives.
  5. Large drives have become cheap, and given that the drives themselves don't seem to be the failure point, backing up to hard drives still seems like a valid approach.