Bit Rot on Hard Drives – Real Problem and Solutions

hard drive, raid, zfs

A friend has been talking to me about the problem of bit rot – bits on drives randomly flipping, corrupting data. It's incredibly rare, but given enough time it could become a problem, and it's impossible to detect.

The drive wouldn't consider it to be a bad sector, and backups would just think the file has changed. There's no checksum involved to validate integrity. Even in a RAID setup, the difference would be detected but there would be no way to know which mirror copy is correct.

Is this a real problem? And if so, what can be done about it? My friend is recommending ZFS as a solution, but I can't imagine flattening our file servers at work and putting Solaris and ZFS on them.

Best Answer

First off: your file system may not have checksums, but your hard drive itself does. Every sector is stored together with an error-correcting code (ECC), so the drive can detect flipped bits and correct a limited number of them; S.M.A.R.T. is the interface through which the drive reports how often that happens. Once one bit too many has flipped, the error can no longer be corrected, of course. And if you're really unlucky, bits can change in such a way that the checksum still matches; then the error won't even be detected. So nasty things can happen, but the claim that a single random bit flip will instantly corrupt your data is bogus.
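To illustrate the "checksum still matches" case, here is a toy sketch. It uses a single parity bit as the checksum, which is an assumption made purely for illustration; real drives use far stronger codes, so the undetected case is much rarer than it looks here. The names `parity` and `flip_bit` are made up for this example.

```python
def parity(data: bytes) -> int:
    """Return 0 or 1: the XOR of every bit in data (a one-bit toy checksum)."""
    p = 0
    for byte in data:
        p ^= bin(byte).count("1") & 1
    return p


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with the bit at bit_index inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


sector = b"important payroll data"
stored = parity(sector)                  # checksum written alongside the data

one_flip = flip_bit(sector, 3)           # one bit rots
two_flips = flip_bit(one_flip, 42)       # a second, different bit rots too

single_flip_detected = parity(one_flip) != stored    # parity changed
double_flip_detected = parity(two_flips) != stored   # parity is back to the stored value
```

A single flip changes the parity and is caught; two flips cancel out and slip through. Stronger checksums push the odds of such a cancellation down dramatically, but never to exactly zero.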

However, yes: when you put trillions of bits on a hard drive, they won't all stay that way forever; that's a real problem. ZFS verifies a checksum every time data is read. This is similar to what your hard drive already does itself, but it is an additional safeguard at the file system level: you sacrifice some space for it and gain resilience against data corruption.
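This is also what answers the "which mirror copy is correct?" worry from the question: if a checksum is stored separately from the data, a read can tell the good copy from the bad one. The following is only a minimal sketch of that idea, not ZFS's actual code; `read_with_self_heal` is a hypothetical name, SHA-256 is just one possible checksum, and the mirrors are simulated with in-memory buffers.

```python
import hashlib


def checksum(block: bytes) -> bytes:
    """Checksum of one block (SHA-256 chosen for the sketch)."""
    return hashlib.sha256(block).digest()


def read_with_self_heal(mirrors: list, expected: bytes) -> bytes:
    """Return the first mirror copy whose checksum matches, repairing the others."""
    good = next((bytes(m) for m in mirrors if checksum(bytes(m)) == expected), None)
    if good is None:
        # Every copy is damaged: the file system cannot help, backups must.
        raise IOError("all copies fail the checksum; restore from backup")
    for m in mirrors:
        if bytes(m) != good:
            m[:] = good  # overwrite the damaged copy with the verified one
    return good


original = b"quarterly report"
expected = checksum(original)          # stored separately from the data
mirror_a = bytearray(original)
mirror_b = bytearray(original)
mirror_a[0] ^= 0x01                    # simulate bit rot on one mirror

data = read_with_self_heal([mirror_a, mirror_b], expected)
```

Plain RAID 1 would only see that the two mirrors disagree. With the checksum, the read returns the intact copy and the rotted one is rewritten.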

When your file system is good enough, the probability of an error occurring without being detected becomes so low that you no longer have to care about it, and you might decide that building checksums into the data storage format you're using is unnecessary.

Either way: no, it's not impossible to detect.

But a file system, by itself, can never be a guarantee that every failure can be recovered from; it's not a silver bullet. You still must have backups and a plan/algorithm for what to do when an error has been detected.
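If switching file systems is not an option, one possible "plan" is to keep your own checksum manifest and check it periodically. The sketch below is an assumption of how that could look, not an established tool: `build_manifest` and `find_suspects` are hypothetical names, and the heuristic is that a file whose content hash changed while its modification time did not is a bit rot suspect rather than a legitimate edit.

```python
import hashlib
import os


def file_digest(path):
    """SHA-256 of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Record the digest and mtime of every file below root."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            manifest[os.path.relpath(path, root)] = {
                "sha256": file_digest(path),
                "mtime_ns": os.stat(path).st_mtime_ns,
            }
    return manifest


def find_suspects(root, manifest):
    """List files whose content changed although their mtime did not."""
    suspects = []
    for rel, entry in manifest.items():
        path = os.path.join(root, rel)
        if not os.path.exists(path):
            continue  # deleted files are a different problem
        if (os.stat(path).st_mtime_ns == entry["mtime_ns"]
                and file_digest(path) != entry["sha256"]):
            suspects.append(rel)
    return sorted(suspects)
```

Anything `find_suspects` returns should be restored from a backup taken before the damage, ideally before the next backup run treats the corrupted file as a normal change.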