Linux – btrfs: Looking for experience with btrfs error modes on bad sectors / read errors / medium errors

btrfs, filesystems, hdfs, linux, storage

While running a Hadoop cluster that uses HDFS (so data is already replicated 3x), we often run into issues with ext4: bad sectors that the filesystem cannot read or write cause ext4 to remount the filesystem read-only.

So far so good, and we are working on replacing the disks, but I've stumbled upon btrfs and its metadata duplication, and I'm interested in how btrfs reacts in such a situation.

Data errors do not matter to us, as the data is already checksummed and replicated by HDFS, but more robust metadata handling (e.g. if metadata can't be written in one place, the duplicated copy is used) would be a benefit: in theory, the read-only remounts and the required fsck should happen much less often if we switched to btrfs…

So, is anyone here running btrfs without RAID on desktop HDDs who can tell me how resilient the filesystem is against medium errors?

E.g. is the metadata duplication on a single disk used to repair broken metadata, or does the filesystem fail anyway? (See the sketch below for the kind of setup I mean.)
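For context, here is roughly the setup I have in mind, as a sketch (the device name and mount point are placeholders, not our actual configuration): a single disk with the data profile "single" and the metadata profile "dup", plus a scrub to verify checksums and, where possible, repair from the duplicate metadata copy:

    # create a single-disk btrfs with duplicated metadata
    # (I believe "dup" is the default metadata profile on single HDDs anyway)
    mkfs.btrfs -d single -m dup /dev/sdX
    mount /dev/sdX /mnt/data

    # verify which profiles are in use
    btrfs filesystem df /mnt/data
    #   Data, single: ...
    #   Metadata, DUP: ...

    # read everything and verify checksums; with DUP metadata a copy that
    # is unreadable or fails its checksum should be repaired from the other copy
    btrfs scrub start -B /mnt/data
    btrfs scrub status /mnt/data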

Best Answer

I am definitely NO expert and have very little experience with filesystems in general, so take what I write with a pinch (or handful) of salt :)

Now, disclaimer aside: btrfs (to the best of my knowledge) is not YET fault tolerant and needs serious work in that department. If my memory serves right, ZFS should serve your requirement better. I had been considering btrfs myself, but I am not yet keen on using it. Incidentally, openSUSE (I think) provides support for it, so maybe you can find some info there.

Please do update if you find a solution elsewhere.

Hope I was of some help.

http://en.wikipedia.org/wiki/ZFS

For ZFS, data integrity is achieved by using a (Fletcher-based) checksum or a (SHA-256) hash throughout the file system tree.[17] Each block of data is checksummed and the checksum value is then saved in the pointer to that block—rather than at the actual block itself. Next, the block pointer is checksummed, with the value being saved at its pointer. This checksumming continues all the way up the file system's data hierarchy to the root node, which is also checksummed, thus creating a Merkle tree.[17] In-flight data corruption or phantom reads/writes (the data written/read checksums correctly but is actually wrong) are undetectable by most filesystems as they store the checksum with the data. ZFS stores the checksum of each block in its parent block pointer so the entire pool self-validates.[18]

When a block is accessed, regardless of whether it is data or meta-data, its checksum is calculated and compared with the stored checksum value of what it "should" be. If the checksums match, the data are passed up the programming stack to the process that asked for it. If the values do not match, then ZFS can heal the data if the storage pool has redundancy via ZFS mirroring or RAID.[19] If the storage pool consists of a single disk, it is possible to provide such redundancy by specifying "copies=2" (or "copies=3"), which means that data will be stored twice (thrice) on the disk, effectively halving (or, for "copies=3", reducing to one third) the storage capacity of the disk.[20] If redundancy exists, ZFS will fetch a copy of the data (or recreate it via a RAID recovery mechanism), and recalculate the checksum—ideally resulting in the reproduction of the originally expected value. If the data passes this integrity check, the system can then update the faulty copy with known-good data so that redundancy can be restored.
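To make the "copies=2" part concrete, something like this is what it would look like on a single disk (the pool and dataset names and the device are placeholders, and as far as I know the copies property only affects data written after it is set):

    # single-disk pool, so no mirror/RAID-Z redundancy
    zpool create tank /dev/sdX

    # store two copies of every data block in this dataset
    # (ZFS already keeps extra "ditto" copies of metadata on its own)
    zfs create tank/hdfs
    zfs set copies=2 tank/hdfs
    zfs get copies tank/hdfs

    # walk the whole pool, verify checksums, and repair from the extra copy
    zpool scrub tank
    zpool status -v tank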