Self-Healing Filesystems – Benefits for General Usage

data integrityfilesystemsmdadmraidsoftware-raid

I have recently looked into advanced filesystems (Btrfs, ZFS) for data redundancy and availability and got interested in the additional functionality they provide, especially their "self-healing" capabilities against data corruption.

However, I think I need to take a step back and try to understand if this benefit outweighs their disadvantages (Btrfs bugs and unresolved issues & ZFS availability and performance impact) for general home/SMB-usage, compared to a conventional mdadm-Raid1 + Ext4 solution. A mirrored backup is available either way.

Let's assume I have a couple of file servers which are used for archival purposes and have limited resources, but ECC memory and a stable power source.

How likely am I to even encounter actual data corruption making files unreadable? How?
Can Ext4 or the system file manager already detect data errors on copy/move operations, making me at least aware of a problem?
What happens if one of the madam-Raid1 drives holds different data due to one drive having bad sectors? Will I still be able to retrieve the correct file or will the array be unable to decide which file is the correct one and lose it entirely?

Best Answer

Yes, a functional checksummed filesystem is a very good thing. However, the real motivation is not to be found into the mythical "bitrot" which, while does happen, is very rare. Rather, the main advantage is that such a filesystem provide and end-to-end data checksum, actively protecting you by erroneous disk behavior as misdirected writes and data corruption related to the disk's own private DRAM cache failing and/or misbehaving due to power supply problem.

I experienced that issue first hand, when a Linux RAID 1 array went bad due to a power supply issue. The cache of one disk started corrupting data and the ECC embedded in the disk sectors themselves did not catch anythig, simply because the written data were already corrupted and the ECC was calculated on the corrupted data themselves.

Thanks to its checksummed journal, which detected something strange and suspended the filesystem, XFS limited the damage; however, some files/directories were irremediably corrupted. As this was a backup machine facing no immediate downtime pressure, I rebuilt it with ZFS. When the problem re-occured, during the first scrub ZFS corrected the affected block by reading the good copies from the other disks. Result: no data loss and no downtime. These are two very good reasons to use a checksumming filesystem.

It's worth note that data checksum is so valuable that a device mapper target to provide it (by emulating the T-10 DIF/DIX specs), called dm-integrity, was developed precisely to extend this protection to classical block devices (especially redundant ones as RAID1/5/6). By the virtue of the Stratis project, it is going to be integrated into a comprehensive management CLI/API.

However, you have a point that any potential advantage brought by such filesystem should be compared to the disvantage they inherit. ZFS main problem is that it is not mainlined into the standard kernel, but otherwise is it very fast and stable. On the other hand BTRFS, while mainlined, has many important issues and performance problem (the common suggestion for databases or VMs is to disable CoW which, in turn, disabled checksumming - which is, frankly, not an acceptable answer). Rather then using BTRFS, I would use XFS and hope for the best, or using dm-integrity protected devices.

Related Solutions

Linux – How to make a Linux software RAID1 detect disc corruption

You can force a check of (eg) md0 with

echo "check" > /sys/block/md0/md/sync_action

You can check the state of the test with

cat /sys/block/md0/md/sync_action

while it returns check the check is running, once it returns idle you can do a

cat /sys/block/$dev/md/mismatch_cnt

to see if the mismatch count is zero or not. Many distros automate this check to run eg weekly for you anyway, just as most industrial hardware RAIDs continually run this in the background (they often call it "RAID scrubbing") while the array is otherwise idle. Note that according to the comments in fedora's automated check file, RAID1 writes in the kernel are unbuffered and therefore mismatch counts can be non-zero even for a healthy array if the array is mounted.

So quiescing the arrays by doing this check while the VM is down, if at all possible, is probably a good idea.

I'd add that I agree with the docs when they say that

RAID cannot and is not supposed to guard against data corruption on the media

RAID is supposed to guard against complete failure of a device; guarding against incremental random failures in elements of a storage device is a job for error-checking and block-remapping, which is probably best done in the controller itself. I'm happy that the docs warn people of the limitations of RAID, especially if it's implemented on top of flaky devices. I find that frequent smartctl health checks of my drives help me to stay on top of drives which are starting to show the sort of errors that lead to out-of-sync mirrors.

Transparent Compression Filesystem with ext4 – How to Implement

I use ZFS on Linux as a volume manager and a means to provide additional protections and functionality to traditional filesystems. This includes bringing block-level snapshots, replication, deduplication, compression and advanced caching to the XFS or ext4 filesystems.

See: https://pthree.org/2012/12/21/zfs-administration-part-xiv-zvols/ for another explanation.

In my most common use case, I leverage the ZFS zvol feature to create a sparse volume on an existing zpool. That zvol's properties can be set just like a normal ZFS filesystem's. At this juncture, you can set properties like compression type, volume size, caching method, etc.

Creating this zvol presents a block device to Linux that can be formatted with the filesystem of your choice. Use fdisk or parted to create your partition and mkfs the finished volume.

Mount this and you essentially have a filesystem backed by a zvol and with all of its properties.

Here's my workflow...

Create a zpool comprised of four disks:
You'll want the ashift=12 directive for the type of disks you're using. The zpool name is "vol0" in this case.

zpool create -o ashift=12 -f vol0 mirror scsi-AccOW140403AS1322043 scsi-AccOW140403AS1322042 mirror scsi-AccOW140403AS1322013 scsi-AccOW140403AS1322044

Set initial zpool settings:
I set autoexpand=on at the zpool level in case I ever replace the disks with larger drives or expand the pool in a ZFS mirrors setup. I typically don't use ZFS raidz1/2/3 because of poor performance and the inability to expand the zpool.

zpool set autoexpand=on vol0

Set initial zfs filesystem properties:
Please use the lz4 compression algorithm for new ZFS installations. It's okay to leave it on all the time.

zfs set compression=lz4 vol0
zfs set atime=off vol0

Create ZFS zvol:
For ZFS on Linux, it's very important that you use a large block size. -o volblocksize=128k is absolutely essential here. The -s option creates a sparse zvol and doesn't consume pool space until it's needed. You can overcommit here, if you know your data well. In this case, I have about 444GB of usable disk space in the pool, but I'm presenting an 800GB volume to XFS.

zfs create -o volblocksize=128K -s -V 800G vol0/pprovol

Partition zvol device:
(should be /dev/zd0 for the first zvol; /dev/zd16, /dev/zd32, etc. for subsequent zvols)

fdisk /dev/zd0 # (create new aligned partition with the "c" and "u" parameters)

Create and mount the filesystem:
mkfs.xfs or ext4 on the newly created partition, /dev/zd0p1.

mkfs.xfs -f -l size=256m,version=2 -s size=4096 /dev/zd0p1

Grab the UUID with blkid and modify /etc/fstab.

UUID=455cae52-89e0-4fb3-a896-8f597a1ea402 /ppro       xfs     noatime,logbufs=8,logbsize=256k 1 2

Mount the new filesystem.

mount /ppro/

Results...

[root@Testa ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sde2        20G  8.9G  9.9G  48% /
tmpfs            32G     0   32G   0% /dev/shm
/dev/sde1       485M   63M  397M  14% /boot
/dev/sde7       2.0G   68M  1.9G   4% /tmp
/dev/sde3        12G  2.6G  8.7G  24% /usr
/dev/sde6       6.0G  907M  4.8G  16% /var
/dev/zd0p1      800G  398G  403G  50% /ppro  <-- Compressed ZFS-backed XFS filesystem.
vol0            110G  256K  110G   1% /vol0

ZFS filesystem listing.

[root@Testa ~]# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
vol0           328G   109G   272K  /vol0
vol0/pprovol   326G   109G   186G  -   <-- The actual zvol providing the backing for XFS.
vol1           183G   817G   136K  /vol1
vol1/images    183G   817G   183G  /images

ZFS zpool list.

[root@Testa ~]# zpool list -v
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
vol0   444G   328G   116G    73%  1.00x  ONLINE  -
  mirror   222G   164G  58.1G         -
    scsi-AccOW140403AS1322043      -      -      -         -
    scsi-AccOW140403AS1322042      -      -      -         -
  mirror   222G   164G  58.1G         -
    scsi-AccOW140403AS1322013      -      -      -         -
    scsi-AccOW140403AS1322044      -      -      -         -

ZFS zvol properties (take note of referenced, compressratio and volsize).

[root@Testa ~]# zfs get all vol0/pprovol
NAME          PROPERTY               VALUE                  SOURCE
vol0/pprovol  type                   volume                 -
vol0/pprovol  creation               Sun May 11 15:27 2014  -
vol0/pprovol  used                   326G                   -
vol0/pprovol  available              109G                   -
vol0/pprovol  referenced             186G                   -
vol0/pprovol  compressratio          2.99x                  -
vol0/pprovol  reservation            none                   default
vol0/pprovol  volsize                800G                   local
vol0/pprovol  volblocksize           128K                   -
vol0/pprovol  checksum               on                     default
vol0/pprovol  compression            lz4                    inherited from vol0
vol0/pprovol  readonly               off                    default
vol0/pprovol  copies                 1                      default
vol0/pprovol  refreservation         none                   default
vol0/pprovol  primarycache           all                    default
vol0/pprovol  secondarycache         all                    default
vol0/pprovol  usedbysnapshots        140G                   -
vol0/pprovol  usedbydataset          186G                   -
vol0/pprovol  usedbychildren         0                      -
vol0/pprovol  usedbyrefreservation   0                      -
vol0/pprovol  logbias                latency                default
vol0/pprovol  dedup                  off                    default
vol0/pprovol  mlslabel               none                   default
vol0/pprovol  sync                   standard               default
vol0/pprovol  refcompressratio       3.32x                  -
vol0/pprovol  written                210M                   -
vol0/pprovol  snapdev                hidden                 default

Best Answer

Related Solutions

Linux – How to make a Linux software RAID1 detect disc corruption

Transparent Compression Filesystem with ext4 – How to Implement

Related Topic