You can force a check of an array (md0, for example) with
echo "check" > /sys/block/md0/md/sync_action
You can check the state of the check with
cat /sys/block/md0/md/sync_action
While it returns check, the check is running. Once it returns idle, you can do a
cat /sys/block/md0/md/mismatch_cnt
to see if the mismatch count is zero or not. Many distros automate this check to run (e.g., weekly) for you anyway, just as most industrial hardware RAIDs continually run this in the background (they often call it "RAID scrubbing") while the array is otherwise idle. Note that according to the comments in Fedora's automated check file, RAID1 writes in the kernel are unbuffered, and therefore mismatch counts can be non-zero even for a healthy array if the array is mounted.
So quiescing the arrays by doing this check while the VM is down, if at all possible, is probably a good idea.
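The steps above can be strung together in a few shell functions. This is a minimal sketch, not what any distro's scheduled job actually does: the function names, the optional second argument (an alternate sysfs root, handy for dry runs against a scratch directory) and the MD_POLL_SECONDS variable are all made up for illustration.

```shell
# Hypothetical helpers around the md sysfs files described above.
# $1 = array name (e.g. md0); $2 = sysfs root (assumption: defaults to /sys/block).

md_dir() {
    printf '%s/%s/md\n' "${2:-/sys/block}" "$1"
}

# Ask the kernel to start a check of the array.
md_start_check() {
    echo check > "$(md_dir "$1" "$2")/sync_action"
}

# Block until sync_action reports "idle" again.
md_wait_idle() {
    while [ "$(cat "$(md_dir "$1" "$2")/sync_action")" != "idle" ]; do
        sleep "${MD_POLL_SECONDS:-60}"
    done
}

# Print a one-line verdict from mismatch_cnt; return non-zero on mismatches.
md_report() {
    count=$(cat "$(md_dir "$1" "$2")/mismatch_cnt")
    if [ "$count" -eq 0 ]; then
        echo "$1: no mismatches"
    else
        echo "$1: mismatch_cnt=$count"
        return 1
    fi
}
```

As root, something like `md_start_check md0 && md_wait_idle md0 && md_report md0` would run the whole sequence. Keep in mind the caveat above: a non-zero count on a mounted RAID1 array is not necessarily a sign of trouble.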
I'd add that I agree with the docs when they say that "RAID cannot and is not supposed to guard against data corruption on the media."
RAID is supposed to guard against complete failure of a device; guarding against incremental random failures in elements of a storage device is a job for error-checking and block-remapping, which is probably best done in the controller itself. I'm happy that the docs warn people of the limitations of RAID, especially if it's implemented on top of flaky devices. I find that frequent smartctl health checks of my drives help me to stay on top of drives which are starting to show the sort of errors that lead to out-of-sync mirrors.
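For what it's worth, a health check along those lines can be scripted. The sketch below is an assumption about how one might do it, not my exact setup: the function names are invented, and it only looks at the overall verdict lines that smartctl -H prints for ATA and SCSI drives.

```shell
# Classify the output of `smartctl -H <device>` read from stdin.
# Prints "ok", "failing", or "unknown" (no verdict line found).
smart_verdict() {
    awk '
        /SMART overall-health self-assessment test result:/ { v = ($NF == "PASSED") ? "ok" : "failing" }
        /SMART Health Status:/                              { v = ($NF == "OK") ? "ok" : "failing" }
        END { print (v == "" ? "unknown" : v) }
    '
}

# Run the check for each device given on the command line (needs root).
check_drives() {
    for dev in "$@"; do
        printf '%s: %s\n' "$dev" "$(smartctl -H "$dev" 2>/dev/null | smart_verdict)"
    done
}
```

Usage would be something like `check_drives /dev/sda /dev/sdb` from cron. The overall verdict is coarse; the attribute table from `smartctl -A` tends to show trouble earlier.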
I use ZFS on Linux as a volume manager and a means to provide additional protections and functionality to traditional filesystems. This includes bringing block-level snapshots, replication, deduplication, compression and advanced caching to the XFS or ext4 filesystems.
See: https://pthree.org/2012/12/21/zfs-administration-part-xiv-zvols/ for another explanation.
In my most common use case, I leverage the ZFS zvol feature to create a sparse volume on an existing zpool. That zvol's properties can be set just like a normal ZFS filesystem's. At this juncture, you can set properties like compression type, volume size, caching method, etc.
Creating this zvol presents a block device to Linux that can be formatted with the filesystem of your choice. Use fdisk or parted to create your partition and mkfs the finished volume.
Mount this and you essentially have a filesystem backed by a zvol and with all of its properties.
Here's my workflow...
Create a zpool composed of four disks:
You'll want the ashift=12 directive for the type of disks you're using. The zpool name is "vol0" in this case.
zpool create -o ashift=12 -f vol0 \
    mirror scsi-AccOW140403AS1322043 scsi-AccOW140403AS1322042 \
    mirror scsi-AccOW140403AS1322013 scsi-AccOW140403AS1322044
Set initial zpool settings:
I set autoexpand=on at the zpool level in case I ever replace the disks with larger drives or expand the pool in a ZFS mirrors setup. I typically don't use ZFS raidz1/2/3 because of poor performance and the inability to expand the zpool.
zpool set autoexpand=on vol0
Set initial zfs filesystem properties:
Please use the lz4 compression algorithm for new ZFS installations. It's okay to leave it on all the time.
zfs set compression=lz4 vol0
zfs set atime=off vol0
Create ZFS zvol:
For ZFS on Linux, it's very important that you use a large block size. -o volblocksize=128K is absolutely essential here. The -s option creates a sparse zvol, which doesn't consume pool space until it's needed. You can overcommit here, if you know your data well. In this case, I have about 444GB of usable disk space in the pool, but I'm presenting an 800GB volume to XFS.
zfs create -o volblocksize=128K -s -V 800G vol0/pprovol
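Since an overcommitted sparse zvol only works while the pool still has room, it's worth keeping an eye on the ratio. A rough sketch of that arithmetic follows; the function name is made up, and the commented invocation assumes an OpenZFS release where `zfs get` and `zpool list` accept -H and -p (parsable, exact byte values).

```shell
# Report how far a sparse zvol is overcommitted relative to its pool.
# $1 = zvol volsize, $2 = pool size, $3 = pool free (all in the same unit).
overcommit_report() {
    awk -v vol="$1" -v size="$2" -v free="$3" 'BEGIN {
        printf "volsize is %.2fx the pool size; %.0f%% of the pool is free\n",
               vol / size, 100 * free / size
    }'
}

# Example against the pool and zvol used here (not run as part of this sketch):
#   overcommit_report \
#       "$(zfs get -Hp -o value volsize vol0/pprovol)" \
#       "$(zpool list -Hp -o size vol0)" \
#       "$(zpool list -Hp -o free vol0)"
```

If the free percentage keeps shrinking while the filesystem on top still reports plenty of space, that's the point to add disks or prune snapshots.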
Partition zvol device:
(should be /dev/zd0 for the first zvol; /dev/zd16, /dev/zd32, etc. for subsequent zvols)
fdisk /dev/zd0 # (create new aligned partition with the "c" and "u" parameters)
Create and mount the filesystem:
Create an XFS or ext4 filesystem on the newly created partition, /dev/zd0p1.
mkfs.xfs -f -l size=256m,version=2 -s size=4096 /dev/zd0p1
Grab the UUID with blkid and modify /etc/fstab.
UUID=455cae52-89e0-4fb3-a896-8f597a1ea402 /ppro xfs noatime,logbufs=8,logbsize=256k 1 2
Mount the new filesystem.
mount /ppro/
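The blkid and fstab step can be scripted if you do this often. A small sketch, where fstab_line is a hypothetical helper and the dump/pass fields are hard-coded to the "1 2" used in the entry above:

```shell
# Build an fstab entry from a filesystem UUID.
# $1 = UUID, $2 = mount point, $3 = filesystem type, $4 = mount options
fstab_line() {
    printf 'UUID=%s %s %s %s 1 2\n' "$1" "$2" "$3" "$4"
}

# Example (not run here): read the UUID straight from the new partition and
# print the line for review before appending it to /etc/fstab by hand.
#   fstab_line "$(blkid -s UUID -o value /dev/zd0p1)" /ppro xfs noatime,logbufs=8,logbsize=256k
```

Printing the line rather than appending it automatically is deliberate; a bad fstab entry can stop a machine from booting cleanly.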
Results...
[root@Testa ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sde2        20G  8.9G  9.9G  48% /
tmpfs            32G     0   32G   0% /dev/shm
/dev/sde1       485M   63M  397M  14% /boot
/dev/sde7       2.0G   68M  1.9G   4% /tmp
/dev/sde3        12G  2.6G  8.7G  24% /usr
/dev/sde6       6.0G  907M  4.8G  16% /var
/dev/zd0p1      800G  398G  403G  50% /ppro  <-- Compressed ZFS-backed XFS filesystem.
vol0            110G  256K  110G   1% /vol0
ZFS filesystem listing.
[root@Testa ~]# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
vol0           328G   109G   272K  /vol0
vol0/pprovol   326G   109G   186G  -   <-- The actual zvol providing the backing for XFS.
vol1           183G   817G   136K  /vol1
vol1/images    183G   817G   183G  /images
ZFS zpool list.
[root@Testa ~]# zpool list -v
NAME                            SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
vol0                            444G   328G   116G    73%  1.00x  ONLINE  -
  mirror                        222G   164G  58.1G      -
    scsi-AccOW140403AS1322043      -      -      -      -
    scsi-AccOW140403AS1322042      -      -      -      -
  mirror                        222G   164G  58.1G      -
    scsi-AccOW140403AS1322013      -      -      -      -
    scsi-AccOW140403AS1322044      -      -      -      -
ZFS zvol properties (take note of referenced, compressratio and volsize).
[root@Testa ~]# zfs get all vol0/pprovol
NAME          PROPERTY              VALUE                  SOURCE
vol0/pprovol  type                  volume                 -
vol0/pprovol  creation              Sun May 11 15:27 2014  -
vol0/pprovol  used                  326G                   -
vol0/pprovol  available             109G                   -
vol0/pprovol  referenced            186G                   -
vol0/pprovol  compressratio         2.99x                  -
vol0/pprovol  reservation           none                   default
vol0/pprovol  volsize               800G                   local
vol0/pprovol  volblocksize          128K                   -
vol0/pprovol  checksum              on                     default
vol0/pprovol  compression           lz4                    inherited from vol0
vol0/pprovol  readonly              off                    default
vol0/pprovol  copies                1                      default
vol0/pprovol  refreservation        none                   default
vol0/pprovol  primarycache          all                    default
vol0/pprovol  secondarycache        all                    default
vol0/pprovol  usedbysnapshots       140G                   -
vol0/pprovol  usedbydataset         186G                   -
vol0/pprovol  usedbychildren        0                      -
vol0/pprovol  usedbyrefreservation  0                      -
vol0/pprovol  logbias               latency                default
vol0/pprovol  dedup                 off                    default
vol0/pprovol  mlslabel              none                   default
vol0/pprovol  sync                  standard               default
vol0/pprovol  refcompressratio      3.32x                  -
vol0/pprovol  written               210M                   -
vol0/pprovol  snapdev               hidden                 default
Best Answer
Yes, a functional checksummed filesystem is a very good thing. However, the real motivation is not the mythical "bitrot" which, while it does happen, is very rare. Rather, the main advantage is that such a filesystem provides an end-to-end data checksum, actively protecting you from erroneous disk behavior such as misdirected writes and data corruption caused by the disk's own private DRAM cache failing and/or misbehaving due to power supply problems.
I experienced that issue firsthand, when a Linux RAID 1 array went bad due to a power supply issue. The cache of one disk started corrupting data and the ECC embedded in the disk sectors themselves did not catch anything, simply because the written data were already corrupted and the ECC was calculated on the corrupted data themselves.
Thanks to its checksummed journal, which detected something strange and suspended the filesystem, XFS limited the damage; however, some files/directories were irreparably corrupted. As this was a backup machine facing no immediate downtime pressure, I rebuilt it with ZFS. When the problem recurred, during the first scrub ZFS corrected the affected blocks by reading the good copies from the other disks. Result: no data loss and no downtime. These are two very good reasons to use a checksumming filesystem.
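The idea behind that repair is simple enough to show with ordinary files: keep a checksum apart from the data, verify every copy against it, and rewrite a bad copy from a good one. The following is only a toy illustration using sha256sum; it is not how ZFS is implemented, and the function name and file layout are invented for the example.

```shell
# Toy model of a checksummed mirror: two copies of a block plus a checksum
# stored separately from both.
# scrub_block <copy_a> <copy_b> <checksum_file>
# Verifies each copy against the stored SHA-256 and overwrites a bad copy
# from a good one. Returns 1 if neither copy matches.
scrub_block() {
    want=$(cat "$3")
    sum_a=$(sha256sum < "$1" | cut -d' ' -f1)
    sum_b=$(sha256sum < "$2" | cut -d' ' -f1)
    if [ "$sum_a" = "$want" ] && [ "$sum_b" = "$want" ]; then
        echo "both copies good"
    elif [ "$sum_a" = "$want" ]; then
        cp "$1" "$2"
        echo "repaired copy b from copy a"
    elif [ "$sum_b" = "$want" ]; then
        cp "$2" "$1"
        echo "repaired copy a from copy b"
    else
        echo "unrecoverable: no copy matches the checksum"
        return 1
    fi
}
```

A plain mirror has no such third source of truth: it can notice that two copies differ, but it cannot tell which one is right.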
It's worth noting that data checksumming is so valuable that a device mapper target to provide it (by emulating the T-10 DIF/DIX specs), called dm-integrity, was developed precisely to extend this protection to classical block devices (especially redundant ones such as RAID1/5/6). By virtue of the Stratis project, it is going to be integrated into a comprehensive management CLI/API.
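To give an idea of what a dm-integrity protected mirror looks like in practice, here is a sketch that only prints the commands involved, since formatting destroys whatever is on the devices. The function name and the integ0/integ1 mapping names are hypothetical, and the default integritysetup options are assumed to be acceptable.

```shell
# Print (not run) the commands for a RAID1 array on two dm-integrity devices.
# $1, $2 = member block devices; $3 = md device to create
integrity_raid1_plan() {
    n=0
    members=""
    for dev in "$1" "$2"; do
        name="integ$n"                          # hypothetical mapping name
        echo "integritysetup format $dev"       # writes integrity metadata (destructive)
        echo "integritysetup open $dev $name"   # exposes /dev/mapper/$name
        members="$members /dev/mapper/$name"
        n=$((n + 1))
    done
    echo "mdadm --create $3 --level=1 --raid-devices=2$members"
}
```

Running `integrity_raid1_plan /dev/sdx /dev/sdy /dev/md0` prints five lines to review and run by hand. With this layout, a checksum failure on one member surfaces to md as a read error, which it can then satisfy from the other mirror.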
However, you have a point that any potential advantage brought by such filesystems should be weighed against the disadvantages they carry. ZFS's main problem is that it is not mainlined into the standard kernel, but otherwise it is very fast and stable. On the other hand, BTRFS, while mainlined, has many important issues and performance problems (the common suggestion for databases or VMs is to disable CoW which, in turn, disables checksumming - which is, frankly, not an acceptable answer). Rather than using BTRFS, I would use XFS and hope for the best, or use dm-integrity protected devices.