Write Hole – Which RAID Levels Are Affected?

mdadm raid raidz software-raid storage

In my journey to understand the advantages of RAIDZ, I came across the concept of the write hole.

As this page explains, a write hole is the inconsistency you get among the disks of an array when power is lost during a write. That page also explains that it affects both RAID-5/6 (if power is lost after the data has been written, but before the parity has been calculated) and RAID-1 (data is written to one disk but not the others), and that it is an insidious problem that can only be detected during either a resync/scrub or (disastrously) the reconstruction of one of the disks. However, most other sources talk about it as if it only affected parity-based RAID levels.

From what I understand, I think this could be a problem for RAID-1 too, as reads from the disks containing the hole would return garbage. So is it a problem for every RAID level or not? Is it implementation-dependent? Does it affect software RAID only, or also hardware controllers? (Extra: how does mdadm fare in this regard?)

Best Answer

The term write hole is used to describe two similar, but different, problems that arise with non-battery-protected RAID arrays:

  • sometimes it is improperly defined as any corruption in a RAID array due to sudden power loss. With this (erroneous) definition, RAID1 is vulnerable to write hole because you cannot atomically write to two different disks;

  • the proper definition of write hole, which is the loss of an entire stripe's data redundancy due to a sudden power loss during a stripe update, is only applicable to parity-based RAID.

The second, and correct, definition of write hole needs some more explanation: let's assume a 3-disk RAID5 with a 64K chunk size and a 128K stripe size (plus 64K of parity for each stripe). If power is lost after writing 4K to disk #1 but during the parity update on disk #3, we can end up with a bogus (i.e. corrupted) parity chunk and an undetected data consistency issue. If disk #2 later dies and parity is used to recover its data by XORing disk #1 and disk #3, the reconstructed 64K, which originally resided on disk #2 and was not recently written, will nonetheless be corrupted.

This is a contrived example, but it should expose the main problem related to write hole: the loss of untouched, at-rest, unrelated data sharing the same stripe with the latest, interrupted writes. In other words, if fileA was written years ago but shares the same stripe with the just-written fileB and the system loses power during the fileB update, fileA will be at risk.
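
The scenario above can be reproduced with a few lines of Python. This is only a toy sketch: chunks are shrunk to 8 bytes, the names (`disk1`, `file_a`, ...) are made up for illustration, and the interrupted parity update is modelled as parity that simply stayed stale, which is a simplifying assumption (a real parity chunk could also be half-written).

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))

# Consistent stripe on a 3-disk RAID5: two data chunks plus one parity chunk.
file_a = b"fileA..."          # data at rest, never touched again
disk1 = b"AAAAAAAA"           # chunk that is about to be overwritten (fileB)
disk2 = file_a
disk3 = xor(disk1, disk2)     # parity

# Interrupted update: the new data reaches disk1, but power is lost before
# the matching parity reaches disk3 (assumption: disk3 keeps the old parity).
disk1 = b"BBBBAAAA"

# Later disk2 dies and its content is rebuilt from the two surviving disks.
rebuilt_disk2 = xor(disk1, disk3)

print(rebuilt_disk2 == file_a)   # False: fileA was never rewritten, yet it is damaged
```

Only the bytes that sit at the same offset as the interrupted write come back wrong, but they belong to a file that nobody was writing to at the time of the power loss.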

Another thing to consider is the write policy of the array: using read/reconstruct/write (i.e. entire stripes are rewritten when a partial write happens) versus read/modify/write (i.e. only the affected chunk and parity are updated) exposes the array to different kinds of write hole.
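
As a rough illustration of the two policies, the sketch below shows only how the new parity is derived in each case, not what actually gets written to disk. The function names `parity_rmw` and `parity_rcw` are hypothetical, and the stripe is a toy one with 2-byte chunks.

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))

def parity_rmw(old_parity: bytes, old_chunk: bytes, new_chunk: bytes) -> bytes:
    """Read/modify/write: patch the old parity with the delta of one chunk."""
    return xor(xor(old_parity, old_chunk), new_chunk)

def parity_rcw(data_chunks: list) -> bytes:
    """Read/reconstruct/write: recompute parity from every data chunk."""
    return reduce(xor, data_chunks)

stripe = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b"]
good_parity = parity_rcw(stripe)
stale_parity = b"\x00\x00"            # pretend an earlier crash left this behind
new_chunk = b"\xff\x00"

rmw_from_good = parity_rmw(good_parity, stripe[0], new_chunk)
rmw_from_stale = parity_rmw(stale_parity, stripe[0], new_chunk)
stripe[0] = new_chunk
rcw = parity_rcw(stripe)

print(rmw_from_good == rcw)    # True: both policies agree on a consistent stripe
print(rmw_from_stale == rcw)   # False: the parity patch carries the old error forward
```

On a consistent stripe the two computations give the same result. They differ only when the stored parity was already wrong, because patching the old parity preserves that error while recomputing from the data chunks does not.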

From the above, it should be clear why RAID0 and RAID1 do not suffer from a proper write hole: they have no parity which can be "out-of-sync", invalidating an entire stripe. Please note that RAID1 mirror legs can be out-of-sync after an unclean shutdown, but the only corruption will be of the latest written data. Previously written data (i.e. data at rest) will not face any trouble.
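
For contrast, here is the same kind of toy model for a two-way mirror. The block contents and the names `leg_a` and `leg_b` are invented for illustration, and picking leg A as the resync source is an assumption of this sketch.

```python
# Two mirror legs holding identical blocks before the crash.
leg_a = [b"old0", b"old1", b"old2"]
leg_b = list(leg_a)

# Interrupted write: the new block 2 reaches leg A only.
leg_a[2] = b"NEW2"

# After the unclean shutdown, only the block being written differs.
out_of_sync = [i for i, (a, b) in enumerate(zip(leg_a, leg_b)) if a != b]
print(out_of_sync)   # [2]

# Resync: copy the differing blocks from one leg to the other.
for i in out_of_sync:
    leg_b[i] = leg_a[i]
```

Blocks 0 and 1, the data at rest, are identical on both legs throughout, so no read of them can return anything other than what was originally written.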

Having defined and scoped a proper write hole, how can it be avoided?

  • HW RAID uses a non-volatile write cache (i.e. BBU+DRAM or a capacitor-backed flash module) to persistently store the to-be-written updates. If power is lost, the HW RAID card will re-issue any pending operations, flushing its cache to the disk platters, when power is restored and the system boots up. This protects not only from a proper write hole, but also from corruption of the last-written data;

  • Linux MD RAID uses a write-intent bitmap which records the to-be-written stripes before updating them. If power is lost, the dirty bitmap is used to recalculate the parity data for the affected stripes. This protects from a real write hole only; the latest written data can still be corrupted (unless backed by a fsync()+write barrier). The same method is used to re-sync the out-of-sync portions of a RAID1 array (to be sure the two mirror legs are in-sync, although no write hole exists for mirrors);

  • newer Linux MD RAID5/6 should have the option to use a logging/journal device, partly simulating the non-volatile writeback cache of a proper HW RAID card (and, depending on the specific patch/implementation, protecting from both write hole and last-written data corruption, or from write hole only);

  • finally, RAIDZ avoids both write hole and last-written data corruption using the most "elegant", but performance-impacting, method: by only writing full-sized stripes (and journaling any synchronous writes in the ZIL/SLOG).
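
The bitmap approach described for Linux MD RAID can be sketched as follows. `ToyArray` is a hypothetical model, not MD's actual data structures: it assumes one bitmap bit per stripe and clears the bit immediately after the parity write, whereas a real bitmap typically covers larger regions and is managed differently.

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))

class ToyArray:
    """Toy parity array with a per-stripe 'dirty' set standing in for the bitmap."""

    def __init__(self, n_stripes: int, n_data: int, chunk_size: int = 4):
        self.data = [[bytes(chunk_size)] * n_data for _ in range(n_stripes)]
        self.parity = [bytes(chunk_size) for _ in range(n_stripes)]
        self.dirty = set()   # assumed to live on stable storage

    def write(self, stripe, disk, chunk, crash_before_parity=False):
        self.dirty.add(stripe)                  # 1. record the intent first
        row = list(self.data[stripe])
        row[disk] = chunk
        self.data[stripe] = row                 # 2. write the data chunk
        if crash_before_parity:
            return                              # power lost: parity is stale, bit stays set
        self.parity[stripe] = reduce(xor, row)  # 3. write the parity chunk
        self.dirty.discard(stripe)              # 4. stripe is clean again

    def consistent(self, stripe) -> bool:
        return self.parity[stripe] == reduce(xor, self.data[stripe])

    def resync(self):
        """After an unclean shutdown, recompute parity for dirty stripes only."""
        for stripe in sorted(self.dirty):
            self.parity[stripe] = reduce(xor, self.data[stripe])
        self.dirty.clear()

arr = ToyArray(n_stripes=4, n_data=2)
arr.write(1, 0, b"abcd")                            # normal write
arr.write(2, 0, b"wxyz", crash_before_parity=True)  # interrupted write
print(arr.consistent(2), arr.dirty)                 # False {2}
arr.resync()
print(arr.consistent(2), arr.dirty)                 # True set()
```

The resync makes parity agree with whatever data reached the disks, which restores redundancy for the stripe. It does not bring back the interrupted write itself, matching the point that the latest written data can still be lost.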

Useful links:
https://neil.brown.name/blog/20110614101708
https://www.kernel.org/doc/Documentation/md/raid5-ppl.txt
https://www.kernel.org/doc/Documentation/md/raid5-cache.txt
https://lwn.net/Articles/665299/