Write Hole – Which RAID Levels Are Affected?

mdadm raid raidz software-raid storage

In my journey to understand the advantages of RAIDZ, I came across the concept of the write hole.

As this page explains, a write hole is the inconsistency you get among the disks of an array when power is lost during a write. That page also explains that it affects both RAID-5/6 (if power is lost after the data has been written, but before the parity has been calculated) and RAID-1 (data is written to one disk but not the others), and that it is an insidious problem that can only be detected during a resync/scrub or (disastrously) during the reconstruction of one of the disks. However, most other sources talk about it as if it only affected parity-based RAID levels.

From what I understand, this could be a problem for RAID-1 too, as reads from the disks containing the hole would return garbage. So is it a problem for every RAID level or not? Is it implementation-dependent? Does it affect software RAID only, or also hardware controllers? (Extra: how does mdadm fare in this regard?)

Best Answer

The term write hole is used to describe two similar, but different, problems arising when dealing with non-battery-protected RAID arrays:

sometimes it is improperly defined as any corruption in a RAID array due to sudden power loss. With this (erroneous) definition, RAID1 is vulnerable to write hole because you cannot atomically write to two different disks;
the proper definition of write hole, which is the loss of an entire stripe's data redundancy due to a sudden power loss during a stripe update, is only applicable to parity-based RAID.

The second, and correct, definition of write hole needs some more explanation. Let's assume a 3-disk RAID5 with 64K chunk size and 128K stripe size (plus 64K of parity for each stripe). If power is lost after writing 4K to disk #1 but during the parity update on disk #3, we can have a bogus (i.e. corrupted) parity chunk and an undetected data consistency issue. If, later, disk #2 dies and parity is used to recover the original data by XORing disk #1 and disk #3, the reconstructed 64K, originally residing on disk #2 and not recently written, will nonetheless be corrupted.

This is a contrived example, but it should expose the main problem related to the write hole: the loss of untouched, at-rest, unrelated data sharing the same stripe with the latest, interrupted writes. In other words, if fileA was written years ago but shares the same stripe with the just-written fileB and the system loses power during the fileB update, fileA will be at risk.
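
The scenario above can be reproduced with a toy model. The following is only a sketch: it uses 8-byte chunks instead of 64K ones, keeps the three "disks" as byte strings in memory, and the function name simulate_write_hole is made up for illustration.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def simulate_write_hole(parity_updated: bool) -> bytes:
    """Rewrite disk #1, optionally lose power before the parity update,
    then rebuild disk #2 from the two surviving disks."""
    disk1 = b"AAAAAAAA"         # chunk that is about to be rewritten
    disk2 = b"fileA..."         # old, at-rest data; never written below
    disk3 = xor(disk1, disk2)   # parity, consistent with both data chunks

    disk1 = b"BBBBBBBB"         # the new data reaches disk #1
    if parity_updated:
        disk3 = xor(disk1, disk2)   # clean case: parity follows the data
    # otherwise power was lost here and disk #3 still holds the old parity

    # Later, disk #2 dies and is reconstructed from disk #1 and the parity.
    return xor(disk1, disk3)


print(simulate_write_hole(parity_updated=True) == b"fileA...")   # clean update
print(simulate_write_hole(parity_updated=False) == b"fileA...")  # interrupted update
```

In the interrupted case the rebuilt chunk no longer matches what was stored on disk #2, even though disk #2 itself was never written during the failed update.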

Another thing to consider is the write policy of the array: using read/reconstruct/write (i.e. entire stripes are rewritten when a partial write happens) versus read/modify/write (i.e. only the affected chunk and parity are updated) exposes the array to different kinds of write hole.

From the above, it should be clear why RAID0 and RAID1 do not suffer from a proper write hole: they have no parity which can be "out-of-sync", invalidating an entire stripe. Please note that RAID1 mirror legs can be out-of-sync after an unclean shutdown, but the only corruption will be of the latest written data. Previously written data (i.e. data at rest) will not face any trouble.

Having defined and scoped a proper write hole, how can it be avoided?
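
As a rough illustration of the two policies, the sketch below computes the new parity both ways for a single-chunk update. The function names are hypothetical and the chunks are tiny in-memory byte strings; it only shows that both policies arrive at the same parity while touching different chunks, which is why an interruption leaves different things stale.

```python
from functools import reduce


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_reconstruct_write(data_chunks: list[bytes]) -> bytes:
    """Recompute parity from every data chunk of the stripe."""
    return reduce(xor, data_chunks)


def parity_read_modify_write(old_parity: bytes, old_chunk: bytes,
                             new_chunk: bytes) -> bytes:
    """Update parity using only the old parity and the chunk being replaced."""
    return xor(xor(old_parity, old_chunk), new_chunk)


stripe = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
old_parity = parity_reconstruct_write(stripe)

new_chunk = b"ZZZZ"
updated = [stripe[0], new_chunk, stripe[2], stripe[3]]

# Both policies yield the same parity for the updated stripe.
print(parity_reconstruct_write(updated)
      == parity_read_modify_write(old_parity, stripe[1], new_chunk))
```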

HW RAID uses a non-volatile write cache (i.e. BBU+DRAM or a capacitor-backed flash module) to persistently store the to-be-written updates. If power is lost, the HW RAID card will re-issue any pending operation, flushing its cache to the disk platters, when power is restored and the system boots up. This protects not only from a proper write hole, but from last-written data corruption also;
Linux MD RAID uses a write bitmap which records the to-be-written stripes before updating them. If power is lost, the dirty bitmap is used to recalculate any parity data for the affected stripes. This protects from the real write hole only; the latest written data can be corrupted (unless backed by a fsync()+write barrier). The same method is used to re-sync the out-of-sync portions of a RAID1 array (to be sure the two mirror legs are in-sync, albeit no write hole exists for mirrors);
newer Linux MD RAID5/6 should have the option to use a logging/journal device, partly simulating the non-volatile writeback cache of a proper HW RAID card (and, depending on the specific patch/implementation, protecting from both write hole and last-written data corruption or from write hole only);
finally, RAIDZ avoids both write hole and last-data corruption using the most "elegant", but performance-impacting, method: by only writing full-sized stripes (and journaling any synchronized write in the ZIL/SLOG).
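
To make the write-bitmap idea more concrete, here is a toy model of it. It is a sketch under heavy assumptions, not how MD is actually implemented: the class ToyArray and its methods are invented for illustration, the bitmap is a plain Python set with one entry per stripe, and bits are cleared immediately after each completed write.

```python
from functools import reduce


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


class ToyArray:
    """Toy parity array with a write-intent bitmap (hypothetical model)."""

    def __init__(self, stripes):
        self.data = [list(s) for s in stripes]           # data chunks per stripe
        self.parity = [reduce(xor, s) for s in self.data]
        self.dirty = set()                               # stands in for the bitmap

    def write(self, stripe, chunk, payload, crash_before_parity=False):
        self.dirty.add(stripe)                # 1. record the intent first
        self.data[stripe][chunk] = payload    # 2. update the data chunk
        if crash_before_parity:
            return                            # power lost: parity stale, bit still set
        self.parity[stripe] = reduce(xor, self.data[stripe])  # 3. update parity
        self.dirty.discard(stripe)            # 4. clear the bit

    def recover(self):
        """After an unclean shutdown, recompute parity for dirty stripes only."""
        for stripe in sorted(self.dirty):
            self.parity[stripe] = reduce(xor, self.data[stripe])
        self.dirty.clear()

    def consistent(self, stripe):
        return self.parity[stripe] == reduce(xor, self.data[stripe])


arr = ToyArray([[b"aa", b"bb"], [b"cc", b"dd"]])
arr.write(0, 0, b"zz", crash_before_parity=True)
print(arr.consistent(0))   # parity is stale after the simulated crash
arr.recover()
print(arr.consistent(0))   # parity matches the data again
```

Note that recovery here only makes the parity agree with whatever data reached the disks; it does not bring back the interrupted write itself, which matches the "real write hole only" caveat above.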

Useful links:
https://neil.brown.name/blog/20110614101708
https://www.kernel.org/doc/Documentation/md/raid5-ppl.txt
https://www.kernel.org/doc/Documentation/md/raid5-cache.txt
https://lwn.net/Articles/665299/

RAID: Why and When

RAID stands for Redundant Array of Independent Disks (some are taught "Inexpensive" to indicate that they are "normal" disks; historically there were internally redundant disks which were very expensive; since those are no longer available the acronym has adapted).

At the most general level, a RAID is a group of disks that act on the same reads and writes. SCSI IO is performed on a volume ("LUN"), and these are distributed to the underlying disks in a way that introduces a performance increase and/or a redundancy increase. The performance increase is a function of striping: data is spread across multiple disks to allow reads and writes to use all the disks' IO queues simultaneously. Redundancy is a function of mirroring. Entire disks can be kept as copies, or individual stripes can be written multiple times. Alternatively, in some types of raid, instead of copying data bit for bit, redundancy is gained by creating special stripes that contain parity information, which can be used to recreate any lost data in the event of a hardware failure.

There are several configurations that provide different levels of these benefits, which are covered here, and each one has a bias toward performance, or redundancy.

Which RAID level will work for you depends on its advantages and its hardware requirements (e.g. the number of drives).

Another important aspect of most of these types of RAID (0,1,5) is that they do not ensure the integrity of your data, because they are abstracted away from the actual data being stored. So RAID does not protect against corrupted files. If a file is corrupted by any means, the corruption will be mirrored or paritied and committed to the disk regardless. However, RAID-Z does claim to provide file-level integrity of your data.

Direct attached RAID: Software and Hardware

There are two layers at which RAID can be implemented on direct attached storage: hardware and software. In true hardware RAID solutions, there is a dedicated hardware controller with a processor dedicated to RAID calculations and processing. It also typically has a battery-backed cache module so that data can be written to disk, even after a power failure. This helps to eliminate inconsistencies when systems are not shut down cleanly. Generally speaking, good hardware controllers are better performers than their software counterparts, but they also have a substantial cost and increase complexity.

Software RAID typically does not require a controller, since it doesn't use a dedicated RAID processor or a separate cache. Typically these operations are handled directly by the CPU. In modern systems, these calculations consume minimal resources, though some marginal latency is incurred. RAID is handled by either the OS directly, or by a faux controller in the case of FakeRAID.

Generally speaking, if someone is going to choose software RAID, they should avoid FakeRAID and use the OS-native package for their system such as Dynamic Disks in Windows, mdadm/LVM in Linux, or ZFS in Solaris, FreeBSD, and other related distributions. FakeRAID uses a combination of hardware and software which results in the initial appearance of hardware RAID, but the actual performance of software RAID. Additionally, it is often extremely difficult to move the array to another adapter (should the original fail).

Centralized Storage

The other place RAID is common is on centralized storage devices, usually called a SAN (Storage Area Network) or a NAS (Network Attached Storage). These devices manage their own storage and allow attached servers to access the storage in various fashions. Since multiple workloads are contained on the same few disks, having a high level of redundancy is generally desirable.

The main difference between a NAS and a SAN is block vs. file system level exports. A SAN exports a whole "block device" such as a partition or logical volume (including those built on top of a RAID array). Examples of SANs include Fibre Channel and iSCSI. A NAS exports a "file system" such as a file or folder. Examples of NASs include CIFS/SMB (Windows file sharing) and NFS.

RAID 0

Good when: Speed at all costs!

Bad when: You care about your data

RAID0 (aka Striping) is sometimes referred to as "the amount of data you will have left when a drive fails". It really runs against the grain of "RAID", where the "R" stands for "Redundant".

RAID0 takes your block of data, splits it up into as many pieces as you have disks (2 disks → 2 pieces, 3 disks → 3 pieces) and then writes each piece of the data to a separate disk.
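
A minimal sketch of that round-robin placement, assuming fixed-size chunks numbered from zero (the function name raid0_locate is hypothetical and not taken from any real implementation):

```python
def raid0_locate(logical_chunk: int, num_disks: int) -> tuple[int, int]:
    """Return (disk index, chunk offset on that disk) for a logical chunk."""
    return logical_chunk % num_disks, logical_chunk // num_disks


# Consecutive logical chunks land on consecutive disks of a 3-disk array.
for chunk in range(6):
    print(chunk, raid0_locate(chunk, num_disks=3))
```

Because every disk holds a slice of every large piece of data, losing any one disk leaves gaps throughout the whole array.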

This means that a single disk failure destroys the entire array (because you have Part 1 and Part 2, but no Part 3), but it provides very fast disk access.

It is not often used in production environments, but it could be used in a situation where you have strictly temporary data that can be lost without repercussions. It is used somewhat commonly for caching devices (such as an L2ARC device).

The total usable disk space is the sum of all the disks in the array added together (e.g. 3x 1TB disks = 3TB of space).

RAID 1

Good when: You have limited number of disks but need redundancy

Bad when: You need a lot of storage space

RAID 1 (aka Mirroring) takes your data and duplicates it identically on two or more disks (although typically only 2 disks). If more than two disks are used, the same information is stored on each disk (they're all identical). It is the only way to ensure data redundancy when you have fewer than three disks.

RAID 1 sometimes improves read performance. Some implementations of RAID 1 will read from both disks to double the read speed. Some will only read from one of the disks, which does not provide any additional speed advantages. Others will read the same data from both disks, ensuring the array's integrity on every read, but this will result in the same read speed as a single disk.

It is typically used in small servers that have very little disk expansion, such as 1RU servers that may only have space for two disks or in workstations that require redundancy. Because of its high overhead of "lost" space, it can be cost prohibitive with small-capacity, high-speed (and high-cost) drives, as you need to spend twice as much money to get the same level of usable storage.

The total usable disk space is the size of the smallest disk in the array (e.g. 2x 1TB disks = 1TB of space).

RAID 1E

The 1E RAID level is similar to RAID 1 in that data is always written to (at least) two disks. But unlike RAID1, it allows for an odd number of disks by simply interleaving the data blocks among several disks.

Performance characteristics are similar to RAID 1, and fault tolerance is similar to RAID 10. This scheme can be extended to odd numbers of disks greater than three (possibly called RAID 10E, though rarely).

RAID 10

Good when: You want speed and redundancy

Bad when: You can't afford to lose half your disk space

RAID 10 is a combination of RAID 1 and RAID 0. The order of the 1 and 0 is very important. With 8 disks, for example, it creates 4 RAID 1 arrays and then applies a RAID 0 array on top of those 4 RAID 1 arrays. It requires at least 4 disks, and additional disks have to be added in pairs.

This means that one disk from each pair can fail. So if you have sets A, B, C and D with disks A1, A2, B1, B2, C1, C2, D1, D2, you can lose one disk from each set (A, B, C or D) and still have a functioning array.

However, if you lose two disks from the same set, then the array is totally lost. You can lose up to 50% of the disks, but this is not guaranteed: it only holds if no two failed disks belong to the same set.

You are guaranteed high speed and high availability in RAID 10.

RAID 10 is a very common RAID level, especially with high capacity drives where a single disk failure makes a second disk failure more likely before the RAID array is rebuilt. During recovery, the performance degradation is much lower than its RAID 5 counterpart as it only has to read from one drive to reconstruct the data.

The available disk space is 50% of the sum of the total space. (e.g. 8x 1TB drives = 4TB of usable space). If you use different sizes, only the smallest size will be used from each disk.

It is worth noting that the Linux kernel's software RAID driver, called md, allows for RAID 10 configurations with an odd number of drives, i.e. a 3 or 5 disk RAID 10.

RAID 01

Good when: never

Bad when: always

It is the reverse of RAID 10. It creates two RAID 0 arrays, and then puts a RAID 1 over the top. This means that you can only lose disks from one of the two sets (A1, A2, A3, A4 or B1, B2, B3, B4); once each set has lost a disk, the array is gone. It's very rare to see in commercial applications, but is possible to do with software RAID.

To be absolutely clear:

If you have a RAID10 array with 8 disks and one dies (we'll call it A1) then you'll have 6 redundant disks and 1 without redundancy. If another disk dies there's about an 86% (6 in 7) chance your array is still working.
If you have a RAID01 array with 8 disks and one dies (we'll call it A1) then you'll have 3 disks whose loss no longer matters (the rest of the already broken set A) and 4 without redundancy. If another disk dies there's about a 43% (3 in 7) chance your array is still working.
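
These odds can be checked by brute force. The sketch below assumes the 8-disk layouts described above (mirror pairs 0-1, 2-3, 4-5, 6-7 for RAID 10; stripe sets 0-3 and 4-7 for RAID 01), assumes every surviving disk is equally likely to fail next, and uses made-up function names.

```python
from fractions import Fraction


def raid10_survives(failed: set[int]) -> bool:
    """RAID 10 works as long as no mirror pair has lost both members."""
    return all(not {a, a + 1} <= failed for a in range(0, 8, 2))


def raid01_survives(failed: set[int]) -> bool:
    """RAID 01 works as long as at least one stripe set is fully intact."""
    set_a, set_b = set(range(0, 4)), set(range(4, 8))
    return not (failed & set_a) or not (failed & set_b)


def second_failure_survival(survives, disks: int = 8, first: int = 0) -> Fraction:
    """Chance the array survives a second, uniformly random disk failure."""
    others = [d for d in range(disks) if d != first]
    ok = sum(survives({first, d}) for d in others)
    return Fraction(ok, len(others))


print(second_failure_survival(raid10_survives))  # 6/7
print(second_failure_survival(raid01_survives))  # 3/7
```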

It provides no additional speed over RAID 10, but substantially less redundancy and should be avoided at all costs.

RAID 5

Good when: You want a balance of redundancy and disk space or have a mostly random read workload

Bad when: You have a high random write workload or large drives

RAID 5 has been the most commonly-used RAID level for decades. It provides the system performance of all the drives in the array (except for small random writes, which incur a slight overhead). It uses a simple XOR operation to calculate parity. Upon single drive failure, the information can be reconstructed from the remaining drives using the XOR operation on the known data.

Unfortunately, in the event of a drive failure, the rebuilding process is very IO-intensive. The larger the drives in the RAID, the longer the rebuild will take, and the higher the chance for a second drive failure. Since large slow drives both have a lot more data to rebuild and a lot less performance to do it with, it is not usually recommended to use RAID 5 with anything 7200 RPM or lower.

Perhaps the most critical issue with RAID 5 arrays, when used in consumer applications, is that they are almost guaranteed to fail when the total capacity exceeds 12TB. This is because the unrecoverable read error (URE) rate of SATA consumer drives is one per every 10¹⁴ bits, or ~12.5TB.

If we take an example of a RAID 5 array with seven 2 TB drives: when a drive fails there are six drives left. In order to rebuild the array the controller needs to read through six drives at 2 TB each. Looking at the figure above, it is quite likely that a URE will occur before the rebuild has finished. Once that happens the array and all data on it are lost.
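
A back-of-the-envelope estimate of that risk can be sketched as follows. It assumes bit errors are independent and occur exactly at the rated spec, and that 1 TB is 10^12 bytes; real drives do not behave this neatly, so treat the result as a rough indication only. The function name is hypothetical.

```python
import math


def p_ure_during_rebuild(bytes_to_read: float, ure_per_bit: float = 1e-14) -> float:
    """Probability of at least one unrecoverable read error while reading
    bytes_to_read bytes, under an independent-error assumption."""
    bits = bytes_to_read * 8
    # 1 - (1 - rate)^bits, written with log1p/expm1 to stay accurate for tiny rates
    return -math.expm1(bits * math.log1p(-ure_per_bit))


six_drives = 6 * 2e12  # six surviving 2 TB drives must be read in full
print(f"{p_ure_during_rebuild(six_drives):.0%}")         # at 1 per 10^14 bits
print(f"{p_ure_during_rebuild(six_drives, 1e-15):.0%}")  # at 1 per 10^15 bits
```

Under these assumptions the seven-drive example comes out at roughly 60% for a 10^14 rating and roughly 9% for a 10^15 rating, so the rated URE figure makes a large difference.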

http://www.zdnet.com/article/why-raid-5-stops-working-in-2009

However the URE/data loss/array failure with RAID 5 issue in consumer drives has been somewhat mitigated by the fact that most hard disk manufacturers have increased their newer drives' URE ratings to one in 10¹⁵ bits. As always, check the specification sheet before buying!

https://www.zdnet.com/article/why-raid-5-still-works-usually/

It is also imperative that RAID 5 be put behind a reliable (battery-backed) write cache. This avoids the overhead for small writes, as well as flaky behaviour that can occur upon a failure in the middle of a write.

RAID 5 is the most cost-effective way of adding redundant storage to an array, as it gives up the capacity of only 1 disk (e.g. 12x 146GB disks = 1606GB of usable space). It requires a minimum of 3 disks.

RAID 6

Good when: You want to use RAID 5, but your disks are too large or slow

Bad when: You have a high random write workload

RAID 6 is similar to RAID 5 but it uses two disks' worth of parity instead of just one (the first is XOR, the second is an LFSR), so you can lose two disks from the array with no data loss. The write penalty is higher than RAID 5 and you have one disk less of usable space.

It is worth considering that eventually a RAID 6 array will encounter similar problems as a RAID 5. Larger drives cause larger rebuild times and more latent errors, eventually leading to a failure of the entire array and loss of all data before a rebuild has completed.

RAID 50

Good when: You have a lot of disks that need to be in a single array and RAID 10 isn't an option because of capacity

Bad when: You have so many disks that many simultaneous failures are possible before rebuilds complete, or when you don't have many disks

RAID 50 is a nested level, much like RAID 10. It combines two or more RAID 5 arrays and stripes data across them in a RAID 0. This offers both performance and multiple disk redundancy, as long as multiple disks are lost from different RAID 5 arrays.

In a RAID 50, disk capacity is n-x, where x is the number of RAID 5s that are striped across. For example, in a simple 6-disk RAID 50 (the smallest possible), 6x1TB disks arranged as two RAID 5s that are then striped across give you 4TB of usable storage.

RAID 60

Good when: You have a similar use case to RAID 50, but need more redundancy

Bad when: You don't have a substantial number of disks in the array

RAID 6 is to RAID 60 as RAID 5 is to RAID 50. Essentially, you have more than one RAID 6 that data is then striped across in a RAID 0. This setup allows for up to two members of any individual RAID 6 in the set to fail without data loss. Rebuild times for RAID 60 arrays can be substantial, so it's usually a good idea to have one hot-spare for each RAID 6 member in the array.

In a RAID 60, disk capacity is n-2x, where x is the number of RAID 6s that are striped across. For example, in a simple 8-disk RAID 60 (the smallest possible), 8x1TB disks arranged as two RAID 6s that are then striped across give you 4TB of usable storage. As you can see, this gives the same amount of usable storage that a RAID 10 would give on an 8-member array. While RAID 60 would be slightly more redundant, the rebuild times would be substantially longer. Generally, you want to consider RAID 60 only if you have a large number of disks.
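
The capacity rules given for the levels above can be collected into one small helper. This is a sketch that assumes all disks are the same size; the function name usable_capacity and its groups parameter (the number of RAID 5 or RAID 6 sub-arrays in a RAID 50/60) are invented for illustration.

```python
def usable_capacity(level: str, disks: int, size: float, groups: int = 1) -> float:
    """Usable capacity, in the same unit as size, for equal-sized disks."""
    if level == "0":
        return disks * size                  # sum of all disks
    if level == "1":
        return size                          # one disk's worth, however many mirrors
    if level == "10":
        return disks * size / 2              # half of the total
    if level == "5":
        return (disks - 1) * size            # one disk of parity
    if level == "6":
        return (disks - 2) * size            # two disks of parity
    if level == "50":
        return (disks - groups) * size       # n - x
    if level == "60":
        return (disks - 2 * groups) * size   # n - 2x
    raise ValueError(f"unknown RAID level: {level}")


print(usable_capacity("5", disks=12, size=146))           # GB, RAID 5 example
print(usable_capacity("50", disks=6, size=1, groups=2))   # TB, RAID 50 example
print(usable_capacity("60", disks=8, size=1, groups=2))   # TB, RAID 60 example
```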

RAID-Z

Good when: You are using ZFS on a system that supports it

Bad when: Performance demands hardware RAID acceleration

RAID-Z is a bit complicated to explain since ZFS radically changes how storage and file systems interact. ZFS encompasses the traditional roles of volume management (RAID is a function of a Volume Manager) and file system. Because of this, ZFS can do RAID at the file's storage block level rather than at the volume's stripe level. This is exactly what RAID-Z does: it writes the file's storage blocks across multiple physical drives, including a parity block for each set of stripes.

An example may make this much more clear. Say you have 3 disks in a ZFS RAID-Z pool and the block size is 4KB. Now you write a file to the system that is exactly 16KB. ZFS will split that into four 4KB blocks (as would a normal operating system); then it will calculate two blocks of parity. Those six blocks will be placed on the drives similar to how RAID-5 would distribute data and parity. This is an improvement over RAID5 in that there was no reading of existing data stripes to calculate the parity.

Another example builds on the previous. Say the file was only 4KB. ZFS will still have to build one parity block, but now the write load is reduced to 2 blocks. The third drive will be free to service any other concurrent requests. A similar effect will be seen anytime the file being written is not a multiple of the pool's block size multiplied by the number of drives less one (i.e. [File Size] <> [Block Size] * [Drives - 1]).
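
The block counts in the two examples can be reproduced with a simplified model. It assumes single-parity RAID-Z, one parity block per row of up to (drives - 1) data blocks, and ignores padding and sector-size details of real ZFS allocation; the function name is made up.

```python
import math


def raidz1_blocks(file_bytes: int, block_bytes: int, drives: int) -> tuple[int, int]:
    """Return (data blocks, parity blocks) written for one file in this toy model."""
    data = math.ceil(file_bytes / block_bytes)
    parity = math.ceil(data / (drives - 1))   # one parity block per row of data
    return data, parity


print(raidz1_blocks(16 * 1024, 4096, drives=3))  # the 16KB example
print(raidz1_blocks(4 * 1024, 4096, drives=3))   # the 4KB example
```

For the 16KB file this gives four data blocks plus two parity blocks (six in total), and for the 4KB file one data block plus one parity block, leaving the third drive idle.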

ZFS handling both Volume Management and File System also means you don't have to worry about aligning partitions or stripe-block sizes. ZFS handles all that automatically with the recommended configurations.

The nature of ZFS counteracts some of the classic RAID-5/6 caveats. All writes in ZFS are done in a copy-on-write fashion; all changed blocks in a write operation are written to a new location on disk, instead of overwriting the existing blocks. If a write fails for any reason, or the system fails mid-write, the write transaction either occurs completely after system recovery (with the help of the ZFS intent log) or does not occur at all, avoiding potential data corruption. Another issue with RAID-5/6 is potential data loss or silent data corruption during rebuilds; regular zpool scrub operations can help to catch data corruption or drive issues before they cause data loss, and checksumming of all data blocks will ensure that all corruption during a rebuild is caught.

The main disadvantage to RAID-Z is that it is still software raid (and suffers from the same minor latency incurred by the CPU calculating the write load instead of letting a hardware HBA offload it). This may be resolved in the future by HBAs that support ZFS hardware acceleration.

Other RAID and Non-Standard Functionality

Because there's no central authority enforcing any sort of standard functionality, the various RAID levels have evolved and been standardized by prevalent use. Many vendors have produced products which deviate from the above descriptions. It's also quite common for them to invent some fancy new marketing terminology to describe one of the above concepts (this happens most frequently in the SOHO market). When possible, try to get the vendor to actually describe the functionality of the redundancy mechanism (most will volunteer this information, as there's really no secret sauce anymore).

Worth mentioning, there are RAID 5-like implementations which allow you to start an array with only two disks. It would store data on one stripe and parity on the other, similar to RAID 5 above. This would perform like RAID 1 with the extra overhead of the parity calculation. The advantage is that you could add disks to the array by recalculating the parity.
