What Counts as a Large RAID 5 Array?


A recent issue with a Buffalo TeraStation NAS here in my office has got me investigating RAID 5.

I've found a few different articles talking about the unsuitability of RAID 5 in large arrays, or with large disks.

Here is one example article that talks about problems with rebuilding an array with large consumer drives.

I'm trying to work out what counts as 'large'.

The NAS we have here is a four-drive RAID 5 setup with 1 TB drives. A drive failed and has been replaced, and the array is currently rebuilding.

Does this setup count as 'large', in the sense that it's likely to have a problem during the rebuild?

How reliable is this setup for day to day use?

Best Answer

Estimating the reliability of a disk array:

  1. Find the URE rate of your drive (manufacturers don't like to talk about their drives failing, so you might have to dig to find this; it is usually quoted as 1/10^X, where X is commonly around 12-18).
  2. Decide what an acceptable risk rate is for your storage needs†. Typically this is a <0.5% chance of failure, but it could be several percent for "scratch" storage, or <0.1% for critical data.
  3. 1 - (1 - [Drive Size] x [URE Rate]) ^ [Data Drives‡] = [Risk] (see the Python sketch just after this list).
    For arrays with more than one parity disk, or mirrors with more than two disks in the mirror, subtract all of the parity/mirror disks when counting [Data Drives].
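
The formula in step 3 drops into a few lines of Python. This is a minimal sketch under the same assumptions as above (drive size in bytes, URE rate quoted per byte); the function name rebuild_risk is mine, not something from a library.

    def rebuild_risk(drive_size_bytes, ure_rate_per_byte, data_drives):
        """Chance of hitting at least one unrecoverable read error (URE)
        while reading every data drive in full during a rebuild."""
        per_drive_error = drive_size_bytes * ure_rate_per_byte   # expected UREs per full drive read
        return 1 - (1 - per_drive_error) ** data_drives          # 1 - P(every drive reads cleanly)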

So I've got a set of four 1 TB WD Green drives in an array. They have a URE rate of 1/10^14, and I use them as scratch storage. 1 - (1 - 1 TB x 1/10^14 per byte)^3 => 3.3% risk of failure while rebuilding the array after one drive dies. These are great for storing my junk, but I'm not putting critical data on there.
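
Plugging those numbers into the sketch above reproduces the figure, assuming 1 TB is read as 2^40 bytes:

    # Four 1 TB WD Green drives in RAID 5 -> 3 data drives, URE rate 1/10^14 per byte.
    risk = rebuild_risk(drive_size_bytes=2**40, ure_rate_per_byte=1e-14, data_drives=3)
    print(f"{risk:.1%}")   # -> 3.3%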

†Determining acceptable failure is a long and complicated process. It can be summarized as Budget = Risk * Cost: if a failure is going to cost $100 and has a 10% chance of happening, then you should have a budget of $10 to prevent it. This grossly simplifies the task of determining the risk, the costs of various failures, and the nature of potential prevention techniques, but you get the idea.

‡[Data Drives] = [Total Drives] - [Parity Drives]. A two-disk mirror (RAID 1) and RAID 5 each have one parity drive. A three-disk mirror (RAID 1) and RAID 6 each have two parity drives. It's possible to have more parity drives with RAID 1 and/or custom schemes, but that's atypical.
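
For what it's worth, the Budget = Risk * Cost rule of thumb from the † footnote is trivial to encode; prevention_budget below is a hypothetical helper, not anything standard:

    def prevention_budget(risk, cost_of_failure):
        """Rough ceiling on what it is worth spending to prevent a failure."""
        return risk * cost_of_failure

    print(prevention_budget(risk=0.10, cost_of_failure=100))   # 10.0 -> the $10 from the example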


This statistical equation does come with its caveats, however:

  • The URE rate above is the advertised rate, and most drives rolling off the assembly line do better. You might get lucky and buy a drive that is orders of magnitude better than advertised; similarly, you could get a drive that dies of infant mortality.
  • Some manufacturing lines have bad runs (where many disks in the run fail at the same time), so getting disks from different manufacturing batches helps to distribute the likelihood of simultaneous failure.
  • Older disks are more likely to die under the stress of a rebuild.
  • Environmental factors take a toll:
    • Disks that are frequently heat cycled (e.g. powered on/off regularly) are more likely to die.
    • Vibration can cause all kinds of issues; see the YouTube video of an engineer yelling at a disk array.
  • "There are three kinds of lies: lies, damned lies, and statistics" - Benjamin Disraeli