What exactly is a URE?

drive-failure, hard drive, raid, storage

I have been looking into RAID5 vs RAID6 lately, and I keep seeing that RAID5 is no longer considered safe enough because of URE ratings and the increasing size of drives. Basically, most of the content I found says that in RAID5, if a disk fails and the rest of your array holds 12TB, you have an almost 100% chance of hitting a URE and losing your data.

The 12TB figure comes from the fact that disks are rated at one URE per 10^14 bits read (10^14 bits is 12.5TB).
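As a quick check of that figure, the conversion from a 10^14-bit rating to a data volume can be sketched like this. It assumes decimal terabytes (1 TB = 10^12 bytes), and the variable names are illustrative only.

```python
# Convert a URE rating expressed in "bits read" into a data volume.
# Assumption: decimal units (1 TB = 10**12 bytes); TiB shown for comparison.
BITS_PER_URE = 10**14  # rating quoted in the question

bytes_per_ure = BITS_PER_URE // 8
tb_per_ure = bytes_per_ure / 10**12
tib_per_ure = bytes_per_ure / 2**40

print(f"{tb_per_ure} TB")        # 12.5 TB
print(f"{tib_per_ure:.2f} TiB")  # 11.37 TiB
```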

Well, there is something I do not get here. A read is done by the head moving over the sector, so what can make the read fail is either the head dying or the sector dying. It can also be that the read fails for some other reason (I don't know, maybe a vibration made the head jump…). So, let me address all 3 situations:

  • the read does not work: that is not unrecoverable, right? It can be tried again.
  • the head dies: this would certainly be unrecoverable, but it also means the whole platter (or at least one side) would be unreadable. That would be more alarming, no?
  • the sector dies: also totally unrecoverable, but here I do not understand why a 4TB disk is rated at 10^14 for URE and an 8TB disk is also rated at 10^14. That would mean the sectors on the 8TB disk (most likely newer tech) are half as reliable as the ones on the 4TB disk, which does not make sense.

As you can see, none of the 3 failure points I identified makes sense. So what exactly is a URE, concretely?

Is there somebody who can explain that to me?

Edit 1

After the first wave of answers, it seems the cause is a failing sector. The good thing is that the firmware, the RAID controller, and the OS + filesystem have procedures in place to detect that early and reallocate sectors.

Well, I now know what a URE is (actually, the name is quite self-explanatory 🙂).

I am still puzzled by the underlying causes and, above all, by the stable rating that manufacturers give.

Some attributed failing sectors to external sources (cosmic rays). I am then surprised that the URE rate is based on the amount of data read and not on age: cosmic rays should affect an older disk more, simply because it has been exposed longer. I think this is more of a fantasy, though I might be wrong.

Now comes the other reason, which relates to wear of the disk. Some pointed out that higher densities give weaker magnetic domains; that totally makes sense and I would follow the explanation. But as is nicely explained here, the different capacities of newer disks are obtained mostly by putting more or fewer of the same platters (and hence the same density) in the HDD chassis. The sectors are the same and should all have the very same reliability, so bigger disks should have a better rating than smaller disks, since each sector is read less often. This is not the case. Why?
That would, though, explain why newer disks with newer tech get no better rating than the old ones: the gain from better tech is offset by the loss due to higher density.

Best Answer

A URE is an Unrecoverable Read Error. Something has happened that caused the read of a sector to fail in a way the drive cannot fix. The drive electronics are sophisticated: they will only pass the data up if they have been able to read it correctly from the disk, and they will try multiple times to read a bad sector before declaring it damaged.

What causes the read error? I'm not an expert here (arm waving ensues), but drive aging can cause manufacturing tolerances to become relevant. Magnetic domains can become weakened. Cosmic rays can cause damage, etc. Essentially it is a random failure.

How does this affect RAID 5?

A RAID 5 consists of block-level striping with distributed parity. The parity blocks are calculated by XORing the bits from the data blocks together. For two bits, the XOR function basically says: if the bits are the same the result is 0, otherwise it is 1. When calculating parity you take the first 2 bits and XOR them, then XOR the result with the next bit, and so on, e.g.

1010   data      or    1010 data
1100   data            1100 data
0110   parity          0011 data
                       0101 parity
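The two parity calculations above can be reproduced with a minimal sketch. Here `parity` is a hypothetical helper written for illustration, not part of any RAID tool, and each 4-bit value stands in for a whole block.

```python
from functools import reduce
from operator import xor

def parity(blocks):
    """XOR all data blocks together to obtain the parity block."""
    return reduce(xor, blocks)

# The two examples from the diagram, each block shown as 4 bits.
two_disk = parity([0b1010, 0b1100])
three_disk = parity([0b1010, 0b1100, 0b0011])

print(f"{two_disk:04b}")    # 0110
print(f"{three_disk:04b}")  # 0101
```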

The nature of the XOR function is such that if any disk dies and is replaced, the data that should be on it can be reconstructed from the remaining disks.

1010  data       or    1010 data
      damaged               damaged
0110  parity           0011 data
                       0101 parity

As you can see the damaged data can be reconstructed by XORing the remaining data and parity.
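A minimal sketch of that reconstruction, using the same 4-bit blocks: XORing every surviving block (data and parity) yields the missing one. The helper name `rebuild` is hypothetical, and a real controller does this per stripe across whole blocks.

```python
from functools import reduce
from operator import xor

def rebuild(surviving_blocks):
    """Recover the missing block by XORing all surviving data and parity blocks."""
    return reduce(xor, surviving_blocks)

# Stripe 1: data 1010, data 1100 (lost), parity 0110.
lost_a = rebuild([0b1010, 0b0110])
# Stripe 2: data 1010, data 1100 (lost), data 0011, parity 0101.
lost_b = rebuild([0b1010, 0b0011, 0b0101])

print(f"{lost_a:04b}")  # 1100
print(f"{lost_b:04b}")  # 1100
```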

How does a URE affect this?

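A URE is only significant during a RAID 5 rebuild. While the array is healthy, a block hit by a URE can itself be reconstructed from the other blocks and parity. Once a disk has failed, there is no redundancy left.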

When you reconstruct a RAID 5 there is a large amount of reading to be done. Every block on every surviving disk, data and parity, needs to be read in order to reconstruct the data on the new disk. If a URE occurs, the data for the relevant block cannot be recovered, so your data is inconsistent. For sufficiently large disks in a sufficiently large R5, the number of bits read to reconstruct the replaced disk approaches or exceeds the rating of, for example, one URE per 10^14 bits read, so hitting at least one URE during the rebuild becomes likely.
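To put rough numbers on this, here is a sketch under a deliberately simple assumption: the rating is treated as an independent per-bit error probability of 10^-14. Real drives need not behave this way, so the results are an illustration of the arithmetic, not a prediction. The function name and the sample sizes are made up for the example.

```python
import math

def p_ure_during_rebuild(bytes_to_read, ure_rate_per_bit=1e-14):
    """Probability of at least one URE while reading bytes_to_read bytes.

    Assumes each bit fails independently with probability ure_rate_per_bit
    (a simplification of the vendor rating).
    """
    bits = bytes_to_read * 8
    # 1 - (1 - p)**bits, written with log1p/expm1 to avoid rounding loss
    return -math.expm1(bits * math.log1p(-ure_rate_per_bit))

# Amount of surviving data that must be read, in decimal TB.
for tb in (4, 12, 24):
    p = p_ure_during_rebuild(tb * 10**12)
    print(f"{tb} TB read: {p:.1%}")
```

Under this model, reading 12TB gives a probability of roughly 62%, and the figure keeps climbing toward 100% as the amount of data to read grows.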