RAID5 URE – Data Loss Risk During Rebuild with URE

raidzfszfsonlinux

I understand the argument regarding larger drives' increased likelihood of experiencing a URE during a rebuild, however I'm not sure what the actual implications are for this. This answer says that the entire rebuild fails, but does this mean that all the data is inaccessible? Why would that be? Surely a single URE from a single sector on the drive would only impact the data related to a few files, at most. Wouldn't the array still be rebuilt, just with some minor corruption to a few files?

(I'm specifically interested in ZFS's implementation of RAID5 here, but the logic seems the same for any RAID5 implementation.)

Best Answer

It really depends on the specific RAID implementation:

  • most hardware RAID will abort the reconstruction and some will also mark the array as failed, bringing it down. The rationale is that if an URE happens during a RAID5 rebuild it means some data are lost, so it is better to completely stop the array rather that risking silent data corruption. Note: some hardware RAID (mainly LSI based) will instead puncture the array, allowing the rebuild to proceed while marking the affected sector as unreadable (similar to how Linux software RAID behaves).

  • linux software RAID can be instructed to a) stop the array rebuild (the only behavior of "ancient" MDRAID/kernels builds) or b) continue with the rebuild process marking some LBA as bad/inaccessible. The rationale is that it is better to let the user do his choice: after all, a single URE can be on free space, not affecting data at all (or affecting only unimportant files);

  • ZRAID will show some file as corrupted, but it will continue with the rebuild process (see here for an example). Again, the rationale is that it is better to continue and report back to the user, enabling him to make an informed choice.

Related Topic