How reliable is HDD SMART data?

hard-drive monitoring smart

Based on SMART data, you can judge the health of a disk; at least, that is the idea. If I, for instance, run sudo smartctl -H /dev/sda on my Arch Linux laptop, it says that the hard drive passed the self-tests and should therefore be "healthy".

My question is how reliable this information is, or more specifically:

  • If the disk is healthy according to the SMART data, what are the odds of it suddenly failing anyway? This assumes the failure is not due to some catastrophic event that could not possibly have been predicted, such as the laptop falling on the floor and causing the drive heads to hit the platters.
  • If the SMART data does not say the disk is in good shape, what are the odds of the disk failing within some amount of time? Is it possible that there will be false positives and how common are these?

Of course, I keep backups no matter what. I am mostly curious.

Best Answer

In my experience (20 years of operating servers; I must have handled about 5,000 disks across all the servers I have dealt with), SMART is useful but no panacea.

If you get SMART errors, replace the disk ASAP. Chances are very high that within 4-8 weeks the disk will have serious issues. (The Google study frequently mentioned in this regard correlates very nicely with my personal experience.)
Typically you have a week or two before the disk becomes really problematic.

If you don't get SMART errors at all, the disk can still fail without any warning whatsoever, although that is quite rare in servers. I see maybe 3 or 4 such cases per year, while we replace about 25 drives per month because of SMART errors.
This may be because server disks are usually part of a RAID array and see a continuous read/write pattern all over the disk. This gets every part of the disk "exercised" (and checked) on a regular basis.
The biggest chance of a disk failing without previous warning is at startup, when a server has been switched off for some time after running continuously for months or years.
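
To put those two figures side by side, here is a rough back-of-the-envelope calculation. It uses only the numbers quoted above (about 25 SMART-triggered replacements per month, 3 or 4 unannounced failures per year) and gives the share of replacements that came without a warning, not the failure probability of an individual disk. The function name is made up for illustration.

```python
def unwarned_share(smart_replacements_per_month, sudden_failures_per_year):
    """Fraction of all yearly replacements that had no prior SMART warning."""
    warned_per_year = smart_replacements_per_month * 12
    total_per_year = warned_per_year + sudden_failures_per_year
    return sudden_failures_per_year / total_per_year

# Figures from the answer: ~25 per month with a warning, 3 to 4 per year without.
low = unwarned_share(25, 3)
high = unwarned_share(25, 4)
print(f"{low:.1%} to {high:.1%} of replacements came without a SMART warning")
```

So in this particular server fleet, on the order of 1 in 100 replaced disks died without SMART saying anything first.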

In consumer equipment (non-server, laptop/desktop drives) I have seen plenty of disks with read errors that somehow didn't end up in the SMART data, even though Windows logged those errors in the Event log. (SMART only logged them after a full chkdsk from Windows.)
This leads me to believe that, in many consumer drives, the SMART thresholds are set quite low, so a drive has to degrade a lot before SMART reports a problem. This might be (big IF) intentional, to keep RMA numbers low in this cut-throat business.
Many consumers will not notice the occasional bad block anyway until it is too late. (How many consumers know where to find the Event log? That's the only place where you can see disk errors in Windows.)
In my experience, if a consumer disk has issues (SMART or otherwise), copy your data off it and replace it immediately. By the time it gives those errors it is already past dead.
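
Since the overall verdict can stay at "passed" while individual counters are already creeping up, one option is to look at the raw attribute values yourself. The following is a minimal sketch, not a definitive tool. It assumes smartmontools 7.0 or later (which added the --json option) and an ATA drive, and it assumes the JSON field names shown here (smart_status.passed, ata_smart_attributes.table); check them against the output on your own system. The sample report is hand-written for illustration, not taken from a real drive, and the helper names are hypothetical.

```python
import json

# Attributes commonly associated with failing sectors (standard ATA attribute IDs).
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}

def overall_passed(report):
    """Overall health verdict, as reported by `smartctl -H` (None if absent)."""
    return report.get("smart_status", {}).get("passed")

def suspicious_attributes(report):
    """Return {name: raw_count} for watched attributes with a nonzero raw value.

    `report` is assumed to be the parsed output of
    `smartctl --json -H -A <device>`.
    """
    table = report.get("ata_smart_attributes", {}).get("table", [])
    found = {}
    for attr in table:
        raw = attr.get("raw", {}).get("value", 0)
        if attr.get("id") in WATCHED and raw > 0:
            found[WATCHED[attr["id"]]] = raw
    return found

# Hand-written sample in the assumed JSON layout (illustrative values only).
SAMPLE = json.loads("""
{
  "smart_status": {"passed": true},
  "ata_smart_attributes": {"table": [
    {"id": 5,   "name": "Reallocated_Sector_Ct",  "raw": {"value": 0}},
    {"id": 9,   "name": "Power_On_Hours",         "raw": {"value": 12000}},
    {"id": 197, "name": "Current_Pending_Sector", "raw": {"value": 8}}
  ]}
}
""")

print(overall_passed(SAMPLE))         # True
print(suspicious_attributes(SAMPLE))  # {'Current_Pending_Sector': 8}
```

In this made-up sample the drive still "passes", yet it has pending sectors, which is the kind of situation described above. Raw values are vendor-specific, so treat a nonzero count as a reason to look closer rather than as proof of failure.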