Linux – Diagnosing disk health with smartctl

linux, raid, smartmontools, Ubuntu

How do you determine if a disk has problems using smartctl?

I have an Ubuntu 12.04 server using software RAID1, which became completely unresponsive. I rebooted, and it hung at boot with the message "/tmp is not ready or not present", so I skipped and started up a manual recovery terminal. Everything seemed fine, except my RAID resync was horribly slow. However, cat /proc/mdstat didn't show any actual RAID failure.

I bumped up my /proc/sys/dev/raid/speed_limit_min following the instructions here, but that didn't help too much. My 1TB array has been resyncing for 30 minutes now, but it's only 0.3% complete.

So I installed smartmontools and checked the disks using:

sudo smartctl --all /dev/sda
sudo smartctl --all /dev/sdb

Both report a "PASSED" health, but sdb is also showing several lines like:

Error 83 occurred at disk power-on lifetime: 15147 hours
Error 82 occurred at disk power-on lifetime: 15147 hours
Error 81 occurred at disk power-on lifetime: 15147 hours
Error 80 occurred at disk power-on lifetime: 15147 hours

along with some sort of hex-dump for each.

What does this mean? Should I interpret these errors to mean my sdb disk is dying? How do I confirm this?

Edit: Also related: ever since the crash, I'm unable to SSH into the server. I can access it just fine from a physical terminal, and there doesn't seem to be any excessive load. I made sure the firewall was disabled, and I can still ping the server, but ssh myuser@myserver results in "Connection timed out".

Best Answer

If one of the disks fell out of the RAID, there is likely a reason. I would replace the failed disk (sounds like sdb) and rebuild onto the replacement instead. Now, on to the SMART data.

There is a big section in the smartctl -a output headed "Vendor Specific SMART Attributes with Thresholds". This is a table with one row per attribute, showing its normalized value, worst recorded value, failure threshold, and raw value. Some of the big ones you want to look out for are:

  • Raw_Read_Error_Rate (id 1)
  • Reallocated_Sector_Ct (id 5)
  • Spin_Retry_Count (id 10)
  • Reported_Uncorrect (id 187)
  • Offline_Uncorrectable (id 198)

These all relate to issues with the surface of the disk (except for id 10, which is related to the spindle motor). The surface is the part of the drive most likely to fail. If the raw value of any of these is abnormally high (in the hundreds or thousands), the disk very likely has a serious problem.
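To pull just the attributes listed above out of the table, a small filter can help. This is a rough sketch, not part of smartmontools: the function name smart_watchlist is made up for illustration, and it assumes the usual ten-column attribute layout (ID, name, flag, value, worst, threshold, type, updated, when_failed, raw) printed by smartctl -A.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints
# the ID, name, normalized value, threshold and raw value for the
# attribute IDs 1, 5, 10, 187 and 198.
smart_watchlist() {
    awk '$1 ~ /^(1|5|10|187|198)$/ && $2 ~ /^[A-Za-z]/ && NF >= 10 {
        printf "%s %s value=%s thresh=%s raw=%s\n", $1, $2, $4, $6, $10
    }'
}
```

It would be used as: sudo smartctl -A /dev/sdb | smart_watchlist. Feeding it only the -A output (rather than --all) keeps lines from the error log from being mistaken for attribute rows.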

The error log near the bottom of the output includes a register dump for each error, which looks something like this:

ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

In this case, there was a UNC error on the disk (uncorrectable read/write error).
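The LBA on that line can be rebuilt from the register bytes by hand. The sketch below assumes the 28-bit LBA register layout (SN holds bits 0-7, CL bits 8-15, CH bits 16-23, and the low nibble of DH bits 24-27), which reproduces the value in the sample line; the function name lba_from_registers is made up for illustration.

```shell
# Hypothetical helper: rebuild a 28-bit LBA from the SN, CL, CH and DH
# register bytes, given as hex without a 0x prefix.
lba_from_registers() {
    sn=$1 cl=$2 ch=$3 dh=$4
    # Only the low nibble of DH carries LBA bits; mask off the rest.
    echo $(( ((0x$dh & 0x0f) << 24) | (0x$ch << 16) | (0x$cl << 8) | 0x$sn ))
}

lba_from_registers ff ff ff 0f   # prints 268435455
```

In the same line, ER = 40 is the error register with the UNC bit set, and ST = 51 is the status register with its error bit set.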

My opinion is that if you see anything like this:

Error 518 occurred at disk power-on lifetime: 16859 hours

...the disk should be replaced when it is convenient to do so.
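To check for logged errors without reading the whole report, the error log can be requested on its own with smartctl -l error and the count pulled out. This is only a sketch: ata_error_count is a made-up name, and it assumes the log starts with a line of the form "ATA Error Count: N" when errors exist.

```shell
# Hypothetical helper: reads `smartctl -l error` output on stdin and
# prints the logged ATA error count, or 0 if no count line is present.
ata_error_count() {
    awk '/^ATA Error Count:/ { print $4; found = 1 }
         END { if (!found) print 0 }'
}
```

It would be used as: sudo smartctl -l error /dev/sdb | ata_error_count. For further confirmation, a long self-test can be started with sudo smartctl -t long /dev/sdb and its result read later with sudo smartctl -l selftest /dev/sdb.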

The SSH issue may be related to the disk (it could be that the corrupt portion is under the SSH binary), but this is likely something else you should investigate separately.
