I use the data coming out of the (i)DRAC, combined with the data that ESXi harvests via CIM, with vCenter configured to alert on faults coming out of the CIM monitoring.
I'm a little unclear on what you're saying about the trustworthiness of the CIM data, but I personally trust it a heck of a lot more than I would trust the SNMP traps being fed to WhatsUp. CIM will catch and throw alerts on something as minor as low voltage on the BIOS battery, as long as your hardware is well supported (as recent Dell equipment is), and vCenter is pretty flexible about choosing what, where, and how often you throw alerts on those events.
If one of the disks fell out of the RAID array, there is likely a reason. I would replace the failed disk (it sounds like sdb) and rebuild onto that instead. On to the SMART data.
There is a big section in the smartctl -a output for the SMART data structure. This is a big matrix of attribute names and numbers that shows the current value, worst value, and failure threshold for each attribute. Some of the big ones you want to look out for are:
- Raw_Read_Error_Rate (id 1)
- Reallocated_Sector_Ct (id 5)
- Spin_Retry_Count (id 10)
- Reported_Uncorrect (id 187)
- Offline_Uncorrectable (id 198)
These all relate to problems with the surface of the disk (except for id 10, which relates to the spindle motor). Of everything in the drive, the surface is the part most likely to fail. If any of these raw values is abnormally high (in the hundreds or thousands), you know for sure there is a big problem.
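As a rough sketch of how you could watch those attributes, assuming smartmontools' usual smartctl -A column layout (attribute id in the first column, name in the second, raw value in the tenth), a small shell helper can flag anything nonzero. Treat the nonzero threshold as a starting point and tune it per attribute and vendor (id 1's raw value is famously meaningless on some Seagate drives):

```shell
#!/bin/sh
# Flag the surface/spindle SMART attributes listed above when their
# raw value is nonzero. Feed it `smartctl -A` output, e.g.:
#   smartctl -A /dev/sda | check_smart_attrs
check_smart_attrs() {
    awk '$1 == 1 || $1 == 5 || $1 == 10 || $1 == 187 || $1 == 198 {
        # column 2 is the attribute name, column 10 the raw value
        if ($10 + 0 > 0)
            printf "WARNING: %s (id %s) raw value = %s\n", $2, $1, $10
    }'
}
```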
The error-log registers at the bottom of the output look something like this:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
In this case, there was a UNC error on the disk (uncorrectable read/write error).
My opinion is that if you see anything like this:
Error 518 occurred at disk power-on lifetime: 16859 hours
...the disk should be replaced when it is convenient to do so.
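If you want to automate that judgment call, smartctl -l error prints an "ATA Error Count: N" line when the error log is non-empty, which is easy to pull out with a small helper (a sketch; the exact wording of the no-error case varies by drive, so match only the count line):

```shell
#!/bin/sh
# Summarise the SMART error log; feed it `smartctl -l error /dev/sdX`
# output, which includes an "ATA Error Count: N" line when errors exist.
count_smart_errors() {
    awk -F': ' '/^ATA Error Count/ { n = $2 + 0 }
                END { if (n > 0) printf "%d logged errors - plan a replacement\n", n
                      else print "no logged errors" }'
}
# usage: smartctl -l error /dev/sda | count_smart_errors
```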
The SSH issue may be related to the disk (the corrupt portion could be under the SSH binary), but it is likely something else and worth investigating separately.
Best Answer
This depends on a variety of things. Many RAID controllers have their own tools for querying this kind of information: HP's SmartArray controllers use HP's hpacucli tool. I would in general recommend using this Nagios plugin for checking the health of those disks, and HP server health in general.
Dell's servers have their own OpenManage drivers that need to be queried. A good Nagios plugin for this is to be found here.
If your hard drives support S.M.A.R.T. (I believe all of them nowadays do), you can use check_smartmon.
Both of the above check RAID status as well as physical drives. In some cases, if you update the plugins now and then, you will also be told when it's appropriate to update your firmware.
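The vendor tools above can be wrapped in one script that uses whichever CLI is installed. The tool names (hpacucli, omreport) are the real vendor binaries, but the subcommands shown are typical and vary by version, so treat this as a sketch rather than a drop-in plugin:

```shell
#!/bin/sh
# Dump physical-disk status with whichever vendor CLI is installed.
# Subcommands are typical for these tools but differ between versions.
raid_status() {
    if command -v hpacucli >/dev/null 2>&1; then
        hpacucli ctrl all show config        # HP SmartArray
    elif command -v omreport >/dev/null 2>&1; then
        omreport storage pdisk controller=0  # Dell OpenManage
    else
        echo "no supported RAID CLI found" >&2
        return 3    # Nagios UNKNOWN
    fi
}
```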
For software RAID in Linux, check_md_adm can be used.
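For a quick manual check of md arrays without a plugin, /proc/mdstat already tells you what you need: inside the status brackets, "U" marks a member that is up and "_" one that is missing or failed, so any underscore means a degraded array. A minimal sketch:

```shell
#!/bin/sh
# Check /proc/mdstat-style text for degraded arrays: any "_" inside
# the member-status brackets (e.g. "[U_]") means a member is down.
md_degraded() {
    grep -E '\[[U_]*_[U_]*\]' >/dev/null
}
# usage:
#   if md_degraded < /proc/mdstat; then echo "CRITICAL: degraded md array"; fi
```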
There's a plugin for monitoring ZFS-pools on Nagios Exchange: link.
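If you just want a crude ZFS check without installing that plugin, zpool status -x prints exactly "all pools are healthy" when nothing is wrong, which a wrapper can key off (a sketch; the plugin above is more thorough):

```shell
#!/bin/sh
# Crude ZFS health check: `zpool status -x` prints exactly
# "all pools are healthy" when nothing is wrong.
zfs_health() {
    if [ "$1" = "all pools are healthy" ]; then
        echo "OK: $1"
    else
        echo "CRITICAL: $1"
        return 2    # Nagios CRITICAL
    fi
}
# usage (on a host with ZFS):
#   zfs_health "$(zpool status -x)"
```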