Munin’s smart plugin keeps reporting a past error because of smartctl’s exit code

munin smart

My hosting provider has inserted a hard drive into my server which seems to have had some sort of error in the past but a full offline smart check showed that everything is (about) ok at the moment. The server has a RAID1 so I can somewhat live with that situation.

The problem is that (according to the man page) smartctl sets bit 6 of its exit code if the device error log contains records of past errors. So even though everything is fine right now, the exit code is 64.

The smart plugin is configured by default to have a threshold of 0, and while I know I could set the threshold up to 64, I would miss out on the much more important bit 3 "disk is failing".

Is there a way to configure the threshold so that munin does a bitwise comparison of the value?

Best Answer

Eventually I have resorted to patching the smart plugin. Depending on your version there is some code like this:

        if exit_status!=None :
            # smartctl exit code is a bitmask, check man page.
            num_exit_status=int(exit_status/256)

Replace it with this:

        if exit_status!=None :
            # smartctl exit code is a bitmask, check man page.
            num_exit_status=int(exit_status/256)
            # filter out bit 6
            num_exit_status &= 191
            if num_exit_status<=2 :
                exit_status=None

        if exit_status!=None :

The most interesting part is the line with the bitwise operation against 191: this is 0b10111111 in binary, so ANDing it with the current value just sets bit 6 to 0 while leaving the other bits untouched.

Therefore a value of 64 (as in my case) will be reported as 0, while a value of 8 remains 8. Just as importantly, a value of 72 (bit 6 set as always, plus bit 3 set because the disk is failing) is also reported as 8.
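
To check the arithmetic outside the plugin, here is a small standalone sketch. The names `ERROR_LOG_BIT`, `MASK` and `filter_exit_status` are made up for illustration and do not appear in the plugin; only the `& 191` operation comes from the patch above.

```python
# Hypothetical helper names; only the "& 191" operation comes from the patch.
ERROR_LOG_BIT = 1 << 6            # 64: "error log contains records" bit
MASK = 0xFF & ~ERROR_LOG_BIT      # 191 == 0b10111111


def filter_exit_status(status):
    """Clear bit 6 and leave every other bit of the status unchanged."""
    return status & MASK


for status in (64, 8, 72):
    print(status, "->", filter_exit_status(status))
# prints:
# 64 -> 0
# 8 -> 8
# 72 -> 8
```

Writing the mask as `0xFF & ~(1 << 6)` instead of the literal 191 makes it easier to see which bit is being dropped, and to extend the mask if you ever want to ignore another bit.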
