How to make Status Information for Nagios services easier to read

alerts, monitoring, nagios

I'm running Nagios in an environment with several servers, each running several services. There are a few custom checks, but I prefer to use existing checks where possible. I'm using the check_disk plugin via NRPE to check the utilization of each mounted file system:

command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -p / -p /var -C -u GB -w 200 -c 100 -r '^/mounts[^/]+$'

It's handy to have these all checked as a single service ("Disks"), but when one of them goes into a WARNING state, the output in the Status Information line is hard to read:

DISK WARNING - free space: / 6 GB (9% inode=92%): /var 125 GB (67% inode=99%): /mounts/vol0 1152 GB (16% inode=99%): /mounts/vol1 1096 GB (15% inode=99%): /mounts/vol2 126 GB (1% inode=99%): /mounts/vol3 228 GB (3% inode=99%): /mounts/vol4 3245 GB (44% inode=99%): /mounts/vol5 108 GB (1% inode=99%): 

In the above case, the check is in a WARNING state because /, /mounts/vol2, and /mounts/vol5 are below threshold. An operator has to wade through every value to find the ones that violate the thresholds. Also, if one is critical and the others are warning, it would be nice to show them differently, either by marking them or by putting them on different lines.

Is there a straightforward way to do this, without creating a new command for every mount point? Or am I missing some other fundamental method of Nagios magic to make this friendly?

Best Answer

Try the --errors-only flag, which should greatly reduce the amount of text this plugin prints.

 -e, --errors-only
 Display only devices/mountpoints with errors

This seems to do the trick for me. Note the drastic difference in the output:

# /usr/lib64/nagios/plugins/check_disk -w 20% -c 10% 
DISK WARNING - free space: / 37167 MB (96% inode=98%); /dev/shm 244 MB (100% inode=99%); /boot 84 MB (18% inode=99%); /home 21253 MB (99% inode=99%);

But with the --errors-only flag, it's now clear that my problem is with /boot:

# /usr/lib64/nagios/plugins/check_disk -w 20% -c 10% --errors-only
DISK WARNING - free space: /boot 94 MB (20% inode=99%);

If there are no problems on the system, the output is very short:

# /usr/lib64/nagios/plugins/check_disk -w 20% -c 10% --errors-only
DISK OK
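
Applied to the NRPE command from the question, this could look like the sketch below. It assumes the installed check_disk supports -e (the short form of --errors-only shown above) and simply appends it to the original definition; everything else is copied unchanged from the question.

```
# NRPE command definition from the question, with -e (--errors-only) appended
command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -p / -p /var -C -u GB -w 200 -c 100 -r '^/mounts[^/]+$' -e
```

NRPE typically has to be restarted or reloaded before a changed command definition takes effect.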

(Note: I have removed everything after the first | for clarity. The Nagios web interface also trims this output before it is displayed on the screen.)
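
To reproduce that trimming when running the plugin by hand, one option is to cut the output at the first | in the shell. This is a minimal sketch: the text before the | is taken from the example above, while the performance data after it is made up for illustration and will differ on a real system.

```shell
# Sample plugin output. The part before "|" comes from the example above;
# the performance data after "|" is invented for illustration only.
output='DISK WARNING - free space: /boot 94 MB (20% inode=99%);| /boot=376MB;377;424;0;471'

# Keep only the human-readable part: drop everything from the first "|" onward.
status_text=${output%%|*}

printf '%s\n' "$status_text"
```

The same parameter expansion works on live output, for example by capturing the plugin's output in a variable first.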

Also see this discussion on the Debian bug tracker: "nagios2: complains about disk space in an uncomprehensible way".