Nagios service check

nagios

I am new to Nagios and need help with a small issue. Many of the machines we monitor can become unresponsive for a while when very CPU-intensive tasks are running. While these hosts are busy, Nagios sends warnings and alerts such as 'ping timeout', 'zombie processes', and even swap space warnings, although there is no actual problem.

Is there a way to configure Nagios so that it does not alert immediately, but instead checks x times over a period of time and sends an alert only if the server in question has not recovered by the end of that period?

Looking at the commands.cfg file, I see entries like this:

define command{
        command_name    check_local_swap
        command_line    $USER1$/check_swap -w $ARG1$ -c $ARG2$
        }

How could I modify this example to accomplish what I describe above?

Thanks

Best Answer

First, you could relax the thresholds of the check(s) in question by adjusting the check_command directive(s) in the corresponding service definitions:

For example:

    check_command           check_nrpe!check_zombie_procs!1 5

If you want to tolerate more zombie processes, just increase the numbers passed as arguments to the check.

Once you have the thresholds adjusted to your liking, you could further prune spurious alerts by increasing max_check_attempts.

For example:

    max_check_attempts      3

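This allows the host or service to enter a "soft" non-OK state while Nagios rechecks it. With a value of 3, no notification is sent after the first or second failed check; you are alerted only if the third consecutive check also fails and the state becomes "hard".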
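
To tie this back to the question: the command definition in commands.cfg can stay as it is, because the retry behaviour is configured on the host or service that uses the command. Below is a minimal sketch of such a service definition, not a drop-in configuration. The host name, the generic-service template, and the threshold values are assumptions for illustration, and the directive names are those of Nagios 3 and later (older releases call them normal_check_interval and retry_check_interval).

```
define service{
        use                     generic-service         ; assumed template name
        host_name               busy-host               ; hypothetical host
        service_description     Swap Usage
        check_command           check_local_swap!20%!10%  ; illustrative warning/critical thresholds
        check_interval          5       ; time units between checks while the service is OK
        retry_interval          1       ; time units between rechecks while in a soft non-OK state
        max_check_attempts      3       ; notify only after the third consecutive failed check
        }
```

Assuming the default interval_length of 60 seconds, this configuration rechecks a failing service once a minute and notifies roughly two minutes after the first failed check, provided the service has not recovered in the meantime. Increasing retry_interval or max_check_attempts lengthens that grace period.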

See also: Nagios State Types, Nagios Object Definitions