How to make a persistent acknowledgment in Icinga/Nagios

icinga monitoring nagios

I am using Icinga (a Nagios fork) to monitor the uptime of external hosts and services as well as internal ones. When I look at the "Critical" count, I find it hard to tell whether an internal service is affected (so I should take immediate action) or an external one (so I just acknowledge the problem).

Is there a way to keep a problem acknowledgement for future down-times of the checked host/service? Is there some way to auto-acknowledge the state change of external hosts/services?

Best Answer

I found a way to auto-acknowledge problems on external hosts.

First define an event handler for the external host:

define host {
        host_name       some-external-server
        # ...
        event_handler   handle_external_host
        # ...
}

Then define the command to be used as the event handler:

define command {
        command_name    handle_external_host
        command_line    $USER1$/eventhandlers/acknowledge_host_problem $HOSTNAME$ icingaadmin "Handled by external user"
}

Finally, put the event handler script in /usr/local/icinga/libexec/eventhandlers/acknowledge_host_problem (or wherever your event handlers are installed):

#!/bin/sh

printf_cmd="/usr/bin/printf"
command_file="/usr/local/icinga/var/rw/icinga.cmd"

hostname="$1"
author="$2"
comment="$3"

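# get the current date/time in seconds since UNIX epoch
now=$(date +%s)

# write the external command to the command file (a named pipe); the fields are
# ACKNOWLEDGE_HOST_PROBLEM;<host_name>;<sticky>;<notify>;<persistent>;<author>;<comment>
$printf_cmd "[%lu] ACKNOWLEDGE_HOST_PROBLEM;%s;1;1;0;%s;%s\n" "$now" "$hostname" "$author" "$comment" >> "$command_file"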

Don't forget to make the script executable, for example with "chmod +x". For details on ACKNOWLEDGE_HOST_PROBLEM, see the Icinga documentation.
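
Event handlers run on soft problem states, on the first hard problem state, and on recovery, so the script above is also invoked when the host comes back up. The following is a minimal sketch of a variant that only acknowledges hard DOWN or UNREACHABLE states. It assumes the command_line is changed to pass the $HOSTSTATE$ and $HOSTSTATETYPE$ macros as the second and third arguments; the function names should_ack and format_ack and the COMMAND_FILE override are made up for this illustration.

```shell
#!/bin/sh
# Assumed invocation (hypothetical command_line):
#   $USER1$/eventhandlers/acknowledge_host_problem $HOSTNAME$ $HOSTSTATE$ $HOSTSTATETYPE$ icingaadmin "Handled by external user"

# Command file path from the answer above; COMMAND_FILE is an assumed override.
command_file="${COMMAND_FILE:-/usr/local/icinga/var/rw/icinga.cmd}"

# Succeed only for a hard DOWN or UNREACHABLE state.
should_ack() {
    state="$1"
    state_type="$2"
    [ "$state_type" = "HARD" ] || return 1
    case "$state" in
        DOWN|UNREACHABLE) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the external command line: timestamp, host, author, comment.
# The sticky/notify/persistent values (1;1;0) are the same as in the script above.
format_ack() {
    printf '[%s] ACKNOWLEDGE_HOST_PROBLEM;%s;1;1;0;%s;%s\n' "$1" "$2" "$3" "$4"
}

# Only act when called with the five expected arguments.
if [ "$#" -eq 5 ] && should_ack "$2" "$3"; then
    format_ack "$(date +%s)" "$1" "$4" "$5" >> "$command_file"
fi
```

Keeping the formatting in its own function makes it possible to check the generated line without writing to the real command file.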

Related Solutions

Nagios: Possible to be alerted between certain times, and not when notification_period resumes

I can think of two ways to do this: (a) use an external command to change the check command (Nagios calls this "adaptive monitoring") or (b) split the service into two with different check commands and periods.

I'll use check_load as an example with these (skeletal) service and command definitions:

 define service{
   service_description load
   host_name           foohost
   check_command       check_load!1,1,1!2,2,2
   ... (all other options)
 }

 define command{
   command_name check_load
   command_line $USER1$/check_load -w $ARG1$ -c $ARG2$
 }

For (a), suppose you wish to change these values at 8pm and return them at 8am. In cron, add:

 0 20 * * * /some/path/change_load_check 3,3,3 4,4,4
 0  8 * * * /some/path/change_load_check 1,1,1 2,2,2

where change_load_check is

#!/bin/sh

now=`date +%s`
commandfile='/usr/local/nagios/var/rw/nagios.cmd'

W=$1
C=$2

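# pass the thresholds as printf arguments rather than embedding them in the format string
/bin/printf "[%lu] CHANGE_SVC_CHECK_COMMAND;foohost;load;check_load!%s!%s\n" \
  "$now" "$W" "$C" > "$commandfile"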

You need to have external commands enabled.
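
External commands are controlled by the check_external_commands directive in the main configuration file. As a rough sketch, the check below looks for an uncommented check_external_commands=1 line. The config path is an assumption based on the /usr/local/nagios prefix used above, and the function name and the NAGIOS_CFG override are hypothetical.

```shell
#!/bin/sh
# Succeed if the given main config file has an uncommented
# "check_external_commands=1" line (optional spaces around "=").
external_commands_enabled() {
    file="$1"
    grep -Eq '^[[:space:]]*check_external_commands[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$file"
}

# Assumed default location; adjust to your installation.
cfg="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"
if [ -r "$cfg" ]; then
    if external_commands_enabled "$cfg"; then
        echo "external commands enabled"
    else
        echo "external commands disabled"
    fi
fi
```

If the directive is set to 0, change it to 1 and restart the daemon before relying on the cron jobs.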

For (b) you would take the original service, turn it into a template, and create two new services that specify different periods and check commands like so:

 define service{
   name          load_template
   host_name     foohost
   ... (all other options)
   register      0
 }

 define service{
   service_description load_workhours
   use                 load_template
   check_period        workhours
   notification_period workhours
   check_command       check_load!1,1,1!2,2,2
 }

 define service{
   service_description load_offhours
   use                 load_template
   check_period        offhours
   notification_period offhours
   check_command       check_load!3,3,3!4,4,4
 }

Nagios: Send Escalation Alert if issue is Acknowledged but not Recovered (OK State)

I have a Perl script that does this. You simply need to scan the 'status.dat' file for:

host checks > last_time_up = <value>
service checks > last_time_ok = <value>

Both store an epoch timestamp. If that timestamp is more than a chosen number of seconds older than the current time, the service_description and/or host_name is added to the email that gets sent out. My script also checks the 'problem_has_been_acknowledged' entry and tells me whether the problem has been acknowledged. The script runs from a crontab entry every 30 minutes and sends a listing of all host/service checks that matched.
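
The Perl script itself is not shown, so the following is only a sketch of the same idea using sh and awk. It assumes status.dat consists of "hoststatus {" and "servicestatus {" blocks with one key=value pair per line and a closing "}" on its own line; the function name stale_checks and its argument order are made up for illustration.

```shell
#!/bin/sh
# usage: stale_checks STATUS_FILE NOW_EPOCH MAX_AGE_SECONDS
# Prints one line per host or service whose last good time is older than MAX_AGE_SECONDS.
stale_checks() {
    awk -v now="$2" -v max_age="$3" '
        # A new host or service block resets the per-block fields.
        /^(hoststatus|servicestatus) [{]/ {
            kind = $1; host = ""; svc = ""; last_ok = ""; acked = 0
            next
        }
        # Inside a block, split each line on the first "=".
        kind != "" && /=/ {
            line = $0
            sub(/^[ \t]+/, "", line)
            eq = index(line, "=")
            key = substr(line, 1, eq - 1)
            val = substr(line, eq + 1)
            if (key == "host_name") host = val
            else if (key == "service_description") svc = val
            else if (key == "last_time_up" && kind == "hoststatus") last_ok = val
            else if (key == "last_time_ok" && kind == "servicestatus") last_ok = val
            else if (key == "problem_has_been_acknowledged") acked = val
            next
        }
        # End of block: report it if the last good time is too old.
        kind != "" && /^[ \t]*[}]/ {
            if (last_ok != "" && now - last_ok > max_age) {
                name = (kind == "servicestatus") ? host "/" svc : host
                printf "%s acknowledged=%s\n", name, acked
            }
            kind = ""
        }
    ' "$1"
}
```

A cron job could call it as stale_checks /path/to/status.dat "$(date +%s)" 3600 and mail the output if it is not empty; the path and the one-hour threshold here are placeholders.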
