Linux – Contact group for alerting after one hour in Nagios / OMD


I am trying to find a solution for the scenario below.

I have a few hundred services in Nagios (an OMD install with check_mk and other delicious stuff). They are defined as different service types, and each type has its own contact group that gets alerted when a problem occurs.

It is working well, but I would like to call a script if a service is still in CRITICAL status after 1 hour and has not been acknowledged, commented on, etc.

I have not found anything in the reference documentation.

Thank you for your help in advance

Typical service type:

define contact{
    contact_name                    level1          ; Short name of user
    use                             generic-contact         ; Inherit default values from
    alias                           Gravity Level1          ; Full name of user
    email                          ; email for alerting
}

define contactgroup{
    contactgroup_name       defcon3
    members                 level1, level2
}

 define service{
   name                            defcon3-service         ; The 'name' of this service template
   active_checks_enabled           0                       ; Active service checks are disabled
   passive_checks_enabled          1                       ; Passive service checks are enabled/accepted
   obsess_over_service             1                       ; We should obsess over this service (if necessary)
   check_freshness                 0                       ; Default is to NOT check service 'freshness'
   notifications_enabled           1                       ; Service notifications are enabled
   event_handler_enabled           1                       ; Service event handler is enabled
   flap_detection_enabled          1                       ; Flap detection is enabled
   failure_prediction_enabled      1                       ; Failure prediction is enabled
   process_perf_data               1                       ; Process performance data
   retain_status_information       1                       ; Retain status information across program restarts
   retain_nonstatus_information    1                       ; Retain non-status information across program restarts
   is_volatile                     0                       ; The service is not volatile
   check_period                    24x7                    ; The service can be checked at any time of the day
   max_check_attempts              3                       ; Re-check the service up to 3 times in order to determine its final (hard) state
   normal_check_interval           2                       ; Check the service every 2 minutes under normal conditions
   retry_check_interval            1                       ; Re-check the service every minute until a hard state can be determined
   notification_options            w,u,c,r                 ; Send notifications about warning, unknown, critical, and recovery events
   notification_interval           60                      ; Re-notify about service problems every hour
   notification_period             24x7                    ; Notifications can be sent out at any time
   contact_groups                  defcon3                 ; default mail to monitoring -v-
   register                        0                       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}

 define service {
    use                           check_mk_passive_perf
    use                           defcon3-service
    host_name                     gravity-mon
    service_description           CPU load
    contact_groups                +defcon3
    service_groups                +defcon3
    check_command                 check_mk-cpu.loads
}

Best Answer

I hate to directly contradict another poster, but Nagios can do exactly that: what you're looking for is referred to in the documentation as notification escalations.

As the doco says,

Notifications are escalated if and only if one or more escalation definitions matches the current notification that is being sent out. If a host or service notification does not have any valid escalation definitions that applies to it, the contact group(s) specified in either the host group or service definition will be used for the notification.

So if you had a service called HTTP on a host webserver, the failure of which ordinarily alerted the group sysadmins every 30 minutes (say), and you wanted the group managers to hear about it a few times as well if the problem was still unacknowledged and unfixed by the third alert, you might try:

define serviceescalation{
    host_name           webserver
    service_description HTTP
    first_notification  3
    last_notification   5
    contact_groups      sysadmins,managers
}

In your case, you don't want to notify people, but invoke a script. For that, you'll need to define a new contact group containing a single member, and give that member a service_notification_commands entry naming a command that runs (e.g.) /usr/local/bin/my-webserver-handling-script.
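As a rough sketch of that setup: all the object names here (run-unacked-critical-script, script-runner, script-handlers) are made up for illustration, and the arguments handed to the script are just one possible choice of standard Nagios macros, so adapt them to whatever your script expects.

```cfg
# Hypothetical command object wrapping the script.
# The macro arguments and their order are an assumption.
define command{
    command_name    run-unacked-critical-script
    command_line    /usr/local/bin/my-webserver-handling-script "$HOSTNAME$" "$SERVICEDESC$" "$SERVICESTATE$" "$NOTIFICATIONTYPE$"
}

# Pseudo-contact: being "notified" means the script gets run.
define contact{
    contact_name                    script-runner
    alias                           Escalation script pseudo-contact
    host_notifications_enabled      0
    service_notifications_enabled   1
    host_notification_period        24x7
    service_notification_period     24x7
    host_notification_options       n      ; never for host problems
    service_notification_options    c      ; only for CRITICAL services
    host_notification_commands      run-unacked-critical-script
    service_notification_commands   run-unacked-critical-script
}

# Single-member group that an escalation can point at.
define contactgroup{
    contactgroup_name   script-handlers
    alias               Contacts that run scripts instead of sending mail
    members             script-runner
}
```

Setting service_notification_options to c means the script is not run for recovery notifications, which is probably what you want here.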

If you don't want the script repeatedly invoked, you'll want to tune first_notification and last_notification above so that this particular escalation only applies once.
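For the service in the question, a one-shot escalation might look something like this. It's only a sketch: script-handlers is a hypothetical name for the single-member contact group described above, and it assumes the service keeps its notification_interval of 60, so that notification number 2 goes out roughly an hour after the first. Acknowledging the problem stops further problem notifications, so the second one (and with it the script) should not fire for an acknowledged service. A plain comment without an acknowledgement won't stop it.

```cfg
define serviceescalation{
    host_name               gravity-mon
    service_description     CPU load
    first_notification      2       ; second notification, about 1 hour in
    last_notification       2       ; ...and only that one, so the script runs once
    notification_interval   60      ; keep the service's hourly re-notification
    escalation_options      c       ; only escalate while the service is CRITICAL
    contact_groups          defcon3,script-handlers   ; humans still get mail; script-handlers is hypothetical
}
```

Keep defcon3 in contact_groups, because while an escalation is in effect its contact groups are used instead of the ones on the service. Also bear in mind that the notification number counts every problem notification for the current problem, so a service that notified on WARNING before going CRITICAL may reach number 2 sooner than an hour after turning CRITICAL.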

I'd also caution you about doing this. I don't personally favour notification systems also becoming incident-handling systems; I think they should let a human know that something's not working right, and let the human deal with it, and here's why: by definition, Nagios only alerts people when things aren't going properly. If you're going to automate the handling of this, you need to be extremely sure that they've failed in exactly the right way. If, for example, you are going to have this script power-cycle the webserver, then you'd better be amazingly sure that you have all your host dependencies set up correctly so that the failure of an intermediate router doesn't also cause your webserver to start being savagely rebooted, thus causing file system corruption that you have to deal with after fixing the router.
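To make the router example a bit more concrete, dependencies along these lines should keep notifications (and therefore the escalated script) quiet while the router is the real problem. The host name core-router and its PING service are hypothetical; substitute whatever actually sits between Nagios and the webserver.

```cfg
# Suppress host notifications for webserver while core-router is DOWN or UNREACHABLE.
define hostdependency{
    host_name                       core-router     ; hypothetical router host
    dependent_host_name             webserver
    notification_failure_criteria   d,u
}

# Suppress HTTP service notifications while the router's PING service
# (also hypothetical) is CRITICAL or UNKNOWN.
define servicedependency{
    host_name                       core-router
    service_description             PING
    dependent_host_name             webserver
    dependent_service_description   HTTP
    notification_failure_criteria   c,u
}
```

Even with these in place, test the failure scenarios by hand before letting a script take destructive action on its own.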