I am trying to find a solution for the scenario below.
I have a few hundred services in Nagios (an OMD install with check_mk and other delicious stuff). They are defined as different service types, and each type has its own contact groups that get alerted when a problem occurs.
It is working well, but I would like to call a script if a service has been in critical status for 1 hour and has not been acknowledged, commented on, etc.
I have not found anything in the reference documentation.
Thank you for your help in advance
Typical service type:
define contact{
contact_name level1 ; Short name of user
use generic-contact ; Inherit default values from
alias Gravity Level1 ; Full name of user
email wtf@spamcop.net ; email for alerting
}
define contactgroup{
contactgroup_name defcon3
members level1, level2
}
define service{
name defcon3-service ; The 'name' of this service template
active_checks_enabled 0 ; Active service checks are disabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
obsess_over_service 1 ; We should obsess over this service (if necessary)
check_freshness 0 ; Default is to NOT check service 'freshness'
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across
is_volatile 0 ; The service is not volatile
check_period 24x7 ; The service can be checked at any time of the day
max_check_attempts 3 ; Re-check the service up to 3 times in order to
normal_check_interval 2 ; Check the service every 2 minutes under normal
retry_check_interval 1 ; Re-check the service every minute until a
notification_options w,u,c,r ; Send notifications about warning, unknown,
notification_interval 60 ; Re-notify about service problems every hour
notification_period 24x7 ; Notifications can be sent out at any time
contact_groups defcon3 ; default mail to monitoring -v-
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A R
}
define service {
use check_mk_passive_perf,defcon3-service
host_name gravity-mon
service_description CPU load
contact_groups +defcon3
service_groups +defcon3
check_command check_mk-cpu.loads
}
Best Answer
I hate to directly contradict another poster, but NAGIOS can do exactly that: what you're looking for is referred to in the documentation as notification escalations.
The doco covers these in detail. So if you had a service called HTTP on a host webserver, the failure of which ordinarily alerted the group sysadmins every 30 minutes (say), and you wanted the group managers to hear about it a few times if the alerts were still unacknowledged and unfixed by the third alert, you might define a service escalation for that service whose first_notification is 3 and whose contact groups include managers.
In your case, you don't want to notify people, but invoke a script. For that, you'll need to define a new contact group which contains one member, which member has a service_notification_commands entry naming a command that runs (e.g.) /usr/local/bin/my-webserver-handling-script.
If you don't want the script repeatedly invoked, you'll want to tune first_notification and last_notification in the escalation so that this particular escalation only applies once.
I'd also caution you about doing this. I don't personally favour notification systems also becoming incident-handling systems; I think they should let a human know that something's not working right, and let the human deal with it, and here's why: by definition, NAGIOS only alerts people when things aren't going properly. If you're going to automate the handling of this, you need to be extremely sure that things have failed in exactly the right way. If, for example, you are going to have this script power-cycle the webserver, then you'd better be amazingly sure that you have all your host dependencies set up correctly, so that the failure of an intermediate router doesn't also cause your webserver to start being savagely rebooted, thus causing file system corruption that you have to deal with after fixing the router.
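To make the escalation approach above concrete, here is a minimal sketch of what the object definitions might look like for the CPU load service from the question. It is not a tested configuration. The names run-escalation-script, escalation-script and escalation-scripts are made up for illustration, and the macro arguments passed to the script are just one plausible choice. It also assumes that generic-contact supplies the remaining required contact directives, that the first notification goes out when the problem becomes hard, and that notification_interval stays at 60, so that the second notification falls roughly one hour in. Acknowledging the problem stops further notifications, so the script would only run for unacknowledged problems.

```
# Hypothetical command wrapping the script; the arguments are an assumption.
define command{
    command_name    run-escalation-script
    command_line    /usr/local/bin/my-webserver-handling-script "$HOSTNAME$" "$SERVICEDESC$" "$SERVICESTATE$"
}

# A "contact" that is really a script, not a person.
define contact{
    contact_name                    escalation-script
    use                             generic-contact
    alias                           Escalation script
    service_notification_options    c                      ; critical only, so recoveries do not run it
    service_notification_commands   run-escalation-script
}

define contactgroup{
    contactgroup_name   escalation-scripts
    members             escalation-script
}

# Applies to the second notification only, i.e. about one hour after the first.
define serviceescalation{
    host_name               gravity-mon
    service_description     CPU load
    first_notification      2
    last_notification       2
    notification_interval   60
    escalation_options      c
    contact_groups          defcon3,escalation-scripts     ; keep defcon3 so the humans are still notified
}
```

The contact groups listed in an escalation are the ones notified while it applies, which is why defcon3 is repeated there alongside the script's group. The webserver example would follow the same pattern, with first_notification 3 and contact_groups sysadmins,managers.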