Nagios: Send Escalation Alert if issue is Acknowledged but not Recovered (OK State)

monitoring nagios

I think I know the answer (not possible) – but want to see if anyone has a clever idea or perhaps I'm just wrong about the issue.

Goal

We want our shift managers to be notified of service outages under these conditions:

  • The service has been down for a defined period.
  • The notification is sent even if the issue has been acknowledged.

From the Nagios docs:

For notifications:

Notifications are escalated if and only if one or more escalation
definitions matches the current notification that is being sent out.

For acknowledgements:

Allows you to acknowledge the current problem for the specified
service. By acknowledging the current problem, future notifications
(for the same servicestate) are disabled. If the "sticky" option is set
to one (1), the acknowledgement will remain until the service
returns to an OK state. Otherwise the acknowledgement will
automatically be removed when the service changes state. If the
"notify" option is set to one (1), a notification will be sent out to
contacts indicating that the current service problem has been
acknowledged. If the "persistent" option is set to one (1), the
comment associated with the acknowledgement will survive across
restarts of the Nagios process. If not, the comment will be deleted
the next time Nagios restarts.

My understanding is that if the issue is acknowledged then there are no further notifications – I assume this applies to escalation notifications as well?

I don't see a way around this.

Our workflow requires the L1 team to acknowledge the issue if they can handle it and to escalate as needed. However, we would like to put an automatic process in place to ensure these escalations happen.

Nagios is where I would like to do this, but if that is not possible, we may have to handle it on the ticketing side.

Thanks!

Best Answer

I have a Perl script that does this. You simply need to scan the 'status.dat' file for:

host checks > last_time_up = <value>
service checks > last_time_ok = <value>

Both store an epoch timestamp. If that timestamp is more than a chosen number of seconds older than the current time, the script adds the check_description and/or host_name to the email it sends out. My script also checks the 'problem_has_been_acknowledged' entry and tells me whether the problem has been Ack'ed. The script runs from a crontab entry every 30 minutes and sends out a listing of all host/service checks that matched.
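
For anyone who wants a starting point, here is a rough Python sketch of the scan described above. It is not the Perl script itself, only an illustration under some assumptions: that 'status.dat' is laid out as `hoststatus { ... }` and `servicestatus { ... }` blocks of `key=value` lines, and that service blocks name the check in a `service_description` field. The function names `parse_status` and `find_stale` are made up for this example.

```python
import time

# Assumed mapping from status.dat block type to its "last healthy" field.
LAST_GOOD_FIELD = {"hoststatus": "last_time_up", "servicestatus": "last_time_ok"}


def parse_status(text):
    """Yield (block_type, fields) for each 'type { key=value ... }' block."""
    block_type, fields = None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if block_type is None:
            # A block opens with a line such as "servicestatus {".
            if line.endswith("{"):
                block_type, fields = line[:-1].strip(), {}
        elif line == "}":
            yield block_type, fields
            block_type = None
        elif "=" in line:
            # Split on the first "=" only, since values may contain "=".
            key, _, value = line.partition("=")
            fields[key] = value


def find_stale(text, max_age, now=None):
    """Return checks whose last good state is more than max_age seconds old."""
    now = time.time() if now is None else now
    stale = []
    for block_type, fields in parse_status(text):
        key = LAST_GOOD_FIELD.get(block_type)
        if key is None or key not in fields:
            continue
        age = int(now - int(fields[key]))
        if age > max_age:
            stale.append({
                "host": fields.get("host_name"),
                # None for host blocks; the field name is an assumption.
                "service": fields.get("service_description"),
                "age": age,
                "acknowledged": fields.get("problem_has_been_acknowledged") == "1",
            })
    return stale
```

You would read the contents of your 'status.dat' (its location depends on your install), pass them to `find_stale` with a threshold such as 1800 seconds, and mail the resulting list from cron. Note that a check whose timestamp is 0 will show up with a very large age, so you may want to treat that case separately.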
