Linux – Nagios – Not sure which interval should be changed in order to limit number of times a notification is sent when an error occurs


I have a Nagios server which monitors many servers.
From time to time we encounter an error that cannot be resolved right away, so we leave it for the time being.
When it happens, we keep getting email notifications regarding the failing service.
So if we don't deal with the problem within the next day, we receive about 500 email notifications about it.
Now my question is: what is the difference between notification_interval and interval_length, and which value should I be editing?
I want to configure it so that when an error occurs I get only 1 notification about the issue instead of, for example, 10 notifications per hour.
I want Nagios to send me an email once when an error occurs and then every 12 hours until the error is fixed.
How can it be achieved?

Best Answer

You should probably leave these settings alone and use the acknowledgement feature in Nagios.

This lets you tell Nagios that you know about the issue. It will then suppress further notifications until the status changes (e.g. it gets worse, starts flapping, or the error goes away, in which case the alerts will also stop).

See Acknowledge_Host_Problem for a better explanation of what this does. Sorry, I can't find a more current page than this, but it explains the concept well enough.

To directly answer your question, even though I think there is a better way:

  • interval_length is the number of seconds in one time unit; the default is 60.
  • notification_interval is the number of those time units you want between notifications. If you leave interval_length at its default, this is the number of minutes between notifications.

So to get 12 hours between notifications, you could set notification_interval to 720, and leave interval_length alone.
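
As a rough sketch of what that could look like in the configuration files (the host name, service description, check command and the generic-service template below are placeholders I've assumed for illustration, not values from the question):

```
# nagios.cfg (main configuration file)
# Length of one time unit in seconds; 60 is the default.
interval_length=60

# Object configuration file (wherever your service definitions live).
# "web01", "HTTP", "check_http" and the "generic-service" template are
# placeholder names assumed for this example.
define service {
    use                     generic-service
    host_name               web01
    service_description     HTTP
    check_command           check_http
    notification_interval   720    ; 720 units x 60 s = 12 hours between repeat notifications
}
```

If you would rather get exactly one problem notification and no repeats at all, setting notification_interval to 0 tells Nagios not to re-notify. Either way, it is worth validating the files with nagios -v followed by the path to your nagios.cfg before reloading.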

But I still think acknowledgement is the better approach, because it lets Nagios keep nagging your team until they take some sort of action.

Note that, either way, Nagios may still send notifications more frequently depending on what is going on. I've had alerts relating to CPU use where the value oscillated between just above and just below the critical threshold. No matter what I did, an alert went out every time it crossed the critical threshold. Flapping detection in Nagios exists to handle these situations. Alternatively, you might want to revisit your alert thresholds.
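
For the oscillating case, a minimal sketch of enabling flap detection might look like this. The threshold percentages are illustrative rather than recommendations, and the host, service and check command names are hypothetical:

```
# nagios.cfg: global switch for flap detection
enable_flap_detection=1

# Object configuration: per-service flap settings.
# "web01", "CPU Load", "check_cpu" and "generic-service" are placeholder
# names assumed for this example.
define service {
    use                     generic-service
    host_name               web01
    service_description     CPU Load
    check_command           check_cpu
    flap_detection_enabled  1
    low_flap_threshold      5.0     ; % state change below which flapping is considered over
    high_flap_threshold     20.0    ; % state change above which the service counts as flapping
}
```

While a service is marked as flapping, Nagios holds back the individual problem and recovery notifications for it, so a value bouncing around the threshold should produce far fewer emails.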
