Good monitoring, alerting tool with a trouble ticketing system + de-duplication & intelligent suppression of alerts

Tags: alerts, monitoring, nagios

I have been a Nagios user for a long time.

Of late, as the size of our server fleet grew, so did the number of alerts from Nagios. The signal-to-noise ratio has become very low.
E.g. when a common service fails, all my load-balanced web servers that use that service (and hence check for it) start alerting. Mixed in with the system alerts from that service itself, which can arrive in any order, this leads to a lot of noise.

I can spend a lot of time making sure my Nagios configurations are good, but it is increasingly becoming unmanageable. I am looking for a tool (or Nagios plugin) that does de-duplication and intelligent suppression of alerts.
Also, I would want "issues"/outages to be tracked in a trouble ticketing system, so that there is one place for anyone to get a good handle on what's happening with an issue, and also to look at the archive.

Yes, I can do it to some extent in Nagios, but it's not great.

While looking, I found tons of tools (http://www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html#public), but nobody seems to be talking about problems like de-duplication, issue tracking and management.

Best Answer

I'd say your best bet is OpenNMS with RT or OTRS integration. Unlike Nagios, it's a complete SNMP management solution with an FCAPS (fault/configuration/accounting/performance/security management) focus. How well it tackles each one of those categories is sort of up to the implementer. It's a great solution for people who are looking to "upgrade" from Nagios and have a Cacti server sitting around doing similar things. The integration of the performance and fault data is absolutely indispensable. The documentation is sort of behind the current state of the product, but I've been personally working on this as of late.

If you want to give it a try, go ahead and follow the quick start instructions on the opennms.org wiki, but stop at "discovery", and take a look at the new provisiond feature whitepaper. It's a great migration tool as well.

The event-based system it provides triggers alarms for an alarm panel and notifications for... notifications. These can be phone calls via Asterisk, pages, email, Twitter, etc. When you or on-call staff are notified, you can reply to the email with the word "ack" and have the notification acknowledged and your ticket updated with start times, etc.

The separation of notifications and alarms is a great feature for your de-duplication request. Depending on what's going on, you can collapse duplicate alarms with a reduction key and only be notified once a threshold is reached (while still having every alarm recorded, so you keep the data). There are some advanced correlation features, but I haven't really dug into them.
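
To make the reduction-key idea a bit more concrete, here is a rough Python sketch of the general technique. This is not OpenNMS code or its API: the names `AlarmTable`, `handle_event`, and `notify_threshold`, and the `service:message` key format, are all made up for illustration. Events that share a key are folded into one alarm with a counter, every event is still recorded, and a notification goes out only once, when the counter reaches the threshold.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    """One de-duplicated alarm; repeated events only bump the counter."""
    reduction_key: str
    count: int = 0
    events: list = field(default_factory=list)


class AlarmTable:
    """Hypothetical alarm table, not the OpenNMS implementation."""

    def __init__(self, notify_threshold=3):
        self.notify_threshold = notify_threshold
        self.alarms = {}          # reduction key -> Alarm
        self.notifications = []   # messages that would be sent to on-call

    def handle_event(self, source, service, message):
        # Assumed key format: events about the same service and symptom
        # collapse into one alarm, whichever host reported them.
        key = f"{service}:{message}"
        alarm = self.alarms.setdefault(key, Alarm(key))
        alarm.count += 1
        alarm.events.append((source, message))  # keep the raw data
        # Notify exactly once, when the threshold is first reached.
        if alarm.count == self.notify_threshold:
            self.notifications.append(f"{key} reported {alarm.count} times")
        return alarm


# Four web servers complain about the same failed backend service.
table = AlarmTable(notify_threshold=3)
for host in ["web1", "web2", "web3", "web4"]:
    table.handle_event(host, "auth-backend", "connection refused")
```

After the loop, `table.alarms` holds a single alarm with a count of 4 and all four events attached, and `table.notifications` holds one message rather than four. A real setup would also need to clear the alarm when the service recovers, which this sketch leaves out.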
