Geographically distributed, fault-tolerant and “intelligent” application/host monitoring systems

monitoring · nagios · sla

Greetings,

I'd like to ask the collective's opinion and view on distributed monitoring systems: what do you use, and what are you aware of which might tick my boxes?

The requirements are quite complex:

  • No single point of failure. Really. I'm dead serious! It needs to be able to tolerate single or multiple node failures, both 'master' and 'worker', and you may assume that no monitoring location ("site") has multiple nodes in it or shares a network with another site. This probably rules out traditional HA techniques such as DRBD or Keepalived.

  • Distributed logic. I would like to deploy 5+ nodes across multiple networks, within multiple datacentres and on multiple continents. I want the "bird's-eye" view of my network and applications from the perspective of my customers; bonus points if the monitoring logic doesn't get bogged down when you have 50+ nodes, or even 500+ nodes.

  • Needs to be able to handle a fairly reasonable number of host/service checks, à la Nagios; for ballpark figures assume 1500-2500 hosts and 30 services per host. It'd be really nice if adding more monitoring nodes allowed you to scale relatively linearly; perhaps in 5 years' time I might be looking to monitor 5000 hosts and 40 services per host! Following on from my note above about 'distributed logic', it'd be nice to say:

    • In normal circumstances, these checks must run on $n or n% of monitoring nodes.
    • If a failure is detected, run checks on another $n or n% of nodes, correlate the results, and then use them to decide whether the criteria have been met to issue an alert.
  • Graphs and management-friendly features. We need to track our SLAs, and knowing whether our 'highly available' applications are up 24×7 is somewhat useful. Ideally your proposed solution should do reporting "out of the box" with minimal faff.

  • Must have a solid API or plugin system for developing bespoke checks.

  • Needs to be sensible about alerts. I don't necessarily want to know (via SMS, at 3am!) that one monitoring node reckons my core router is down. I do want to know if a defined percentage of them agree that something funky is going on 😉 Essentially what I'm talking about here is "quorum" logic, or the application of sanity to distributed madness! (There's a rough sketch of what I mean just below this list.)
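To make the quorum idea a bit more concrete, here's a very rough sketch of the kind of correlation I'm after. Everything in it (the helper function, the node names, the 60% threshold) is made up purely for illustration; I'm not proposing any particular implementation:

```python
# Rough sketch of the quorum/alert logic I have in mind.
# All names and thresholds here are made-up examples.

QUORUM_FRACTION = 0.6   # only alert if >= 60% of reporting nodes agree

def check_from_node(node, host, service):
    """Hypothetical helper: ask one monitoring node for its view of a
    host/service. Replace the body with an NRPE or API call; here it
    just pretends everything is up."""
    return False  # False = node thinks the service is UP

def quorum_alert(nodes, host, service):
    """Collect one vote per reachable monitoring node and decide whether
    enough of them agree the service is down to justify an alert."""
    votes = []
    for node in nodes:
        try:
            votes.append(check_from_node(node, host, service))
        except (OSError, TimeoutError):
            # an unreachable monitoring node should not count as a vote
            continue
    if not votes:
        return False  # no data at all; escalate that separately
    down_votes = sum(1 for v in votes if v)
    return (down_votes / len(votes)) >= QUORUM_FRACTION

if __name__ == "__main__":
    nodes = ["mon-eu-1", "mon-us-1", "mon-ap-1"]  # made-up node names
    print(quorum_alert(nodes, "core-router-1", "PING"))
```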

I'm willing to consider both commercial and open source options, although I'd prefer to steer clear of software costing millions of pounds 🙂 I'm also willing to accept there may be nothing out there which ticks all those boxes, but I wanted to put it to the collective anyway.

When thinking about monitoring nodes and their placement, bear in mind most of these will be dedicated servers on random ISPs' networks and thus largely out of my sphere of control. Solutions which rely on BGP feeds and other complex networking antics likely won't suit.

I should also point out that I've either evaluated, deployed or heavily used/customized most of the open source flavours in the past including Nagios, Zabbix and friends — they're really not bad tools but they fall flat on the whole "distributed" aspect, particularly with regards to the logic discussed in my question and 'intelligent' alerts.

Happy to clarify any points required. Cheers guys and gals 🙂

Best Answer

Not really an answer, but some pointers:

  • Definitely take a look at the presentation about Nagios at Goldman Sachs. They faced the problems you mention: redundancy, scalability (thousands of hosts), and automated configuration generation.

  • I had a redundant Nagios setup, but at a much smaller scale: 80 servers, ~1k services in total. One dedicated master server, and one slave server pulling configuration from the master a few times a day. Both servers monitored the same machines and cross-checked each other's health. I used Nagios mostly as a framework for invoking custom, product-specific checks [a bunch of cron jobs executing scripts doing 'artificial flow controls'; results were logged to SQL, and NRPE plugins checked for successful/failed executions of those in the last x minutes]. It all worked very nicely.

  • Your quorum logic sounds good; it's a bit similar to my 'artificial flows'. Basically, go ahead and implement it yourself ;-], and have NRPE just check some kind of flag [or an SQL db with timestamp + status] to see how things are doing (rough sketch after this list).

  • You'll probably want to build some hierarchy to scale: some nodes gathering an overview of other nodes. Do look at the presentation from the first point. Nagios's default behaviour of forking for every single check is overkill at a higher number of monitored services.
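As a purely illustrative version of the 'flag in an SQL db with timestamp + status' idea: an NRPE-style check could look roughly like the sketch below. The table, column names, DB path and the 15-minute staleness threshold are assumptions for the example, not what I actually ran:

```python
#!/usr/bin/env python3
# Minimal sketch of an NRPE-style check that looks at the last run of an
# 'artificial flow' recorded in an SQL table (timestamp + status).
# Table/column names, DB path and the threshold are made-up examples.
import sqlite3
import sys
import time

DB_PATH = "/var/lib/flows/results.db"   # hypothetical path
MAX_AGE = 15 * 60                        # seconds: how stale is acceptable

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def main(flow_name):
    try:
        con = sqlite3.connect(DB_PATH)
        row = con.execute(
            "SELECT ts, status FROM flow_results "
            "WHERE flow = ? ORDER BY ts DESC LIMIT 1", (flow_name,)
        ).fetchone()
    except sqlite3.Error as exc:
        print(f"UNKNOWN - cannot read results db: {exc}")
        return UNKNOWN

    if row is None:
        print(f"CRITICAL - no results recorded for flow '{flow_name}'")
        return CRITICAL

    ts, status = row
    age = time.time() - ts
    if age > MAX_AGE:
        print(f"CRITICAL - last result for '{flow_name}' is {int(age)}s old")
        return CRITICAL
    if status != "ok":
        print(f"CRITICAL - flow '{flow_name}' reported '{status}'")
        return CRITICAL
    print(f"OK - flow '{flow_name}' succeeded {int(age)}s ago")
    return OK

if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "default-flow"))
```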

To answer some questions:

  • In my case the monitored environment was a typical master-slave setup [primary SQL or app server + hot standby], no master-master.
  • My setup involved a 'human filtering factor': a resolver group acting as a 'backup' for SMS notification. There was already a paid group of technicians who, for other reasons, worked 24/5 shifts; they got 'checking Nagios mails' as an additional task that didn't put too much load on them, and they were in charge of making sure that the DB admins / IT ops / app admins were actually getting up and fixing problems ;-]
  • I've heard lots of good things about Zabbix for alerting and plotting trends, but I've never used it. For me, Munin does the trick; I hacked together a simple Nagios plugin that checks whether there is 'any red' [critical] colour on Munin's list of servers, just as an additional check. You can also read values from Munin's RRD files to reduce the number of queries you send to the monitored machine; a rough sketch of that is below.
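For example, a rough sketch of pulling the latest value straight out of a Munin RRD file via the rrdtool CLI instead of querying the monitored machine again. The RRD path, the threshold and the assumption of a single data source are all made up; Munin's actual file layout depends on your install:

```python
#!/usr/bin/env python3
# Rough sketch: read the most recent value out of a Munin RRD file via the
# rrdtool CLI instead of querying the monitored machine again.
# The RRD path and the warning threshold are made-up examples.
import subprocess
import sys

RRD = "/var/lib/munin/example.com/host-load-load-g.rrd"  # hypothetical path
WARN_ABOVE = 5.0

def last_value(rrd_path):
    """Return the newest non-NaN value from the RRD, or None."""
    # 'rrdtool fetch <file> AVERAGE --start -600' prints lines of
    # "<timestamp>: <value>"; keep the newest value that isn't NaN.
    out = subprocess.run(
        ["rrdtool", "fetch", rrd_path, "AVERAGE", "--start", "-600"],
        capture_output=True, text=True, check=True,
    ).stdout
    value = None
    for line in out.splitlines():
        if ":" not in line:
            continue  # skip the data-source header line
        _, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        if fields[0].lower() not in ("nan", "-nan"):
            value = float(fields[0])
    return value

if __name__ == "__main__":
    v = last_value(RRD)
    if v is None:
        print("UNKNOWN - no recent data in RRD")
        sys.exit(3)
    if v > WARN_ABOVE:
        print(f"WARNING - load is {v}")
        sys.exit(1)
    print(f"OK - load is {v}")
    sys.exit(0)
```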