Nagios automation at large scale

chef nagios scalability

I would like to know if you have any experience with, or ideas about, setting up Nagios at large scale.

Previously we used Nagios with NagiosQL for manual configuration, which was comfortable enough for a few servers.

Recently the number of servers has grown and manual configuration through NagiosQL has become impractical. We use Chef to start new instances, and I would like to know whether there are good practices for using Chef and Nagios together. One option is to use plain Nagios and rewrite its configuration files (based on server role) every time we start a new instance.

For example, the scenario could look like this: after a new MySQL server starts, a dedicated recipe rewrites the Nagios configuration file. The recipe can get data about every server from Chef data bags and build the configuration based on the roles defined in Chef.

Best Answer

I've implemented three slightly different solutions for Nagios monitoring using Chef over the last 18 months. They're all based around Chef's template resource for generating configuration files using the ERB syntax, and that bit has worked really well. You start with a Ruby array or hash of hosts and services, and the template generates the Nagios configuration files from it. It's pretty easy to test and debug.
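To make the ERB approach concrete, here is a minimal standalone sketch of rendering host and service definitions from a Ruby array. It is not the cookbook described above: the host data, the `render_nagios_config` helper, and the `generic-host` / `generic-service` template names are assumptions for illustration, and in a real recipe the template would normally live in the cookbook and be rendered by Chef's template resource.

```ruby
require 'erb'

# Hypothetical host data; in a recipe this would come from a data bag or a search.
HOSTS = [
  { name: 'db01',  address: '10.0.0.11', checks: %w[check_load check_disk] },
  { name: 'web01', address: '10.0.0.21', checks: %w[check_load] }
].freeze

# ERB template producing one host block plus one service block per check.
# Assumes 'generic-host' and 'generic-service' templates and matching
# check commands are defined elsewhere in the Nagios configuration.
HOST_TEMPLATE = <<~'ERB_TEMPLATE'
  <% hosts.each do |host| -%>
  define host {
    use        generic-host
    host_name  <%= host[:name] %>
    address    <%= host[:address] %>
  }
  <% host[:checks].each do |check| -%>
  define service {
    use                  generic-service
    host_name            <%= host[:name] %>
    service_description  <%= check %>
    check_command        <%= check %>
  }
  <% end -%>
  <% end -%>
ERB_TEMPLATE

# Renders the template with the given hosts and returns the config as a string.
def render_nagios_config(hosts)
  ERB.new(HOST_TEMPLATE, trim_mode: '-').result_with_hash(hosts: hosts)
end

puts render_nagios_config(HOSTS)
```

Because the rendering is a plain function of the input data, it can be checked in isolation before the output ever reaches a Nagios server, which is a large part of why this approach is easy to test and debug.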

  1. Completely data bag-based configuration. In this case there's a nagios_hosts and a nagios_services data bag, and each host has a key that says which service checks get run, e.g. check_load, check_disk. This setup is quick to get going and works reasonably well, although if hosts are deleted or new ones added, someone has to be around to update the data bags. In practice it's easy to forget about this, and the data bags drift out of date, which can lead to trouble.
  2. Chef attribute-based configuration. Here I used the Chef REST API to query one or more Chef servers to pull down lists of nodes and assign service checks to them based on the roles they were assigned. Having a dependency on Chef means that it's difficult to monitor non-Chef systems, e.g. appliances, network devices, or nodes that don't run Chef for whatever reason. Chef ends up sending a huge amount of JSON data over the network for large numbers of nodes, and processing all this data puts a load on the Chef server(s) as well as on the Nagios server when it generates configuration files.
  3. Rails app generating Nagios configuration files. I ended up breaking the Chef dependency by storing Nagios configuration information in a database and having a Rails app generate the configuration files. Each Nagios server makes a REST request and downloads its configuration files, which are generated with ERB from a MySQL database. It's quite a bit of work to get this going, but so far it's working well for monitoring Chef and non-Chef nodes.
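The role-based assignment in option 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: the `CHECKS_BY_ROLE` mapping, the `checks_for` and `service_map` helpers, and the node list are all hypothetical, and the static `NODES` array stands in for whatever node data is actually pulled from the Chef server.

```ruby
# Hypothetical role-to-check mapping; role and check names are illustrative.
CHECKS_BY_ROLE = {
  'base'  => %w[check_load check_disk],
  'mysql' => %w[check_mysql],
  'web'   => %w[check_http]
}.freeze

# Stand-in for node data retrieved from a Chef server.
NODES = [
  { name: 'db01',   roles: %w[base mysql] },
  { name: 'web01',  roles: %w[base web] },
  { name: 'misc01', roles: %w[unknown] }
].freeze

# Returns the de-duplicated checks for a node, in role order.
# Roles with no mapping contribute no checks.
def checks_for(node)
  node.fetch(:roles, []).flat_map { |role| CHECKS_BY_ROLE.fetch(role, []) }.uniq
end

# Builds a hash of node name => list of checks.
def service_map(nodes)
  nodes.to_h { |node| [node[:name], checks_for(node)] }
end

service_map(NODES).each do |name, checks|
  puts "#{name}: #{checks.join(', ')}"
end
```

Reducing each node to a name and a list of roles before generating anything also keeps the Nagios side independent of the full node JSON, which may help with the data volume problem mentioned in option 2.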

So after going through all of that, I would probably recommend using something like option #2 for small numbers (tens to hundreds) of nodes. I would try to keep it simple though. I used Chef's attribute system to define and override thresholds for the service checks based on roles, and while it works, it's way too complicated and the cookbook has ended up becoming an unmaintainable mess.
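As one possible way to keep threshold overrides simple, the sketch below uses a flat hash of defaults plus a single level of per-role overrides, rather than the attribute-based setup described above. The threshold values, role names, and the `thresholds_for` helper are assumptions made up for the example.

```ruby
# Hypothetical default warning/critical thresholds per check.
DEFAULT_THRESHOLDS = {
  'check_load' => { warn: '5,4,3', crit: '10,8,6' },
  'check_disk' => { warn: '20%',   crit: '10%' }
}.freeze

# Hypothetical per-role overrides; only the values that differ are listed.
ROLE_OVERRIDES = {
  'mysql' => { 'check_load' => { warn: '8,6,4' } }
}.freeze

# Merges the overrides of each role, in order, on top of the defaults.
# Later roles win; settings a role does not mention keep their earlier value.
def thresholds_for(roles)
  roles.reduce(DEFAULT_THRESHOLDS) do |acc, role|
    acc.merge(ROLE_OVERRIDES.fetch(role, {})) do |_check, base, override|
      base.merge(override)
    end
  end
end

puts thresholds_for(%w[base mysql])['check_load']
```

With only one override level, the effective thresholds for any node can be worked out by reading two small hashes, which is easier to reason about than values spread across several attribute precedence levels.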

Good luck!