Monitoring / metric collection for server fleets that change a lot over time (a.k.a. cloud)

cloud metrics monitoring

When your server fleet doesn't change much over time, as when you're using bare-metal hosting, classic monitoring and metric collection solutions (Nagios, Munin) work well.

But if the number of systems varies a lot over time, and may in fact vary rapidly, classic software is more difficult to set up and use. E.g., trying to make Nagios (monitoring) keep up with a rapidly evolving cloud infrastructure can be cumbersome. The same goes for Munin (metric collection). It's not just the configuration: the way the information is conveyed to the user, or displayed, is also inadequate for the cloud.

What are some possible alternatives that work well with the cloud? The goals are to collect and display metrics (analogous to Munin), to generate alerts when certain metrics go out of bounds or when certain services are unavailable (analogous to Nagios), and to do everything in a cloud-friendly manner.

Some cloud providers offer monitoring / metric collection as a service, but not all do, and if you use more than one provider you don't want to become too dependent on just one vendor. So provider-independent solutions are required.

EDIT: I am asking this question in a general fashion – not limited to any given cloud infrastructure (like OpenStack), but in the general case of using arbitrary cloud providers.

Best Answer

For systems that are short-lived or where the infrastructure changes often, I use two different tools to handle monitoring. I added a comment asking which metrics were most important to you, and it seems like you're looking for basic "what happened when?" monitoring stats with some alerting...

As systems and hardware are abstracted more via cloud services and virtualization, some of the traditional monitoring tools are less useful because you may not care about physical hardware resources and health. Application and virtual resources (from the perspective of the VM/instance/container) are what matter.

Both of the examples I give below are entirely hands-off and a default in my environments. Reinforced by Puppet, I can ensure that all systems are capturing and reporting their performance.

Pick #1 - New Relic

New Relic monitoring is agent-based and quite easy to slipstream into a provisioning or configuration management system. In my case, every server I deploy gets a Puppetized New Relic configuration, registers itself with my New Relic account and is available in the monitoring dashboard about 30-60 seconds after install. The host pushes data over standard ports, so this works well across environments. The system can unregister itself on teardown.

The main positives are 60-second granularity, a live dashboard/kiosk view, free server monitoring, and a clean presentation that is acceptable to end users and clients.
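To illustrate why the register / push / unregister lifecycle suits a fleet that changes often, here is a minimal Python sketch of a push-based collector. It is a generic toy model, not New Relic's API: the class `HostRegistry`, its methods and the `stale_after` threshold are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional
import time


@dataclass
class HostRegistry:
    """Toy in-memory collector for a push-based model (hypothetical, not a real API)."""

    # Seconds without a report before a host is considered missing.
    stale_after: float = 120.0
    # Maps host name to the timestamp of its most recent report.
    last_seen: Dict[str, float] = field(default_factory=dict)

    def report(self, host: str, now: Optional[float] = None) -> None:
        # The first report doubles as registration, so no central
        # configuration has to be edited when a new server appears.
        self.last_seen[host] = time.time() if now is None else now

    def unregister(self, host: str) -> None:
        # Meant to be called from a teardown hook, so that a deliberate
        # shutdown does not show up as an outage.
        self.last_seen.pop(host, None)

    def stale_hosts(self, now: Optional[float] = None) -> List[str]:
        # Hosts that stopped reporting without unregistering first.
        current = time.time() if now is None else now
        return sorted(
            host
            for host, seen in self.last_seen.items()
            if current - seen > self.stale_after
        )
```

In this model a host that is torn down cleanly simply disappears from the registry, while a host that goes silent without unregistering is the one worth alerting on. The pull-based alternative needs the central server to know every host in advance, which is the part that becomes cumbersome when instances come and go.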


Pick #2 - Monit and M/Monit

Monit is incredibly handy for application and basic system monitoring. Monit is an agent that is easily installed on target systems via native OS package management. It can be tailored to monitor custom applications and their relevant parameters, as well as taking actions based on those metrics. M/Monit adds a degree of centralization to the Monit checks, and allows you to aggregate data for analysis and light graphing.

Being agent-based, it's also easy to push configs to hosts in an automated fashion. I also use Puppet for this, with some creative templating to build the configuration files. Upon initialization, new servers register with the central M/Monit daemon over HTTP/HTTPS ports, so firewalls and monitoring across multiple locations are not an issue.
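As a rough idea of what templating a Monit configuration looks like, here is a small Python sketch that renders a control file from a few parameters. It stands in for the Puppet templates mentioned above, which are not shown in this answer. The function name `render_monitrc`, the M/Monit URL, the credentials and the nginx paths are placeholders assumed for the example; the Monit statements themselves (`set daemon`, `set mmonit`, `check process`) are standard control-file syntax.

```python
from string import Template

# Skeleton of a Monit control file; $-placeholders are filled in per host.
MONITRC_TEMPLATE = Template("""\
set daemon $interval
set mmonit $mmonit_url
check process $name with pidfile $pidfile
  start program = "$start_cmd"
  stop program = "$stop_cmd"
  if failed port $port protocol http then restart
""")


def render_monitrc(name, pidfile, start_cmd, stop_cmd, port, mmonit_url, interval=60):
    """Return the text of a Monit control file for one monitored service."""
    # substitute() raises KeyError if a placeholder is left unfilled,
    # which catches template mistakes before a broken file is deployed.
    return MONITRC_TEMPLATE.substitute(
        interval=interval,
        mmonit_url=mmonit_url,
        name=name,
        pidfile=pidfile,
        start_cmd=start_cmd,
        stop_cmd=stop_cmd,
        port=port,
    )


if __name__ == "__main__":
    # Placeholder values; replace them with your own collector URL and service.
    print(render_monitrc(
        name="nginx",
        pidfile="/var/run/nginx.pid",
        start_cmd="/usr/sbin/service nginx start",
        stop_cmd="/usr/sbin/service nginx stop",
        port=80,
        mmonit_url="https://monit:secret@mmonit.example.com:8443/collector",
    ))
```

The `set mmonit` line is what makes each new host report to the central M/Monit collector, so shipping the same rendered file to every server is enough for it to register on first start.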

