Linux – What metrics should I monitor on a Linux server

linux, monitoring, munin, nagios

I've been tasked with setting up monitoring of 300 servers, doing different things. I've been looking at various tools, such as Nagios, Munin, and others, so I have a pretty good idea of how I can achieve monitoring in the first place.

What I'm wondering is which metrics are usually monitored as a good default when I don't know much about the server. And what are "sane defaults" as far as alerting goes?

My plan is to deploy a monitoring scheme with sane defaults as a start, while I map out the roles of the different systems – which I expect will take some time.

This question can also be asked in a different way:

If you were designing a monitoring-appliance – what should its default Linux-monitoring template contain?

Best Answer

The usual metrics that indicate problems include CPU utilization, memory utilization, load average, and disk utilization. For mail servers, the size of the mail queue is an important indicator. For web servers, the number of busy server processes is an important measure. Excessive network throughput also leads to problems. If you have processes that depend on accurate time, NTP can be an important tool for keeping clocks in sync, and the clock offset is worth monitoring.
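As a rough illustration of how two of these metrics can be sampled, here is a minimal Python sketch using only the standard library. The function names `load_per_cpu` and `disk_used_percent` are made up for this example and are not part of Munin or Nagios. The disk figure is computed as used/total, which can differ slightly from what `df` reports because `df` accounts for reserved blocks.

```python
import os
import shutil


def load_per_cpu():
    """Return the 1-, 5- and 15-minute load averages divided by the CPU count."""
    cpus = os.cpu_count() or 1  # cpu_count() may return None; fall back to 1
    return tuple(avg / cpus for avg in os.getloadavg())


def disk_used_percent(path="/"):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    print("load per CPU (1, 5, 15 min):", load_per_cpu())
    print("disk used % on /:", round(disk_used_percent("/"), 1))
```

In practice a tool such as Munin or a Nagios plugin would collect these for you; the sketch only shows what the numbers mean.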

The default levels I have used are listed below as (warning, critical) pairs. You may want to adjust these values based on a number of factors: higher values reduce the number of alerts, while lower values give you more time to react to developing problems. This might be a suitable starting point for a template.

  • Sustained CPU utilization (80%, 100%). Exclude time for niced processes.
  • Load average per CPU (2, 5).
  • Disk utilization per partition (80%, 90%).
  • Mail queue (10, 50). Use lower values on non-mail servers.
  • Busy web server processes (10, 25).
  • Network throughput (80%, 100%). Network backups and other such processes may exceed these values. I would use throttling settings if they are available.
  • NTP offset in seconds (0.2, 1).
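
The template above can be sketched as a small lookup table plus a classification function. This is only an illustration in Python: the key names and the `classify` helper are hypothetical, and treating a value equal to a threshold as having crossed it is an assumption, not something any particular tool prescribes.

```python
# (warning, critical) pairs taken from the list above.
# The key names are illustrative only.
THRESHOLDS = {
    "cpu_percent": (80, 100),
    "load_per_cpu": (2, 5),
    "disk_percent": (80, 90),
    "mail_queue": (10, 50),
    "busy_web_servers": (10, 25),
    "network_percent": (80, 100),
    "ntp_offset_seconds": (0.2, 1),
}


def classify(metric, value):
    """Return "ok", "warning" or "critical" for a single sampled value."""
    warning, critical = THRESHOLDS[metric]
    # Use the magnitude so that a negative NTP offset is handled too;
    # the other metrics are never negative, so this does not affect them.
    magnitude = abs(value)
    if magnitude >= critical:  # assumption: reaching the threshold counts
        return "critical"
    if magnitude >= warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(classify("disk_percent", 85))
    print(classify("load_per_cpu", 1.0))
    print(classify("ntp_offset_seconds", -1.5))
```

Note that this checks a single sample. For the "sustained" CPU check you would want the threshold to be exceeded over several consecutive samples before alerting.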

Munin does a good job of gathering these statistics and others. It can also trigger alarms when thresholds are passed, although its warning capabilities are not as good as those of Nagios. Its gathering and display of historical data make it a good choice for reviewing whether current values differ significantly from past values. It is easy to set up and can be run without generating warnings. The main problems are the volume of data captured and the fixed frequency at which it gathers information. You may want to generate graphs on demand. Munin provides many of the statistics I would check using sar when a system is in trouble. Its overview page is useful for identifying possible problems.

Nagios is very good at alerting, but has historically not been very good at gathering historical data in a manner suitable for comparison with current values. It appears this is changing, and the new release is much better at gathering this data. It is a good choice for generating warnings when there are problems, and for scheduling outages during which alerts are not generated. It is particularly good at alerting when services go down, which makes it especially suitable for critical servers and services.