Linux – What metrics should I monitor on a Linux server

linux, monitoring, munin, nagios

I've been tasked with setting up monitoring for 300 servers that do different things. I've been looking at various tools, such as Nagios, Munin, and others, so I have a pretty good idea of how to achieve monitoring in the first place.

What I'm wondering is which metrics are usually monitored as a good default when I don't know much about the server. And what are "sane defaults" as far as alerting goes?

My plan is to deploy a monitoring scheme with sane defaults as a start, while I map out the roles of the different systems – which I expect will take some time.

This question can also be asked in a different way:

If you were designing a monitoring-appliance – what should its default Linux-monitoring template contain?

Best Answer

The usual metrics that indicate problems are CPU utilization, memory utilization, load average, and disk utilization. For mail servers, the size of the mail queue is an important indicator. For web servers, the number of busy server processes is an important measure. Excessive network throughput can also cause problems. If you have processes that depend on accurate time, NTP can be an important tool for keeping clocks in sync, so its offset is worth monitoring too.

The thresholds I have used are listed below as (warning, critical) pairs. You may want to adjust the values based on a number of factors: higher values reduce the number of alerts, while lower values give you more time to react to developing problems. This might be a suitable starting point for a template.

Sustained CPU utilization (80%, 100%). Exclude time for niced processes.
Load average per CPU (2, 5).
Disk utilization per partition (80%, 90%).
Mail queue (10, 50). Use lower values on non-mail servers.
Busy web servers (10, 25).
Network throughput (80%, 100%). Network backups and similar processes may exceed these values; I would use throttling settings if they are available.
NTP offset in seconds (0.2, 1).
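
As a rough sketch of how the (warning, critical) pairs above could be applied, the shell functions below compare a measured value against both thresholds and compute the load average per CPU. The names classify and load_per_cpu are made up for illustration, and treating a value that equals a threshold as having reached it is an assumption; the list above does not say which way that should go.

```shell
#!/bin/sh
# Sketch: classify a metric against a (warning, critical) pair.
# Assumption: a value equal to a threshold counts as having reached it.

classify() {
    # $1 = measured value, $2 = warning threshold, $3 = critical threshold
    # awk is used so that fractional values such as 0.2 compare correctly.
    awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
        if (v + 0 >= c + 0)      print "CRITICAL"
        else if (v + 0 >= w + 0) print "WARNING"
        else                     print "OK"
    }'
}

load_per_cpu() {
    # 1-minute load average divided by the number of online CPUs.
    cpus=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
    awk -v cpus="$cpus" '{ printf "%.2f\n", $1 / cpus }' /proc/loadavg
}

# Example: apply the (2, 5) pair to the load average per CPU.
if [ -r /proc/loadavg ]; then
    echo "load per CPU: $(classify "$(load_per_cpu)" 2 5)"
fi
```

The same function would cover the percentage-based items, for example classify 85 80 90 for a partition that is 85% full. An NTP offset can be negative, so its absolute value would have to be taken before comparing it against (0.2, 1).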

Munin does a good job of gathering these statistics and others. It can also trigger alarms when thresholds are passed, although its alerting capabilities are not as good as those of Nagios. Because it gathers and displays historical data, it is a good choice for reviewing whether current values differ significantly from past values. It is easy to set up and can be run without generating warnings. The main problems are the volume of data captured and the fixed frequency at which it gathers information. You may want to generate graphs on demand. Munin provides many of the statistics I would check with sar when a system was in trouble. Its overview page is useful for identifying possible problems.
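
To illustrate how a threshold can be attached to a Munin graph, here is a minimal sketch of a plugin for the load average per CPU. Munin calls a plugin with the argument config to learn about the graph and without arguments to fetch values; the .warning and .critical field attributes carry the thresholds. The function name munin_load_per_cpu and the field name load_per_cpu are hypothetical, and the (2, 5) values are simply taken from the list above.

```shell
#!/bin/sh
# Hypothetical Munin plugin sketch: load average per CPU with (2, 5) thresholds.

cpu_count() {
    # Number of online CPUs; falls back to getconf where nproc is missing.
    nproc 2>/dev/null || getconf _NPROCESSORS_ONLN
}

munin_load_per_cpu() {
    if [ "${1:-}" = "config" ]; then
        # Graph description, including the alert thresholds for the field.
        echo "graph_title Load average per CPU"
        echo "graph_vlabel load / CPU"
        echo "graph_category system"
        echo "load_per_cpu.label 1-minute load per CPU"
        echo "load_per_cpu.warning 2"
        echo "load_per_cpu.critical 5"
        return 0
    fi
    # Value fetch: 1-minute load average divided by the CPU count.
    awk -v cpus="$(cpu_count)" \
        '{ printf "load_per_cpu.value %.2f\n", $1 / cpus }' /proc/loadavg
}

munin_load_per_cpu "${1:-}"
```

Keeping the thresholds in the plugin's config output means the graph and the alert limits stay in one place, which suits a default template that is rolled out before the role of each server is known.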

Nagios is very good at alerting, but has historically not been very good at gathering historical data in a form suitable for comparison with current values. This appears to be changing, and the new release is much better at gathering this data. It is a good choice for generating warnings when there are problems, and for scheduling outages during which alerts are not generated. It is particularly good at alerting when services go down, which makes it especially suitable for critical servers and services.
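
Nagios decides on alerts from the exit code of a check plugin: 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN, together with a single line of status text. The sketch below applies that convention to the disk utilization pair (80%, 90%) from the list above. The function names and the wording of the status line are invented for illustration, and the df parsing assumes mount points without spaces.

```shell
#!/bin/sh
# Sketch of a Nagios-style check for disk utilization per partition.
# Plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.

disk_used_percent() {
    # Print the used percentage (without "%") of the filesystem holding $1.
    df -P "$1" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

report_disk() {
    # $1 = mount point, $2 = used percent, $3 = warning, $4 = critical
    if [ -z "$2" ]; then
        echo "DISK UNKNOWN - cannot read usage of $1"; return 3
    elif [ "$2" -ge "$4" ]; then
        echo "DISK CRITICAL - $1 is $2% full"; return 2
    elif [ "$2" -ge "$3" ]; then
        echo "DISK WARNING - $1 is $2% full"; return 1
    fi
    echo "DISK OK - $1 is $2% full"; return 0
}

check_disk() {
    # $1 = mount point, $2 = warning (default 80), $3 = critical (default 90)
    report_disk "$1" "$(disk_used_percent "$1")" "${2:-80}" "${3:-90}"
}

# A real plugin would end with something like:
#   check_disk / 80 90; exit $?
```

Separating the measurement (disk_used_percent) from the decision (report_disk) keeps the threshold logic easy to try out with made-up values before the check is rolled out to all servers.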

Related Solutions

What tool do you use to monitor your servers

I've used Nagios in the past with success. It's very extensible (over 200 add-ons), relatively easy to use, and offers lots of reports. A negative would be the initial setup.

Linux – How to run a server on port 80 as a normal user on Linux

Short answer: you can't. Ports below 1024 can be opened only by root. As pointed out in a comment, you actually can by using CAP_NET_BIND_SERVICE, but applying that capability to the java binary lets every Java program run with it, which is undesirable, if not a security risk.

The long answer: you can redirect connections on port 80 to some other port that you can open as a normal user.

Run as root:

# iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 8080

Connections over the loopback device (such as those to localhost) do not pass through the PREROUTING rules, so if you need the redirect to work for localhost as well, add this rule (thanks @Francesco):

# iptables -t nat -I OUTPUT -p tcp -d 127.0.0.1 --dport 80 -j REDIRECT --to-ports 8080

NOTE: The above solution is not well suited for multi-user systems, as any user can open port 8080 (or any other high port you decide to use), thus intercepting the traffic. (Credits to CesarB).

EDIT: as asked in a comment, to delete the above rule, first list the NAT rules with their line numbers:

# iptables -t nat --line-numbers -n -L

This will output something like:

Chain PREROUTING (policy ACCEPT)
num  target     prot opt source               destination         
1    REDIRECT   tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:8080 redir ports 8088
2    REDIRECT   tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:80 redir ports 8080

The rule you are interested in is number 2, so to delete it:

# iptables -t nat -D PREROUTING 2
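
If the rule number has to be looked up in a script rather than by eye, the listing can be filtered for the REDIRECT rule that matches a given destination port. This is only a sketch: the function name redirect_rule_number is made up, and it assumes the listing has the column layout shown above.

```shell
#!/bin/sh
# Sketch: find the rule number of a REDIRECT rule for a destination port.
# Input on stdin: output of `iptables -t nat --line-numbers -n -L PREROUTING`.

redirect_rule_number() {
    # $1 = destination port to look for (e.g. 80)
    # The "( |$)" part keeps port 80 from also matching dpt:8080.
    awk -v port="$1" \
        '$2 == "REDIRECT" && $0 ~ ("dpt:" port "( |$)") { print $1 }'
}

# Usage (as root):
#   iptables -t nat --line-numbers -n -L PREROUTING | redirect_rule_number 80
```

Rule numbers shift whenever a rule is inserted or deleted, so look the number up immediately before deleting. Alternatively, -D also accepts the same rule specification that was used with -A, which avoids the lookup altogether.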
