Nagios Monitoring – Recommended Warning and Critical Values for check_load

monitoring nagios

Right now I am using these values:

# y = c * p / 100
# y: nagios value
# c: number of cores
# p: desired load in percent

# 4 cores
# time        5 minutes    10 minutes     15 minutes
# warning:    90%          70%            50%
# critical:   100%         80%            60%
command[check_load]=/usr/local/nagios/libexec/check_load -w 3.6,2.8,2.0 -c 4.0,3.2,2.4
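For reference, the formula y = c * p / 100 from the comments above can be scripted so the thresholds follow the core count. This is only a sketch: the function name load_thresholds is made up for illustration, and the core count is hard-coded to the 4 cores from the question.

```shell
#!/bin/sh
# load_thresholds CORES P1 P2 P3 (hypothetical helper, not part of Nagios)
# Prints "y1,y2,y3" where y = cores * percent / 100.
load_thresholds() {
    LC_ALL=C awk -v c="$1" -v p1="$2" -v p2="$3" -v p3="$4" \
        'BEGIN { printf "%.1f,%.1f,%.1f\n", c*p1/100, c*p2/100, c*p3/100 }'
}

cores=4   # assumption: 4 cores, as in the question
warn=$(load_thresholds "$cores" 90 70 50)
crit=$(load_thresholds "$cores" 100 80 60)

# Plugin path copied from the question; adjust it to your installation.
echo "command[check_load]=/usr/local/nagios/libexec/check_load -w $warn -c $crit"
```

With 4 cores this reproduces the -w 3.6,2.8,2.0 -c 4.0,3.2,2.4 line above.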

But these values were picked almost at random.

Does anyone have some tested values?

Best Answer

Linux load is actually simple. Each of the load average numbers is the sum of the average load of all the cores, i.e.

 1 min load avg = load_core_1 + load_core_2 + ... + load_core_n
 5 min load avg = load_core_1 + load_core_2 + ... + load_core_n
15 min load avg = load_core_1 + load_core_2 + ... + load_core_n

where 0 <= avg load < infinity.

So a load of 1 on a 4 core server means either that each core is 25% used or that one core is under 100% load. A load of 4 means all 4 cores are under 100% load. A load of >4 means the server needs more cores.
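To make the arithmetic concrete, here is a small sketch that divides a load average by the core count. The helper name per_core_load is hypothetical. Reading /proc/loadavg and asking getconf for the core count are assumptions about a Linux host, so that part is guarded.

```shell
#!/bin/sh
# per_core_load LOAD CORES (hypothetical helper)
# Prints the load average divided by the number of cores.
per_core_load() {
    LC_ALL=C awk -v l="$1" -v c="$2" 'BEGIN { printf "%.2f\n", l / c }'
}

per_core_load 1 4   # 0.25: a quarter of the 4 cores' capacity is in use
per_core_load 4 4   # 1.00: all 4 cores are fully loaded
per_core_load 6 4   # 1.50: more work than the 4 cores can handle

# On Linux the 1, 5 and 15 minute averages are the first three fields
# of /proc/loadavg (assumption: the file exists on this host).
if [ -r /proc/loadavg ]; then
    read -r one five fifteen _ < /proc/loadavg
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    echo "1 min load per core: $(per_core_load "$one" "$cores")"
fi
```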

check_load now has

 -r, --percpu
    Divide the load averages by the number of CPUs (when possible)

which means that, when it is used, you can think of your server as having just one core and write the percent fractions directly without thinking about the number of cores. With -r the warning and critical values fall in the interval 0 <= load avg <= 1, i.e. you don't have to modify your warning and critical values from server to server.
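As an illustration, the percentages from the question could then be written as fractions, independent of the core count. This assumes the plugin path from the question and a check_load version that supports -r. The three values apply to the 1, 5 and 15 minute averages.

```cfg
# warning:  90%, 70%, 50%   critical: 100%, 80%, 60%
command[check_load]=/usr/local/nagios/libexec/check_load -r -w 0.9,0.7,0.5 -c 1.0,0.8,0.6
```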

The OP has 5, 10 and 15 minutes for the intervals. That is wrong. The load averages cover 1, 5 and 15 minutes.

Related Solutions

Nagios check_total_procs with Default Values

Every server is different - web servers in particular tend to have a lot of processes, especially if they are busy.

The best thing you could do would be to monitor your server over the space of a week of normal operation, and see how many processes are normal for your server, then configure Nagios appropriately.

Don't pay any attention to defaults like this. There is no such thing as a typical server!
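One way to do that week of observation is to log the process count periodically and summarize the log afterwards. The sketch below is an assumption about how you might do it: the log path, the cron idea and the helper name summarize_counts are all made up for illustration.

```shell
#!/bin/sh
# Sample collection, e.g. from cron every 5 minutes, writing one
# "epoch count" pair per line (hypothetical log path):
#   echo "$(date +%s) $(ps -e | tail -n +2 | wc -l)" >> /tmp/proc_counts.log

# summarize_counts FILE (hypothetical helper)
# Prints the minimum, average and maximum of the second column.
summarize_counts() {
    LC_ALL=C awk '
        NR == 1 { min = $2; max = $2 }
        {
            if ($2 < min) min = $2
            if ($2 > max) max = $2
            sum += $2
        }
        END { if (NR > 0) printf "min=%d avg=%.0f max=%d\n", min, sum / NR, max }
    ' "$1"
}
```

After a week, summarize_counts /tmp/proc_counts.log gives the range that is normal for that server, and the warning and critical values can be set somewhat above the observed maximum.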

Debug a Nagios NRPE command

You could let the script log something to a file, e.g.:

ps aux > /tmp/debugfile

An alternative would be using the generic check_procs:

/usr/lib/nagios/plugins/check_procs -c 1:1 -C fail2ban-server
