Nagios Monitoring – Recommended Warning and Critical Values for check_load

monitoring, nagios

Right now I am using these values:

# y = c * p / 100
# y: nagios value
# c: number of cores
# p: desired load percentage

# 4 cores
# time        5 minutes    10 minutes     15 minutes
# warning:    90%          70%            50%
# critical:   100%         80%            60%
command[check_load]=/usr/local/nagios/libexec/check_load -w 3.6,2.8,2.0 -c 4.0,3.2,2.4

But these values were picked almost at random.

Does anyone have some tested values?

Best Answer

Linux load is actually simple. Each of the load average numbers is the sum of the average load of all cores, i.e.

 1 min load avg = load_core_1 + load_core_2 + ... + load_core_n
 5 min load avg = load_core_1 + load_core_2 + ... + load_core_n
15 min load avg = load_core_1 + load_core_2 + ... + load_core_n

where 0 <= avg load < infinity.

So a load of 1 on a 4-core server means either that each core is at 25% load or that one core is at 100% load. A load of 4 means all 4 cores are at 100% load. A load above 4 means there is more work than the cores can handle, so the server needs more cores.
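The arithmetic above can be sketched in a few lines of shell. This is only an illustration: the function name `per_core_percent` is made up for this example, and it assumes an `awk` is available on the system.

```shell
#!/bin/sh
# Hypothetical helper: per-core load as a percentage = load / cores * 100
per_core_percent() {
    load=$1
    cores=$2
    # LC_ALL=C keeps the decimal separator a dot regardless of locale
    LC_ALL=C awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.0f\n", l / c * 100 }'
}

per_core_percent 1 4    # prints 25: load 1 on 4 cores
per_core_percent 4 4    # prints 100: all 4 cores fully loaded
per_core_percent 6 4    # prints 150: more work than the cores can handle
```

On a live Linux box, something like `per_core_percent "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"` should give the 1-minute figure, assuming `/proc/loadavg` and `nproc` are present.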

check_load now has

 -r, --percpu
    Divide the load averages by the number of CPUs (when possible)

which means that, when it is used, you can think of your server as having just one core and write the percentage fractions directly, without thinking about the number of cores. With -r, a value of 1 corresponds to all cores at 100% load, so the warning and critical values fall in the range 0 <= load avg <= 1, i.e. you don't have to modify your warning and critical values from server to server.
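As a sketch, the percentages from the question could be written with -r roughly as follows. It reuses the plugin path and percentages from the question and assumes the installed check_load is recent enough to support -r. The three comma-separated values apply to the 1, 5 and 15 minute averages, in that order.

```
command[check_load]=/usr/local/nagios/libexec/check_load -r -w 0.9,0.7,0.5 -c 1.0,0.8,0.6
```

The same line should then work unchanged on a 4-core or a 16-core server, since the plugin does the division by the core count.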

The OP has 5, 10 and 15 minutes for the intervals. That is wrong: they are 1, 5 and 15 minutes.
