Linux – Optimal configuration of “Processor load is too high” trigger in Zabbix

cpu-usage, linux, performance-monitoring, zabbix

I monitor approximately 10 Linux servers, each with 4 CPU cores, with Zabbix.
Lately I have been receiving way too many false alarms from the "Processor load is too high" trigger.
The trigger expression was:

{Template OS Linux:system.cpu.load[percpu,avg1].avg(5m)}>5 

which is the default.

Then I raised the threshold from 5 to 12 to get fewer alarms, but I suspected this was not the best way to deal with it.
So I did some Googling and constructed a new trigger:

{Template OS Linux:system.cpu.util[,user].max(5m)}>75

I'd like to ask the community:

  1. Will the new expression reflect REAL CPU overload better than the
    original one?
  2. Would you do it differently, better, or in a more optimized way?
  3. How would you compose an expression that does the following?
    The trigger should fire if:

    • the 5-minute average number of processes waiting in the per-CPU queue is more than 3
      AND
    • the maximum CPU utilization during the last 5 minutes is higher than 75 %

I followed the examples in an article and tried:

({Template OS Linux:system.cpu.load[percpu,avg1].avg(5m)}>3
&
{Template OS Linux:system.cpu.util[,user].max(5m)}>75)

but it failed.
The Zabbix server returned this error:
Incorrect trigger expression. Check expression part starting from " & {Template OS Linux:system.cpu.util[,user].max(5m)}>75)".
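
To make the intended logic concrete outside Zabbix, here is a minimal Python sketch of the combined condition. The function name `should_fire`, its parameters, and the sample values are hypothetical and only illustrate the AND of the two checks; this is not Zabbix code. As an aside, Zabbix 2.4 and later spell the logical operators as `and`/`or` rather than `&`/`|`, which is one possible cause of the error above, depending on the server version in use.

```python
from statistics import mean


def should_fire(load_per_cpu, user_util_pct, load_limit=3.0, util_limit=75.0):
    """Return True when both conditions from the question hold.

    load_per_cpu  -- per-CPU 1-minute load samples from the last 5 minutes
    user_util_pct -- CPU user-time utilization samples (%) from the same window
    Both limits default to the thresholds named in the question.
    """
    if not load_per_cpu or not user_util_pct:
        return False  # no data in the window: do not fire
    avg_load = mean(load_per_cpu)   # plays the role of .avg(5m)
    peak_util = max(user_util_pct)  # plays the role of .max(5m)
    return avg_load > load_limit and peak_util > util_limit


# Hypothetical samples, one per minute over a 5-minute window.
print(should_fire([3.5, 4.0, 3.2, 3.8, 4.1], [40, 82, 55, 60, 47]))  # both high
print(should_fire([3.5, 4.0, 3.2, 3.8, 4.1], [40, 52, 55, 60, 47]))  # load only
print(should_fire([0.5, 0.7, 0.4, 0.6, 0.5], [90, 95, 88, 91, 93]))  # util only
```

Only the first call reports a firing trigger, because the other two satisfy just one of the two conditions.
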
Since I'm not a Zabbix expert (yet), any comments will be greatly appreciated.
Thanks.

Best Answer

Why is "Processor load is too high" a false alarm in your case? To me it is a real symptom: the CPU is saturated.

IMHO: use only

{Template OS Linux:system.cpu.load[percpu,avg1].avg(5m)}>5 

but the threshold depends on your server: what it does and how it does it. A per-CPU load above 5 looks suspicious to me, though. Example: CPU usage can be low while CPU load is high; in that case it can be a symptom of slow disk I/O (you will need to check metrics such as CPU iowait usage, disk queue length, ...). Your new combined trigger expression doesn't catch this case.
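
The low-usage, high-load case can be sketched roughly as follows. On Linux the load average counts tasks in uninterruptible (typically disk) wait as well as runnable ones, which is why load can climb while the CPU sits mostly idle. The function `diagnose`, its labels, and every threshold below are assumptions chosen for illustration, not recommended values; in Zabbix the iowait figure would come from an item such as `system.cpu.util[,iowait]`.

```python
def diagnose(load_per_cpu, busy_pct, iowait_pct,
             load_limit=5.0, busy_limit=75.0, iowait_limit=20.0):
    """Classify one observation; all limits are illustrative assumptions.

    load_per_cpu -- load average divided by the number of cores
    busy_pct     -- non-idle, non-iowait CPU time in percent
    iowait_pct   -- CPU time spent waiting for I/O in percent
    """
    if load_per_cpu <= load_limit:
        return "ok"                 # run queue is not saturated
    if busy_pct > busy_limit:
        return "cpu-bound"          # high load and the CPU is busy
    if iowait_pct > iowait_limit:
        return "io-bound"           # high load, CPU mostly waiting on I/O
    return "saturated-unknown"      # high load, cause needs a closer look


# Hypothetical observations.
print(diagnose(1.2, 30.0, 2.0))    # ok
print(diagnose(6.5, 92.0, 1.0))    # cpu-bound
print(diagnose(6.5, 15.0, 60.0))   # io-bound
```

A trigger that also requires high user CPU utilization would stay silent for the third observation, even though the server is clearly struggling.
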

I recommend this article about utilization/saturation by a Senior Performance Architect at Netflix: http://www.brendangregg.com/usemethod.html