Zabbix trigger won’t work for count function

zabbix

I'm trying to raise an alert if a server drops below a free-memory threshold x number of times in a day.

{my_template:vm.memory.size[free].count(1m,5G,lt,1d)}>5
{my_template:vm.memory.size[free].count(1m,5368709120,lt,1d)}>5

I've also tried this while free memory was at 9G, but it failed as well.

{my_template:vm.memory.size[free].count(1m,5G,gt,1d)}>5

Best Answer

The Zabbix documentation for the count function specifies the options as follows:

count (sec|#num,<pattern>,<operator>,<time_shift>)

With regard to time_shift, the documentation explains in more detail what it does:

Several functions support an additional, second time_shift parameter. This parameter allows to reference data from a period of time in the past. For example, avg(1h,1d) will return the average value for an hour one day ago.

Your examples use 1m in the first argument, which means that they only look at a time period of one minute, and by time shifting it 1d, you're looking at a time period of 1 minute, exactly 24 hours ago. That doesn't seem like what you want to watch.
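To make the effect of the time shift concrete, here is a minimal Python sketch of the interval each form would look at. It assumes a simple model in which the window ends time_shift before "now" and spans the period given in the first argument; the function name evaluation_window and the timestamp are made up for illustration and are not part of Zabbix.

```python
from datetime import datetime, timedelta

def evaluation_window(now, period, time_shift=timedelta(0)):
    """Return (start, end) of the interval a period-based function evaluates.

    Assumed model: the window ends `time_shift` before `now` and spans `period`.
    """
    end = now - time_shift
    start = end - period
    return start, end

now = datetime(2024, 1, 2, 12, 0, 0)  # arbitrary example timestamp

# count(1m,...,1d): one minute of data, ending exactly 24 hours ago
shifted = evaluation_window(now, timedelta(minutes=1), timedelta(days=1))

# count(1d,...): the whole last day, ending now
full_day = evaluation_window(now, timedelta(days=1))

print("1m shifted by 1d:", shifted)
print("1d, no shift:    ", full_day)
```

The first window covers a single minute yesterday, while the second covers the full 24 hours up to now, which is the period the question is actually about.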

You seem to be using the second and third parameters correctly, as well as the operator outside the function.

To get the trigger as you described it, I'd forgo the time_shift and set the first parameter to 1d.

This is probably closer to what you describe:

{my_template:vm.memory.size[free].count(1d,5368709120,lt)}>5

It's important to note, however, that the count function is heavily reliant on how many data points have been gathered in the specified time period, which depends on the item monitoring interval.

In the example below, Zabbix is listing the data gathered for memory in the past 24 hours. Since the interval is set to 30 seconds, that gives 2880 data points.

zabbix screenshot

When you say you want the trigger to fire after the count function returns >5, what that means is that it will fire when more than 5 of the 2880 data points meet the criteria.

This can be >5 points spread throughout the day, or >5 consecutive points, meaning that it happened once, for 2.5 minutes.
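The following Python sketch illustrates why the raw count cannot tell these two situations apart. It assumes the 30-second interval from the example above; count_lt is a hypothetical stand-in for count(1d,5368709120,lt), and the sample values are invented.

```python
FIVE_GIB = 5 * 1024**3   # 5368709120 bytes, the threshold used in the trigger
INTERVAL = 30            # seconds between samples (assumed, as in the example)
SAMPLES_PER_DAY = 24 * 60 * 60 // INTERVAL  # 2880 data points per day

def count_lt(values, threshold):
    """Mimic count(...,threshold,lt): number of samples below the threshold."""
    return sum(1 for v in values if v < threshold)

healthy = 9 * 1024**3    # invented "normal" free-memory reading
low = 4 * 1024**3        # invented reading below the threshold

# Case A: six isolated low readings spread evenly over the day
spread = [healthy] * SAMPLES_PER_DAY
for i in range(0, SAMPLES_PER_DAY, SAMPLES_PER_DAY // 6):
    spread[i] = low

# Case B: one single dip made of six consecutive low readings
burst = [healthy] * SAMPLES_PER_DAY
burst[100:106] = [low] * 6

for name, series in (("spread", spread), ("burst", burst)):
    n = count_lt(series, FIVE_GIB)
    print(name, n, "fires" if n > 5 else "does not fire")
```

Both series produce the same count, so the trigger fires in either case even though the second one represents only a single short event.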

A better idea would probably be to create a new calculated item. Let's call it "5 minute memory dip". I'll give it the key "foo.bar.free.memory.low". It could use this formula:

max(vm.memory.size[free], 5m)<5368709120

It will store a 1 when the highest value for free memory in the last 5 minutes was below 5G, and a 0 otherwise.

Then, create a trigger based on that new item:

{my_template:foo.bar.free.memory.low.count(1d,0,gt)}>5

This trigger will fire when there have been >5 such dips in the past day.
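As a rough illustration of how the two-step approach behaves, here is a Python sketch. It assumes raw samples every 30 seconds and a calculated item that is evaluated once every 5 minutes over non-overlapping windows; the real evaluation schedule depends on the update interval configured for the calculated item. The function names and sample values are hypothetical.

```python
FIVE_GIB = 5 * 1024**3                   # 5368709120 bytes
SAMPLE_INTERVAL = 30                     # seconds between raw samples (assumed)
WINDOW = 5 * 60                          # the 5m window used by max()
PER_WINDOW = WINDOW // SAMPLE_INTERVAL   # 10 raw samples per evaluation

def calculated_item(raw):
    """Assumed model of the calculated item: once per 5-minute window,
    store 1 if the maximum free memory in that window is below 5G, else 0."""
    flags = []
    for start in range(0, len(raw) - PER_WINDOW + 1, PER_WINDOW):
        window = raw[start:start + PER_WINDOW]
        flags.append(1 if max(window) < FIVE_GIB else 0)
    return flags

def count_gt(values, threshold):
    """Mimic count(1d,threshold,gt): number of values above the threshold."""
    return sum(1 for v in values if v > threshold)

healthy, low = 9 * 1024**3, 4 * 1024**3  # invented readings
raw = [healthy] * 2880                   # one day of raw samples
raw[100:106] = [low] * 6                 # brief dip: 6 samples, about 3 minutes
raw[200:210] = [low] * 10                # sustained dip: a full 5-minute window

flags = calculated_item(raw)
raw_low_samples = sum(1 for v in raw if v < FIVE_GIB)
dips = count_gt(flags, 0)

print("raw samples below 5G:", raw_low_samples)
print("5-minute dips counted:", dips)
```

In this made-up data the raw series contains more than 5 low samples, so a trigger on the raw count would fire, while the calculated item records only the one window in which memory stayed below 5G the whole time.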

This method should really cut down on the false positives and more reliably count the real memory dips.