I have fixed this issue.
The cause was that I was using {ITEM.LASTVALUE} in my Action. I changed it to {ITEM.VALUE}, which fixed the issue.
{ITEM.VALUE} is faster than {ITEM.LASTVALUE}.
If you want to know more about this fix, please see the detailed explanation in the Zabbix bug tracker.
I think that the bottleneck is the disk. Here are my reasons:
You have a pretty busy web server.
Zabbix is slow, which I suspect is caused by reads from the disk (it could also be the network).
Run strace again and find the file descriptor that the Zabbix process is using.
Then check whether that file descriptor is a file or a socket:
ls -l /proc/<PID_of_straced_process>/fd/<FD_from_strace>
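The check above can be scripted. This is only a minimal sketch, assuming a Linux system where /proc/<PID>/fd/<FD> is a symlink whose target looks like socket:[inode], pipe:[inode], anon_inode:[...], or an absolute path. The function names classify_fd_target and fd_kind are hypothetical helpers made up for this example, not part of strace or Zabbix.

```shell
#!/bin/sh
# Classify the target string of a /proc/<pid>/fd/<fd> symlink.
# classify_fd_target is a hypothetical helper name for this sketch.
classify_fd_target() {
    case "$1" in
        socket:*)     echo "socket" ;;
        pipe:*)       echo "pipe" ;;
        anon_inode:*) echo "anon_inode" ;;
        /*)           echo "file" ;;
        *)            echo "unknown" ;;
    esac
}

# Resolve the symlink for a given PID ($1) and FD ($2), then classify it.
# Reading another user's process usually requires root.
fd_kind() {
    target=$(readlink "/proc/$1/fd/$2") || return 1
    printf '%s -> %s (%s)\n' "$2" "$target" "$(classify_fd_target "$target")"
}
```

Usage would be something like fd_kind <PID_of_straced_process> <FD_from_strace>. A result of "file" points toward the disk, while "socket" points toward the network.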
EDIT1:
You should not change the TIME_WAIT timeouts. The problem with a short HTTP keep-alive, or with no HTTP keep-alive at all, is that each request then needs a new connection, which increases latency and bandwidth usage. Instead, you should increase the HTTP keep-alive a little and install/enable SPDY.
EDIT2:
Use dstat -ta 10 and compare the first line with the rest. The first line is the average since boot. The following lines are 10-second averages (the last parameter).
EDIT3:
Check whether you are losing packets; use something like smokeping to monitor the server and the website from outside your network. You have a significant number of connections in CLOSING, FIN_WAIT1, FIN_WAIT2, SYN_RECV and LAST_ACK. I think your network is congested or you have a lot of short-lived connections (confirmed by the high TIME_WAIT/ESTABLISHED ratio). See: http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation
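To keep an eye on the connection states and the TIME_WAIT/ESTABLISHED ratio mentioned above, something like the following could be used. It is a rough sketch that assumes Linux-style netstat -tan output, where the first column starts with "tcp" and the sixth column is the state. The names count_tcp_states and time_wait_ratio are hypothetical helpers invented for this example.

```shell
#!/bin/sh
# Count TCP connection states from `netstat -tan`-style input on stdin.
# count_tcp_states is a hypothetical helper name for this sketch.
count_tcp_states() {
    awk '$1 ~ /^tcp/ { n[$6]++ } END { for (s in n) print s, n[s] }' | sort
}

# Print the TIME_WAIT/ESTABLISHED ratio from the same kind of input,
# or "n/a" when there are no ESTABLISHED connections.
time_wait_ratio() {
    awk '$1 ~ /^tcp/ && $6 == "TIME_WAIT"   { tw++ }
         $1 ~ /^tcp/ && $6 == "ESTABLISHED" { est++ }
         END { if (est > 0) printf "%.2f\n", tw / est; else print "n/a" }'
}
```

For example, netstat -tan | count_tcp_states lists each state with its count, and netstat -tan | time_wait_ratio prints the ratio. Running it a few times during peak load gives a better picture than a single snapshot.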
Best Answer
Use log file monitoring. For example:
You can then set a trigger on it, if you wish:
A tutorial is available on the Internet that expands on this a bit.