Zabbix agent – high CPU usage

zabbix

I am monitoring a host with Zabbix and noticed that the Zabbix agent has started using quite a lot of CPU:

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                         
26774 zabbix    20   0 68428 1312  752 R   99  0.0  63:27.67 /usr/sbin/zabbix_agentd                                                                                                                         
26773 zabbix    20   0 68428 1324  764 R   99  0.0  63:26.33 /usr/sbin/zabbix_agentd

There are about 100 items monitored with the agent. They are also monitored on other, identical hosts, where the Zabbix agent does not consume nearly as much CPU. The agents send the collected data to a Zabbix proxy. The agent configuration is the default. The host CPU has 8 cores (2.4 GHz). The shortest check interval among the monitored items is 60 seconds.

I use Zabbix server/agent 1.8.11 and can't upgrade to 2.2, at least not right now.

I checked the debug logs on all sides (Zabbix server, proxy, and agent) and can't find any issues there; just the usual checks being received and sent all the time.
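
For reference, debug logging on the agent side is raised via DebugLevel in zabbix_agentd.conf; a minimal sketch (paths and service name may differ per distribution):

    # Sketch: in /etc/zabbix/zabbix_agentd.conf set
    #     DebugLevel=4
    # then restart the agent and watch its log:
    /etc/init.d/zabbix-agent restart
    tail -f /var/log/zabbix/zabbix_agentd.log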

I don't know how to investigate this issue further and am asking the community for help. How can I trace why the agent is consuming so much CPU?

One more thing that looks strange to me is the network connection statistics:

netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
      2 CLOSE_WAIT
     21 CLOSING
   3521 ESTABLISHED
   2615 FIN_WAIT1
    671 FIN_WAIT2
   1542 LAST_ACK
     14 LISTEN
    256 SYN_RECV
 117841 TIME_WAIT

Thank you.

Update 1.

netstat -tnp|grep zabbix

    tcp        1      0 10.120.0.3:10050        10.128.0.15:53372        CLOSE_WAIT  23777/zabbix_agentd
    tcp        1      0 10.120.0.3:10050        10.128.0.15:53970        CLOSE_WAIT  23775/zabbix_agentd
    tcp        1      0 10.120.0.3:10050        10.128.0.15:53111        CLOSE_WAIT  23776/zabbix_agentd

10.128.0.15 – IP of Zabbix server
10.120.0.3 – IP of Zabbix host

Update 2.

Those TIME_WAIT connections are from the nginx web server.
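
TIME_WAIT sockets carry no owning PID in netstat, so they can be attributed by local port; something along these lines (the web ports 80/443 are an assumption):

    # Count TIME_WAIT sockets per local port; if almost all of them sit on the
    # web ports (80/443) rather than the agent port (10050), they belong to nginx.
    netstat -ant | awk '$6 == "TIME_WAIT" {n = split($4, a, ":"); print a[n]}' | sort | uniq -c | sort -rn | head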

Update 3.

I attached to the Zabbix agent process with strace, and it turns out the agent spends 100% of its time in the read syscall:

strace -C -f -p 23776

Process 23776 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    2.175528        2515       865           read
------ ----------- ----------- --------- --------- ----------------
100.00    2.175528                   865           total

Update 4.

Just to make everything clear… I tried addressing the connections in the TIME_WAIT state. For example, I tried decreasing net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait and net.netfilter.nf_conntrack_tcp_timeout_time_wait to see if it helps. Unfortunately, it did not.
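
For completeness, the tuning was along these lines (the values here are only an example):

    # Shorten the conntrack TIME_WAIT timeouts (example values) and re-check
    # the agent's CPU usage afterwards; in my case this changed nothing.
    sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait=30
    sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
    sysctl net.netfilter.nf_conntrack_tcp_timeout_time_wait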

Conclusion

The Zabbix agent CPU load issue turned out to be related to the number of network connections. If we attach to the zabbix_agentd process with strace, we can see how the CPU cycles are used (the first column is the percentage of CPU time spent in the kernel in each system call):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   15.252232        8646      1764           read
  0.00    0.000000           0         3           write
  0.00    0.000000           0         1           open
...
------ ----------- ----------- --------- --------- ----------------
100.00   15.252232                  1778           total

Here most of the CPU time is spent in read system calls. Further investigation showed that these read calls (two of them are shown below) are continuous attempts to read the /proc/net/tcp file, which lists the kernel's currently open TCP sockets. On average the file contains 70,000-150,000 entries.

8048       0.000068 open("/proc/net/tcp", O_RDONLY) = 7 <0.000066>
8048       0.000117 fstat(7, {st_dev=makedev(0, 3), st_ino=4026531993, st_mode=S_IFREG|0444, st_nlink=1, st_uid=0, st_gid=0, st_blksize=1024, st_blocks=0, st_size=0, st_atime=2013/04/01-09:33:57, st_mtime=2013/04/01-09:33:57, st_ctime=2013/04/01-09:33:57}) = 0 <0.000012>
8048       0.000093 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f30a0d38000 <0.000033>
8048       0.000087 read(7, "  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode    "..., 1024) = 1024 <0.000091>
8048       0.000170 read(7, "                         \n   6: 0300810A:0050 9275CE75:E67D 03 00000000:00000000 01:00000047 0000000"..., 1024) = 1024 <0.000063>
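
The excerpt above was taken with per-call timestamps and durations enabled (the exact strace flags are an assumption), and the size of /proc/net/tcp is easy to confirm directly:

    # Per-call trace with relative timestamps (-r) and call durations (-T);
    # the exact invocation used for the excerpt above is an assumption.
    strace -f -r -T -p <PID_of_zabbix_agentd>

    # /proc/net/tcp holds one line per TCP socket, so with ~100k sockets the
    # agent re-reads several megabytes in 1 KB chunks every time it reads the file.
    wc -l /proc/net/tcp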

Best Answer

I think the bottleneck is the disk. Here are my reasons:

You have a pretty busy web server. Zabbix is slow, and I suspect the slowness comes from reads from the disk (it could also be from the network).

Run strace again and find the file descriptor Zabbix is reading from.

Then find out whether that file descriptor is a file or a socket:

ls -l /proc/<PID_of_straced_process>/fd/<FD_from_strace>
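
A regular file resolves to its path, while a socket shows up as socket:[<inode>]. For example (PID, FD, and inode are placeholders):

    # Resolve the descriptor: a file shows its path, a socket shows "socket:[<inode>]".
    readlink /proc/<PID_of_straced_process>/fd/<FD_from_strace>

    # If it is a socket, the inode can be looked up in the "inode" column
    # of /proc/net/tcp to identify the connection.
    grep <inode> /proc/net/tcp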

EDIT1:
You should not change the TIME_WAIT timeouts. The problem with a small HTTP keep-alive, or no keep-alive at all, is that you increase both latency and bandwidth usage. Instead, increase the HTTP keep-alive a little and install/enable SPDY.

EDIT2: Use dstat -ta 10 and compare the first line with the rest. The first line is the average since boot; the following lines are 10-second averages (set by the last parameter).

EDIT3: Check whether you are losing packets; use something like SmokePing to monitor the server and the website from outside your network. You have a significant number of connections in CLOSING, FIN_WAIT1, FIN_WAIT2, SYN_RECV, and LAST_ACK. I think your network is congested, or you have a lot of short-lived connections (suggested by the high TIME_WAIT/ESTABLISHED ratio). See: http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation
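
A quick local cross-check is the kernel's TCP retransmission counters; something like this, run twice a little apart (counter names vary slightly between kernels):

    # Rising retransmit/timeout counters between two runs point at
    # congestion or packet loss.
    netstat -s | egrep -i 'retrans|timeout'
    sleep 10
    netstat -s | egrep -i 'retrans|timeout'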
