Recently we have been getting Zabbix alerts about our mail system being unavailable, even though the uptime on the machine is 30+ days. I've been tracing the Zabbix logs, and it looks like the Zabbix agent failed to respond to the server in time, which triggered the alert.
To find out whether it was a network issue or something else, I checked /var/log/messages and found the following entries:
Nov 14 21:48:49 iw kernel: INFO: task zabbix_agentd:3316 blocked for more than 120 seconds.
Nov 14 21:48:49 iw kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 14 21:48:49 iw kernel: zabbix_agentd D 0000000000000003 0 3316 3311 0x00000080
Nov 14 21:48:49 iw kernel: ffff880069075c50 0000000000000086 ffffffff817a8d69 ffff880069075c68
Nov 14 21:48:49 iw kernel: ffff880486ea3000 ffff880069075c58 ffffffff8127cb66 0000000000000009
Nov 14 21:48:49 iw kernel: ffff88042085bab8 ffff880069075fd8 000000000000fb88 ffff88042085bab8
Nov 14 21:48:49 iw kernel: Call Trace:
Nov 14 21:48:49 iw kernel: [<ffffffff8127cb66>] ? vsnprintf+0x2b6/0x5f0
Nov 14 21:48:49 iw kernel: [<ffffffff814ffec5>] rwsem_down_failed_common+0x95/0x1d0
Nov 14 21:48:49 iw kernel: [<ffffffff81500056>] rwsem_down_read_failed+0x26/0x30
Nov 14 21:48:49 iw kernel: [<ffffffff8127e664>] call_rwsem_down_read_failed+0x14/0x30
Nov 14 21:48:49 iw kernel: [<ffffffff814ff554>] ? down_read+0x24/0x30
Nov 14 21:48:49 iw kernel: [<ffffffff81140511>] __access_remote_vm+0x41/0x1f0
Nov 14 21:48:49 iw kernel: [<ffffffff81144052>] ? vma_merge+0x1d2/0x3e0
Nov 14 21:48:49 iw kernel: [<ffffffff8114071b>] access_process_vm+0x5b/0x80
Nov 14 21:48:49 iw kernel: [<ffffffff811e295d>] proc_pid_cmdline+0x6d/0x120
Nov 14 21:48:49 iw kernel: [<ffffffff8115c30a>] ? alloc_pages_current+0xaa/0x110
Nov 14 21:48:49 iw kernel: [<ffffffff811e357d>] proc_info_read+0xad/0xf0
Nov 14 21:48:49 iw kernel: [<ffffffff8117b9e5>] vfs_read+0xb5/0x1a0
Nov 14 21:48:49 iw kernel: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
Nov 14 21:48:49 iw kernel: [<ffffffff8117bb21>] sys_read+0x51/0x90
Nov 14 21:48:49 iw kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Kernel info:
Linux mail 2.6.32-279.2.1.el6.x86_64 #1 SMP Fri Jul 20 01:55:29 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
Memory info:
             total       used       free     shared    buffers     cached
Mem:         24031      21497       2533          0        606      14562
-/+ buffers/cache:       6328      17702
Swap:        31999         49      31950
I'm looking for some guidance on where to begin narrowing down the root cause of these issues.
Best Answer
Found this post, not sure if it applies to you or not. http://blog.ronnyegner-consulting.de/2011/10/13/info-task-blocked-for-more-than-120-seconds/
How much CPU do you have? It looks like you have quite a bit of memory (24GB). If the blog post is correct, then your system may not be able to flush memory from the cache fast enough to keep up with the I/O you have coming in.
You can set "vm.dirty_ratio=10" in /etc/sysctl.conf to force it to flush sooner (run "sysctl -p" afterwards to apply it without a reboot). This may help your issue.
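
To get a feel for what that setting changes, here is a rough Python sketch. It reads the current value from /proc/sys and estimates how much dirty (unwritten) data the kernel lets pile up before it makes writers wait. Treat the numbers as approximations: the 17702 MB figure is just the "-/+ buffers/cache" free value from your output used as a stand-in for the memory the kernel counts as dirtyable, the value 20 is what I assume your current vm.dirty_ratio is (check it on your box), and the function names are made up for this example.

```python
def read_sysctl(name, proc_root="/proc/sys"):
    """Return an integer sysctl value such as vm.dirty_ratio, or None if unreadable."""
    path = proc_root + "/" + name.replace(".", "/")
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (IOError, OSError, ValueError):
        return None


def dirty_limit_mb(dirtyable_mb, dirty_ratio_percent):
    """Approximate MB of dirty pages allowed before writers are throttled."""
    return dirtyable_mb * dirty_ratio_percent // 100


if __name__ == "__main__":
    # Assumption: 17702 MB approximates dirtyable memory (free + reclaimable cache).
    dirtyable_mb = 17702

    current = read_sysctl("vm.dirty_ratio")
    print("current vm.dirty_ratio: %s" % current)

    # Assumption: 20 is the value currently in effect; 10 is the proposed value.
    for ratio in (20, 10):
        limit = dirty_limit_mb(dirtyable_mb, ratio)
        print("dirty_ratio=%d -> roughly %d MB of dirty pages allowed" % (ratio, limit))
```

Under those assumptions, dropping the ratio from 20 to 10 would roughly halve the amount of data that can be waiting for disk at once, so each flush has less to write and other tasks should spend less time stuck behind it.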