CentOS – snmpd becomes unresponsive (CentOS 6)

centos, net-snmp, snmp

In the last couple of days, a CentOS 6.7 mail server that I am monitoring has been continuously timing out on SNMP queries. No changes were made to the server in the period immediately preceding the change of behaviour (I know…), so I am inclined to blame "something" in the environment.

If I restart the daemon, it is responsive again for anywhere from a few minutes to a couple of hours, then it starts timing out again. This also happens for queries run from the machine itself, as in

# snmpstatus -v1 -c public localhost

(so no network issues in the picture). I have nothing of note in dmesg and the only things I can see in /var/log/messages – that are not ordinary snmp connect traces – are occasional:

Mar 22 17:34:53 turnip snmpd[31053]: read:Interrupted system call

lines which appear related to me restarting the daemon.

I tried to strace snmpd and I can see it waiting in what appears to be a select/receive loop – when unresponsive, it never gets out of there and it does not write anything in the logs – it is as if packets are not delivered to the daemon. But rebooting the machine has no effect.

Tweaking the open files limit and surveying other possible resource limits has also been ineffective, and the machine itself is not particularly stressed. So I am currently out of clues.

I can post snmpd.conf if needed.

TIA & cheers

Edit: this is what the traced loop looks like (while unresponsive):

select(15, [14], NULL, NULL, {0, 27618}) = 0 (Timeout)
select(15, [14], NULL, NULL, {1, 0})    = 0 (Timeout)
select(15, [14], NULL, NULL, {1, 0})    = 0 (Timeout)
select(15, [14], NULL, NULL, {1, 0})    = 0 (Timeout)
select(15, [14], NULL, NULL, {1, 0})    = 0 (Timeout)
select(15, [14], NULL, NULL, {1, 0})    = 0 (Timeout)
select(15, [14], NULL, NULL, {1, 0})    = 0 (Timeout)
select(15, [14], NULL, NULL, {1, 0})    = 0 (Timeout)
select(15, [14], NULL, NULL, {1, 0})    = 0 (Timeout)

Best Answer

It turns out that my snmpd daemon runs a number of (shell) commands, specified in the extensible section of snmpd.conf. One of those (for reasons yet to be determined) started to wedge now and then. The stupid snmpd daemon got stuck reading from that command and the whole shebang timed out.
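One way to keep a wedged command from blocking its caller is to run it under a time limit. The following is only a minimal sketch, assuming GNU coreutils `timeout` is installed; the function name `run_limited` and the limits used are hypothetical and not taken from the configuration described here.

```shell
#!/bin/bash
# Hypothetical wrapper (not from the original setup): run a command with a
# time limit so that a wedged script cannot block its caller forever.
# Usage: run_limited SECONDS COMMAND [ARGS...]
run_limited() {
    local limit=$1
    shift
    # timeout(1) sends SIGTERM to the command after $limit seconds and
    # then exits with status 124.
    timeout "$limit" "$@"
    local rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}
```

The command listed in snmpd.conf could then be a small script that calls `run_limited` around the real check, so that snmpd gets an answer (or an error) within a bounded time. How that script is laid out is an assumption left to the reader.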

The way I found out may be of interest.

1) find snmpd's pid:

# pidof snmpd
124567

2) strace it:

# strace -p124567
select(15, [14], NULL, NULL, {1, 0})    = 0 (Timeout)
select(15, [14], NULL, NULL, {1, 0})    = 0 (Timeout)

3) The [14] in the select call is the set of descriptors snmpd is waiting to read from, so 14 is the file descriptor on which snmpd is stuck. Now find its inode:

# ls -l /proc/124567/fd
total 0
lrwx------ 1 root root 64 Mar 23 10:41 0 -> /dev/null
lrwx------ 1 root root 64 Mar 23 10:41 1 -> /dev/null
[...]
lr-x------ 1 root root 64 Mar 23 10:41 14 -> pipe:[6200340]

4) Now find the process(es) at the other end of the pipe identified by inode 6200340. This script - invoked with the inode as argument - is useful for the purpose:

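#!/bin/bash
# Usage: pass the pipe inode as the only argument, e.g. 6200340

# Only numeric entries are scanned, which skips /proc/self
for i in /proc/[0-9]*/fd; do
    # List the descriptors of each process and look for the pipe inode
    found=$(ls -l "$i" 2>/dev/null | grep -F "pipe:[$1]")
    if [[ -n $found ]]; then
        pid=$(basename "$(dirname "$i")")
        name=$(ps -p "$pid" -o comm=)
        echo "$name ($pid)"
    fi
done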
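Steps 3 and 4 can also be chained, so that only the pid and the descriptor number from the strace output are needed. This is a rough sketch, assuming a Linux /proc filesystem and enough privileges (typically root) to read other processes' fd directories; the function name `pipe_peers` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the processes holding the other end of a pipe.
# Usage: pipe_peers PID FD
pipe_peers() {
    local pid=$1 fd=$2 target link p
    # For a pipe descriptor, readlink yields something like "pipe:[6200340]"
    target=$(readlink "/proc/$pid/fd/$fd") || return 1
    case $target in
        pipe:*) ;;
        *) echo "fd $fd of $pid is not a pipe: $target" >&2; return 1 ;;
    esac
    for link in /proc/[0-9]*/fd/*; do
        # Keep only descriptors that point at the same pipe
        [ "$(readlink "$link" 2>/dev/null)" = "$target" ] || continue
        p=${link#/proc/}
        p=${p%%/*}
        # Skip the process we started from
        [ "$p" = "$pid" ] && continue
        echo "$(ps -p "$p" -o comm=) ($p)"
    done | sort -u
}
```

With the values from the trace above, this would be invoked as `pipe_peers 124567 14`.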