SNMPD running but randomly not listening for connections

Tags: centos5, monitoring, opennms, snmp

OS: CentOS release 5.7 (Final)
Net-SNMP: net-snmp-5.3.2.2-14.el5_7.1 (from RPM)

Periodically my NMS notifies me that SNMP has gone down on this machine. The service comes back on its own within 10 to 30 minutes. My NMS also pings the machine and checks SSH, and those services are not affected during the SNMP outages.

The snmpd log file shows that it is working and apparently receiving packets (both from local agents at 127.0.0.1 and from my NMS at 172.16.37.37); however, attempting to snmpwalk it, either locally or from the NMS system, fails with a timeout.
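For reference, the checks that fail look like this during an outage (a sketch; the community strings are taken from the config below, and the timeout/retry values are arbitrary):

# Locally on the affected server
snmpwalk -v 2c -c public -t 5 -r 1 127.0.0.1 system

# From the NMS at 172.16.37.37 (replace SERVER-IP with the problem host)
snmpwalk -v 2c -c 3adRabRu -t 5 -r 1 SERVER-IP system

# While a walk is timing out, check whether snmpd still holds UDP port 161
netstat -lnup | grep ':161'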

I have 7 of these servers running a mixture of CentOS 5.7 and RHEL 5.7 with this exact version of Net-SNMP installed from RPM, and none of them has this issue except this one. Five of the machines (including the NMS system and the problem server) are in the same rack, connected to the same switch.

Restarting snmpd does not fix the issue; it eventually clears up by itself. Any suggestions on where I can begin diagnosing this? It's a closed subnet, so iptables is not used. My snmpd config is below:

# Following entries were added by HP Insight Management Agents at
#      Tue May 15 10:58:17 CLT 2012
dlmod cmaX /usr/lib64/libcmaX64.so
rwcommunity public 127.0.0.1
rocommunity public 127.0.0.1
rwcommunity 3adRabRu 172.16.37.37
rocommunity 3adRabRu 172.16.37.37
rwcommunity 3adRabRu 172.16.37.36
rocommunity 3adRabRu 172.16.37.36
trapcommunity callmetraps
trapsink 172.16.37.37 callmetraps
trapsink 172.16.37.36 callmetraps
syscontact Lukasz Piwowarek
syslocation Santiago, Chile
# ---------------------- END --------------------
agentAddress udp:161
com2sec rwlocal default public
com2sec rolocal default public
com2sec subnet  default 3adRabRu
group   rwv2c   v2c             rwlocal
group   rov2c   v2c             rolocal
group   rov2c   v2c             subnet
view    all     included        .1
access  rwv2c   ""      any             noauth          exact   all     all     none
access  rov2c   ""      any             noauth          exact   all     none    none

Best Answer

There are a few issues to address on this one.

Looking at your config, I see OpenNMS as the monitoring solution, HP ProLiant server hardware, possible package-version and driver issues, and a couple of tweaks you could make to your snmpd options.

Are you on the most recent version of OpenNMS? The current release is 1.10.3. Is the machine you're polling the NMS system itself, or an unrelated host? Was this a problem with an older version of OpenNMS, or is this a new installation?

I also see the HP ProLiant Management Agents module loaded in the first line of your snmpd.conf. That module feeds the ProLiant Support Pack and HP health agents. Is this the only HP server you're monitoring? To test the HP SNMP setup, can you access the System Management Homepage at https://server.ip:2381 ? Do the system sensors (temperature, storage, iLO) show up properly? If they don't, there's a problem with your SNMP setup.
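One quick isolation test, assuming (and this is only an assumption) that the cmaX module is what's wedging the agent: comment out the dlmod line, restart snmpd, and see whether the random outages stop. Note that HP health data disappears from SNMP while the module is disabled.

# /etc/snmp/snmpd.conf -- temporarily disable the HP agent module
#dlmod cmaX /usr/lib64/libcmaX64.so

service snmpd restart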

On the OpenNMS side, the poller has very flexible logging options. We can help you get the information you need, but I don't think this is a general OpenNMS problem if it's only affecting one node. You could remove the node from the database and rediscover it to test this theory.
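If deleting the node feels heavy-handed, OpenNMS 1.x can also be asked to rescan it in place. A sketch, assuming a stock install under /opt/opennms and a hypothetical node ID of 42; check send-event.pl --help for the exact flag spelling on your version:

# Ask capsd to rescan node 42 (node ID is hypothetical; look it up in the web UI)
/opt/opennms/bin/send-event.pl --nodeid 42 uei.opennms.org/internal/capsd/forceRescan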

For the host in question, you may want to edit /etc/sysconfig/snmpd.options to reduce log verbosity in case that's an issue.
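For example, a quieter setup might look like this (a sketch; the stock CentOS 5 options log every incoming connection, and you should verify the -LS syntax against snmpcmd(1) for net-snmp 5.3):

# /etc/sysconfig/snmpd.options -- log only warnings and above to syslog
# (the stock file logs every connection via -Lsd; adjust to taste)
OPTIONS="-LS4d -p /var/run/snmpd.pid"

service snmpd restart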


My guess is that this is either an OpenNMS polling/database issue or an interaction between the HP agents and snmpd on the single problem system.