OS: CentOS release 5.7 (Final)
Net-SNMP: net-snmp-5.3.2.2-14.el5_7.1 (from RPM)
Periodically my NMS notifies me that SNMP has gone down on this machine. The service recovers on its own within 10 to 30 minutes. My NMS also pings the host and checks SSH, and those services are not affected during the SNMP outage.
The snmpd log file shows that the daemon is running and apparently receiving packets (either from local agents at 127.0.0.1 or from my NMS at 172.16.37.37); however, attempting an snmpwalk locally or from the NMS system fails with a timeout.
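For reference, the failing check can be reproduced by hand. This is only a sketch of the kind of walk that times out, using the localhost read-only community from the config below; the subtree and timeout values are my own illustrative choices:

```shell
#!/bin/sh
# Reproduce the failing poll by hand. Community "public" and target
# 127.0.0.1 come from the snmpd.conf in the question; the subtree
# (system, .1.3.6.1.2.1.1) and the timeout values are illustrative.
if command -v snmpwalk >/dev/null 2>&1; then
    # -t 2 -r 1: 2-second timeout, one retry, so a hung agent fails fast
    result=$(snmpwalk -v 2c -c public -t 2 -r 1 127.0.0.1 .1.3.6.1.2.1.1 2>&1) \
        || result="walk failed or timed out: $result"
else
    # snmpwalk is provided by net-snmp-utils on CentOS/RHEL
    result="snmpwalk not installed"
fi
printf '%s\n' "$result"
```

During an outage this walk times out even though snmpd is still logging received packets.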
I have 7 of these servers running a mixture of CentOS 5.7 and RHEL 5.7 with this specific version of Net-SNMP installed from RPM; none of them has this issue except this one. Five of the machines (including the NMS system and this problem server) are in the same rack, connected to one switch.
Restarting snmpd does not fix the issue; it eventually clears up by itself. Any suggestions on where to begin diagnosing this? It's a closed subnet, so iptables is not used. snmpd config below:
# Following entries were added by HP Insight Management Agents at
# Tue May 15 10:58:17 CLT 2012
dlmod cmaX /usr/lib64/libcmaX64.so
rwcommunity public 127.0.0.1
rocommunity public 127.0.0.1
rwcommunity 3adRabRu 172.16.37.37
rocommunity 3adRabRu 172.16.37.37
rwcommunity 3adRabRu 172.16.37.36
rocommunity 3adRabRu 172.16.37.36
trapcommunity callmetraps
trapsink 172.16.37.37 callmetraps
trapsink 172.16.37.36 callmetraps
syscontact Lukasz Piwowarek
syslocation Santiago, Chile
# ---------------------- END --------------------
agentAddress udp:161
com2sec rwlocal default public
com2sec rolocal default public
com2sec subnet default 3adRabRu
group rwv2c v2c rwlocal
group rov2c v2c rolocal
group rov2c v2c subnet
view all included .1
access rwv2c "" any noauth exact all all none
access rov2c "" any noauth exact all none none
Best Answer
There are a few issues to address on this one.
Looking at your config, I see OpenNMS as the monitoring solution, HP ProLiant server hardware, possible package version and driver issues, and a couple of tweaks you could possibly make to your snmpd options.
Are you on the most recent version of OpenNMS? The current revision is 1.10.3. Is the machine you're polling the NMS system itself or an unrelated node? Was this a problem with an older version of OpenNMS, or is this a new installation?
I also see a module for the HP ProLiant Management Agents loaded in the first line of your snmpd.conf. That feeds the ProLiant Support Pack and HP health agents. Is this the only HP server you're monitoring? To test the HP SNMP config, can you access the System Management Homepage at https://server.ip:2381 ? Do the system sensors (temperature, storage, iLO) show up properly? If they don't, there's a problem with your SNMP setup.

On the OpenNMS side, there are incredibly flexible logging options available for the poller. We can help you get the info you need, but I don't think this is a general OpenNMS problem if it's only affecting one node. You could remove the node from the database and rediscover it to test this theory.
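If the SMH pages look wrong, you can also probe the cmaX module directly over SNMP. A sketch, assuming the localhost "public" community from the posted config; `.1.3.6.1.4.1.232` is the HP/Compaq enterprise subtree the health agents populate:

```shell
#!/bin/sh
# Walk the HP/Compaq enterprise subtree served by the cmaX dlmod.
# Community and address come from the question's snmpd.conf; the
# timeout values are illustrative.
if command -v snmpwalk >/dev/null 2>&1; then
    hp_result=$(snmpwalk -v 2c -c public -t 2 -r 1 127.0.0.1 .1.3.6.1.4.1.232 2>&1) \
        || hp_result="HP subtree walk failed: $hp_result"
else
    hp_result="snmpwalk not installed"
fi
printf '%s\n' "$hp_result"
```

If standard MIB-2 OIDs answer but this subtree hangs, suspicion falls on the cmaX module rather than snmpd itself.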
For the host in question, you may want to edit /etc/sysconfig/snmpd.options to reduce log verbosity in case that's an issue.

My guess is that it's an OpenNMS polling/DB issue, or that it's the interaction of the HP agents and snmpd on the single problem system.
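As a sketch of what that edit might look like: the stock RHEL 5 file usually ships with `-Lsd` (log everything to syslog), and swapping in a bounded priority range quiets it down. Verify the exact `-LS` spelling against snmpcmd(1) for your net-snmp 5.3 build:

```
# /etc/sysconfig/snmpd.options (illustrative; check flags against snmpcmd(1))
# -LS0-4d: log only priorities 0-4 (emerg..warning) to the syslog
#          "daemon" facility, instead of -Lsd, which logs everything
OPTIONS="-LS0-4d -Lf /dev/null -p /var/run/snmpd.pid -a"
```

Restart snmpd after changing the file for the new options to take effect.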