Timeout errors from nagios / SNMP

nagios

I am monitoring ~100 remote hosts over a VPN using check_snmp_process.pl. For many months this has worked just fine. Over the weekend I started seeing "ERROR: Alarm signal (Nagios time-out)" errors from just about every host/process. I can run the same command on the command line and get a successful response, so I'm not clear why it would time out under normal usage.

This morning I tried raising the plugin's 'timeout' parameter to 20 seconds. For about an hour this appeared to work, then within a matter of minutes the failure rate returned to its previous level.

The VPN server doesn't appear to be under any abnormal load, and neither does the nagios machine.

Suggestions on where else to look for the source of this?

Nagios machine: CentOS 6.5
Nagios version: 3.5.1
Plugin version: 1.10


EDIT: When the 'mass timeout' happens, it all happens within a few seconds. Every host shows the same time (±5 seconds) on the report. This may be due to nagios forcing rechecks of 'orphaned processes' after a restart of the service. Not sure yet. It just seems ominous when 40-50 timeouts hit the log all at once.

Best Answer

I had the same issue. It went away after I edited check_snmp_process.pl and raised the script's timeout from 15 to 40: my $TIMEOUT = 40;
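
Before picking a new value, it may help to measure how long the check really takes, since a run that succeeds on the command line can still be slower than the limit in the script. The following is only a rough sketch: time_check is a hypothetical helper (not part of Nagios or the plugin), and the host, community string and process name in the usage comment are placeholders.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nagios or the plugin):
# run a check command, show its output, then print one summary line
# saying whether it took at least <limit_seconds> to finish.
# Usage: time_check <limit_seconds> <command> [args...]
time_check() {
    limit=$1
    shift
    start=$(date +%s)
    "$@"                       # run the check exactly as given
    status=$?
    end=$(date +%s)
    elapsed=$((end - start))   # whole seconds only, so treat it as approximate
    if [ "$elapsed" -ge "$limit" ]; then
        echo "SLOW ${elapsed}s exit=${status}"
    else
        echo "OK ${elapsed}s exit=${status}"
    fi
}

# Example with placeholder values (host, community and process name are made up):
# time_check 15 ./check_snmp_process.pl -H 10.0.0.5 -C public -n httpd -t 20
```

A run reported as SLOW against a limit of 15 would be consistent with the script's own alarm firing before SNMP answers. Running it as the nagios user (for example via sudo -u nagios) should get closer to what the daemon sees, although that is an assumption about your setup.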