SSH – Cause of flapping UNKNOWN Nagios status

Tags: monitoring, nagios, redhat, ssh

We run some Nagios service checks via OpsView, and one of our hosts is getting a strange response for SSH:

"UNKNOWN: Service results are stale"

It happens regularly, but clears once the system retries a second or third time. It started after the server in question was patched and rebooted last week. The host itself accepts SSH connections from every box I've tested from, though that doesn't include the monitoring system, which I have no access to.

/var/log/secure is full of lines like:

sshd[15628]: Did not receive identification string from xxx.xxx.226.20

The timestamps are reliably five minutes apart, which is pretty obviously the monitoring script disconnecting once it gets a login prompt.

Anyone know what might be causing this, or how to fix it? It's really frustrating to see this pop on and off the status page.

Best Answer

"Did not receive identification string" is what you'll get from sshd any time someone connects then disconnects without attempting the SSH handshake (which is what the Nagios SSH check does) -- so that's nothing to worry about.
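To illustrate, here is a minimal Python sketch of that connect, read the banner, disconnect pattern. It is not the actual Nagios plugin source; the function name read_ssh_banner and its defaults are made up for this example. The point is only that the client never sends an identification string of its own, which is what sshd is noting in the log.

```python
import socket


def read_ssh_banner(host: str, port: int = 22, timeout: float = 10.0) -> str:
    """Connect, read the server's identification line, then hang up.

    Nothing is ever sent to the server, so from sshd's point of view the
    client disconnected without identifying itself.
    (Hypothetical helper for illustration, not the real check plugin.)
    """
    data = b""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Read until the first newline, the peer closing, or a sane size cap.
        while b"\n" not in data and len(data) < 255:
            chunk = sock.recv(255 - len(data))
            if not chunk:
                break
            data += chunk
    # Keep only the first line, e.g. "SSH-2.0-OpenSSH_x.y".
    first_line = data.split(b"\n", 1)[0]
    return first_line.decode("ascii", errors="replace").strip()


# Example (hostname is a placeholder):
#   print(read_ssh_banner("server.example.com"))
```

Pointing something like this at an sshd that logs this way should leave the same kind of entry in /var/log/secure, one per connection.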

As for the "stale results" message, it looks like you're using passive checks, which wouldn't be my first choice for an SSH check, though perhaps the OpsView integration demands it. At any rate, a prematurely stale check result means passive results aren't arriving often enough for Nagios' liking. Either tell whatever is feeding the check results to send them more often, or tell Nagios to be less picky about how often it expects them: raise freshness_threshold above its current value, or, if it isn't defined yet, set it to something greater than 300 (seconds, i.e. 5 minutes).
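To see why a threshold right at the reporting interval tends to flap, here is a rough Python sketch. It is a simplified model, not Nagios' actual freshness logic, and the arrival times are invented: it just flags any gap between consecutive results that is longer than the threshold.

```python
def stale_gaps(arrival_times, freshness_threshold):
    """Return (previous, next) arrival pairs whose gap exceeds the threshold.

    Simplified model: a result counts as stale once more than
    `freshness_threshold` seconds pass before the next one arrives.
    """
    times = sorted(arrival_times)
    return [
        (earlier, later)
        for earlier, later in zip(times, times[1:])
        if later - earlier > freshness_threshold
    ]


# Hypothetical arrival times (seconds) for results sent roughly every 5 minutes;
# the third one shows up 12 seconds late.
arrivals = [0, 300, 612, 905]

tight = stale_gaps(arrivals, 300)   # threshold equal to the interval
loose = stale_gaps(arrivals, 600)   # threshold with some headroom
```

With the threshold at 300, the one late result is enough to trip the stale state (tight is [(300, 612)]), while 600 leaves room for the delay (loose is empty).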