Linux – Nagios (Return code of 141 is out of bounds) on random services

debian linux monitoring nagios openvz

We have had Nagios running on one of our servers without any problems for a while, but lately random service checks report (Return code of 141 is out of bounds).

The load on the server has risen because we went online with our service, but it is still not high (load average max: 0.7). Before the launch, everything in Nagios worked fine.

As the image shows, Current Load returns code 141. Two minutes earlier, Beancounters VZ returned 141. This happens irregularly. Only HTTP and PING never return 141; they don't rely on NRPE.

http://pic-hoster.net/view/45030/ScreenShot2012-05-28at5.31.35PM.png

I noticed that when I execute the command from my Nagios host against the problematic client, the output sometimes gets lost:

root@xxx23:/usr/local/nagios/libexec# ./check_nrpe -H 123.123.123.123 -c check_apt 
APT OK: 0 packages available for upgrade (0 critical updates). 
root@xxx23:/usr/local/nagios/libexec# ./check_nrpe -H 123.123.123.123 -c check_apt 
root@xxx23:/usr/local/nagios/libexec# ./check_nrpe -H 123.123.123.123 -c check_apt 
APT OK: 0 packages available for upgrade (0 critical updates). 

This does not happen if I execute it directly on the client.
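To put a number on how often the output gets lost, the manual test above can be repeated in a loop. The following is only a sketch: `tally_statuses` is a hypothetical helper name, and it assumes that a run with lost output also ends with a non-zero exit status (141 if check_nrpe was terminated by SIGPIPE).

```shell
#!/bin/sh
# Hypothetical helper: run a command repeatedly and tally its exit statuses.
# Usage: tally_statuses RUNS COMMAND [ARGS...]
tally_statuses() {
    runs=$1
    shift
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the plugin output; only the exit status is of interest here.
        "$@" >/dev/null 2>&1
        echo "$?"
        i=$((i + 1))
    done | sort -n | uniq -c
}

# Example call, using the path, host and command from the transcript above:
# tally_statuses 50 /usr/local/nagios/libexec/check_nrpe -H 123.123.123.123 -c check_apt
```

Each output line is a count followed by an exit status, so a healthy client should show only status 0 for check_apt, while an affected one would show a second line for the failing status.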

What I have done:

  • I increased the OpenVZ memory and CPUUnit for this container.
  • I updated to the latest Nagios, 3.4.1 (from source).
  • I executed the Nagios checks locally through NRPE and never got 141 or anything similar back.

I had the same issue some months ago with another server. I never found the cause and reinstalled the server; it works now.

Does anyone have an idea?

UPDATE

I think I found it; it has not happened for an hour.

SIGPIPE was a good tip. I had assumed the cause was in the system rather than in Nagios.

I tuned the OpenVZ configuration and limits. I will report back if it stays stable.

Best Answer

We had a similar problem where one service checked via NRPE in a container returned an expected WARNING, then after some minutes the same service returned CRITICAL with the 141/SIGPIPE error. On the next check it returned WARNING then CRITICAL, then WARNING and so on.

I performed a traffic capture for the error and found that Nagios issue #305 describes quite precisely what I had observed. It seems to be caused by an unclean connection close on the NRPE server side while using SSL (SSL_shutdown()). This makes the server send a TCP RST to the client, which causes an aborted read and thus the SIGPIPE.

Applying the patch nrpe-ssl_shutdown-2.patch attached to the issue report to the NRPE source, rebuilding and reinstalling/restarting it seemed to stop the problem from repeating, and warnings are now reported normally without critical errors.
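The link between the number 141 and SIGPIPE follows from the usual shell convention: a process terminated by signal N is reported with status 128 + N, and SIGPIPE is signal 13 on Linux. Nagios plugins are only expected to return 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN), which is why anything else is flagged as out of bounds. A minimal sketch of that decoding, with `decode_status` as a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: interpret an exit status as Nagios or a shell would see it.
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # Shell convention: 128 + N means "terminated by signal N".
        sig=$((status - 128))
        echo "terminated by signal $sig ($(kill -l "$sig"))"
    elif [ "$status" -le 3 ]; then
        # 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
        echo "valid plugin status $status"
    else
        echo "exit status $status (outside the plugin range 0-3)"
    fi
}

decode_status 141
```

The same arithmetic applies to other "out of bounds" codes above 128, so it can help to decide whether a check failed on its own or was killed by a signal.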
