Linux – Nagios (Return code of 141 is out of bounds) on random services

debian linux monitoring nagios openvz

We have had Nagios running on one of our servers without any problems for a while, but lately we get (Return code of 141 is out of bounds).

The load on the server has risen since we went live with our service, but it is still not high (load average max: 0.7). Before the launch, everything in Nagios worked fine.

As the screenshot shows, Current Load returns code 141. Two minutes earlier, the Beancounters VZ check returned 141. This happens irregularly. Only HTTP and PING never return 141; they do not rely on NRPE.

http://pic-hoster.net/view/45030/ScreenShot2012-05-28at5.31.35PM.png

I noticed that when I run the command from my Nagios host against the problematic client, the response sometimes gets lost:

root@xxx23:/usr/local/nagios/libexec# ./check_nrpe -H 123.123.123.123 -c check_apt 
APT OK: 0 packages available for upgrade (0 critical updates). 
root@xxx23:/usr/local/nagios/libexec# ./check_nrpe -H 123.123.123.123 -c check_apt 
root@xxx23:/usr/local/nagios/libexec# ./check_nrpe -H 123.123.123.123 -c check_apt 
APT OK: 0 packages available for upgrade (0 critical updates).

This does not happen when I run the check directly on the client.
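To get a rough idea of how often the response is lost, the manual test above can be repeated in a loop. The following is only a sketch: count_empty is a hypothetical helper name, not part of Nagios or NRPE, and it simply counts runs that print nothing on stdout.

```shell
#!/bin/sh
# Run a command N times and report how often it produced no output.
# count_empty is a hypothetical helper, not a Nagios/NRPE tool.
# Usage: count_empty N command [args...]
count_empty() {
    runs=$1
    shift
    empty=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # Capture stdout only; remember the exit status of the command.
        out=$("$@" 2>/dev/null)
        rc=$?
        if [ -z "$out" ]; then
            empty=$((empty + 1))
            echo "run $i: no output, exit status $rc"
        fi
        i=$((i + 1))
    done
    echo "$empty of $runs runs returned no output"
}
```

For example, `count_empty 50 ./check_nrpe -H 123.123.123.123 -c check_apt` run from the libexec directory on the Nagios host would list every empty run together with its exit status.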

What I have done so far:

I increased the OpenVZ memory and CPUUnit for this container.
I updated to the latest Nagios, 3.4.1 (built from source).
I executed the Nagios checks locally through NRPE and never got a 141 or any other error back.

I had the same issue some months ago with another server. I never found the cause and reinstalled the server; it works now.

Does anyone have an idea?

UPDATE

I think I found it; it has not happened for an hour.

SIGPIPE was a good tip. I had assumed it was something with the system, not with Nagios.

I tuned the OpenVZ configuration and limits. I will report back if it stays stable.
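The update does not say which limits were changed. One common way to see whether a container is running into an OpenVZ limit is to look for non-zero failcnt values in /proc/user_beancounters. The sketch below assumes the usual layout of that file (failcnt in the last column, resource name six columns from the end); failed_beancounters is a hypothetical helper name.

```shell
#!/bin/sh
# List OpenVZ beancounter resources whose failcnt column is non-zero.
# Assumption: failcnt is the last column and the resource name sits six
# columns from the end (the first row of each container has an extra
# leading "uid:" field). failed_beancounters is a hypothetical helper.
failed_beancounters() {
    file=${1:-/proc/user_beancounters}
    awk 'NF >= 6 && $NF ~ /^[0-9]+$/ && $NF > 0 {
             print $(NF-5), "failcnt=" $NF
         }' "$file"
}
```

Run as root inside the container (or on the hardware node), `failed_beancounters` with no argument reads /proc/user_beancounters. Any resource it prints has been refused at least once since the container started.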

Best Answer

We had a similar problem where one service checked via NRPE in a container returned an expected WARNING, then after some minutes the same service returned CRITICAL with the 141/SIGPIPE error. On the next check it returned WARNING then CRITICAL, then WARNING and so on.

I captured the traffic during an error and found that Nagios issue #305 describes quite precisely what I had observed. It seems to be caused by an unclean connection close on the NRPE server side when SSL is used (SSL_shutdown()): the server sends a TCP RST to the client, which aborts the client's read and thus leads to the SIGPIPE.
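The link between 141 and SIGPIPE follows from the shell convention that a process killed by signal N is reported with exit status 128 + N, and SIGPIPE is signal 13 on Linux. Nagios only accepts plugin states 0 to 3, hence "out of bounds". A minimal sketch of that decoding, with decode_status as a hypothetical helper name:

```shell
#!/bin/sh
# Decode a plugin exit status as Nagios and the shell would see it.
# decode_status is a hypothetical helper, not part of Nagios.
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # By convention, status - 128 is the signal that killed the process;
        # kill -l <number> prints the signal name without the SIG prefix.
        echo "killed by SIG$(kill -l $((status - 128)))"
    elif [ "$status" -le 3 ]; then
        echo "valid Nagios state $status"
    else
        echo "out of bounds, not a signal status"
    fi
}
```

On a Linux system, `decode_status 141` should report SIGPIPE, which matches the error shown in the question.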

Applying the patch nrpe-ssl_shutdown-2.patch attached to the issue report to the NRPE source, rebuilding and reinstalling/restarting it seemed to stop the problem from repeating, and warnings are now reported normally without critical errors.

Related Solutions

Nagios no data returned from plugin, check_apt only

The problem I discovered was in the service definition, although I think there is either a bug or an option in Nagios not being specified. The debug output showed the actual command line being run against NRPE:

/usr/lib/nagios/plugins/check_nrpe -H server.mechsoft-vps1.com -c check_mysqld -a

The problem here is that -a requires a parameter. The check however does not. Changing the service definition to add a parameter fixed the problem.

define service{
    use                             generic-service         
    host_name                       development
    service_description             APT
    check_command                   check_nrpe!check_apt!1
    }

Powershell – Passing arguments to a Powershell Script using Nagios NRPE

Arguments can be allowed on multiple levels (depending on how you want to slice your "security"). In essence this means you can allow arguments at the NRPE level as well as the external scripts level (and in your case you probably want them in both places)

You can find some background details here: http://docs.nsclient.org/howto/external_scripts.html#arguments

But leaving the theory aside, to answer your question: you need to enable "allow arguments" in TWO places (see the following):

[/settings/NRPE/server]
allow arguments=true

[/settings/external scripts]
allow arguments=true

[/settings/external scripts/scripts]
foo=scripts\\foo.bat "argument 1" "argument 2"

The one you're missing is the second one:

[/settings/external scripts]
allow arguments=true

So adding that would resolve your issue.

Edit: Add information about the second issue.

The secondary problem with the PowerShell launch (see the comment on this post) was related to a PowerShell oddity that requires a rather intricate command line syntax:

[/settings/external scripts/scripts]
test = cmd /c echo scripts\test.ps1 "$ARG1$"; exit($lastexitcode) | powershell.exe -command -

The problem was the missing - at the end of the command (correct command above).
