Nagios/nrpe giving “Return code of 255 is out of bounds”

nagios nrpe

I have the following service set up for nagios:

define service {
  hostgroup_name             LNX
  service_description        /tmp Disk Usage
  check_command              check_nrpe!check_disk!-a '-w 20% -c 10% -p /tmp'
  check_interval             1
  max_check_attempts         3
  retry_interval             1
  check_period               24x7
  notification_interval      2
  notification_period        24x7
  notification_options       c,r,w
  notifications_enabled      0
  contact_groups             devops
}

Which ties to the following command:

define command {
 command_name     check_nrpe
 command_line     $USER1$/check_nrpe -H $HOSTADDRESS$ -u -t 60 -c $ARG1$ $ARG2$
}

So in the end what's being executed (and its output when run on command line) is:

$: /usr/local/nagios/libexec/check_nrpe -H <my host> -u -t 60 -c check_disk -a '-w 20% -c 10% -p /tmp'
DISK OK - free space: /tmp 4785 MB (97% inode=99%);| /tmp=124MB;3928;4419;0;4910

Following this with echo $? yields a 0, meaning OK/success.

However, nagios is reporting this as "error code 255 out of bounds" and I'm not sure why.

Running the check_disk command on the server works fine:

$: ./check_disk -w 20% -c 10% -p /tmp
DISK OK - free space: /tmp 4785 MB (97% inode=99%);| /tmp=124MB;3928;4419;0;4910
$: echo $?
0

And as shown above, it works when done through the check_nrpe executable on the nagios server. This means:

  1. The command (check_disk) is present on the remote system:
    command[check_disk]=/usr/local/nagios/libexec/check_disk $ARG1$
  2. The nagios server is able to talk to the remote nrpe (i.e. it can reach it on the network and its IP is present in the only_from directive in /etc/xinetd.d/nrpe)

Additionally, this check runs fine on some other machines, but not on all of them.

Why does Nagios think it's getting a 255 when everything I can see means it should be getting 0 and thus marking the service as OK?

EDIT: The Nagios server runs Nagios Core 4 on CentOS 7. The hosts being checked run CentOS 5 through 7, and the problem appears on multiple machines of varying versions.

Best Answer

When you have a check_command as follows:

check_command check_nrpe!check_disk

the command name that gets looked up on the client side is check_disk, not check_nrpe.

Cause of problem

The service definition on the Nagios server asks the monitored client to execute the check_disk command with ONE argument:

-w 20% -c 10% -p /tmp
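
To see why the quoted string counts as a single argument, here is a minimal shell sketch. count_args is a hypothetical helper written only for this illustration; it is not part of Nagios or NRPE.

```shell
# count_args: hypothetical helper that prints how many arguments it received.
count_args() {
  echo "$#"
}

# Quoted, as in the service definition: the shell passes one argument.
count_args '-w 20% -c 10% -p /tmp'

# Unquoted: the shell splits the same text into six separate arguments.
count_args -w 20% -c 10% -p /tmp
```

The first call corresponds to what check_nrpe receives after -a, so the whole string ends up in $ARG1$ on the client.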

Your current check_disk command definition in nrpe.cfg on the Nagios client is:

command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$

So the command that actually gets run on the monitored client via NRPE is:

/usr/lib64/nagios/plugins/check_disk -w -w 20% -c 10% -p /tmp -c $ARG2$ -p $ARG3$
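
The expansion can be reproduced with a rough sketch that treats $ARGn$ as plain text substitution. expand_macros is a hypothetical helper, and leaving unmatched placeholders untouched is a simplification made here for illustration, not a statement of how NRPE handles them internally.

```shell
# expand_macros TEMPLATE [ARG...]
# Hypothetical helper: replaces $ARG1$, $ARG2$, ... in TEMPLATE with the
# positional arguments that follow. Placeholders without a matching
# argument are left as-is (a simplification for this illustration).
expand_macros() {
  template=$1
  shift
  n=1
  for arg in "$@"; do
    # awk's index() does a literal string search, so "$" and "%" need no escaping.
    template=$(awk -v t="$template" -v m="\$ARG${n}\$" -v a="$arg" 'BEGIN {
      out = ""
      while ((i = index(t, m)) > 0) {
        out = out substr(t, 1, i - 1) a
        t = substr(t, i + length(m))
      }
      print out t
    }')
    n=$((n + 1))
  done
  printf '%s\n' "$template"
}

# One argument fed into a three-placeholder definition:
expand_macros '/usr/lib64/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$' \
  '-w 20% -c 10% -p /tmp'

# The same argument fed into a single-placeholder definition:
expand_macros '/usr/lib64/nagios/plugins/check_disk $ARG1$' \
  '-w 20% -c 10% -p /tmp'
```

The first call shows the doubled -w, while the second produces a well-formed check_disk command line.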

Therefore, the check fails because the command cannot be executed successfully.
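
For context, Nagios only accepts plugin return codes 0 through 3 and reports anything else as out of bounds. The following sketch mirrors that mapping; nagios_state is a hypothetical helper for illustration, not a Nagios command.

```shell
# nagios_state CODE
# Hypothetical helper: prints the service state Nagios associates with a
# plugin return code. Only 0-3 are valid; everything else is out of bounds.
nagios_state() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARNING" ;;
    2) echo "CRITICAL" ;;
    3) echo "UNKNOWN" ;;
    *) echo "Return code of $1 is out of bounds" ;;
  esac
}

nagios_state 0
nagios_state 255
```

After running a check by hand, calling nagios_state "$?" shows how that exit status would be interpreted.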

Solution

If you want to pass the three options (-w, -c and -p) from the Nagios server to the client, modify your check_command as follows:

check_command check_nrpe!check_disk -a '-w 20% -c 10% -p /tmp'

Make sure the corresponding command is configured on the Nagios client:

command[check_disk]=/usr/lib64/nagios/plugins/check_disk $ARG1$

Another option is to change the server configuration as follows:

check_command check_nrpe!check_disk

With the corresponding client configuration:

command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /tmp