I haven't changed the nagios3 config or anything on the OS (Debian) filesystem, but when I add some extra devices (on top of the 12,000+ already on it) I suddenly get:
[1508925621] Warning: Return code of 127 for check of service 'PING' on host 'SOME-HOST.CISCO' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1508925621] SERVICE ALERT: SOME-HOST.CISCO;PING;CRITICAL;HARD;3;(Return code of 127 is out of bounds - plugin may be missing)
All the binaries are readable and executable; none of that has changed since setup.
It happens for ALL hosts of that type. Bear in mind this is a setup that has worked non-stop for years. The only thing I can think of is that some kind of OS limit is being hit when running the checks, because the number of hosts is the only thing that changes.
I've had max_concurrent_checks=1500
for a long time. (It's a 16-core, 24 GB RAM physical server.)
Apart from the concurrent checks limit, I also run with:
check_result_reaper_frequency=25
max_check_result_reaper_time=20
The hosts in the large group are configured like this:
define host{
        use                     generic-cisco
        host_name               SOME_HOST.CISCO
        alias                   SOME_HOST.CISCO
        address                 xxx.xxx.xxx.xxx
        check_command           check-host-alive
        hostgroups              cisco_devices
        }
define service{
        use                     generic-service
        host_name               SOME_HOST.CISCO
        service_description     PING
        check_command           check_ping!200.0,20%!600.0,60%
        normal_check_interval   10
        retry_check_interval    5
        }
The only thing that returns it to a working state is removing some of the more recently added hosts, then stopping and starting Nagios and hoping it runs fine. Any suggestions?
Best Answer
What fixed it: although I had followed many other performance recommendations, I hadn't disabled
enable_environment_macros
Not a dent in performance now. Apparently the problem was that the OS was struggling to make those environment variables available for that many hosts. Found through here. I like a good Nagios facepalm.
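For anyone landing here with the same symptom, the change amounts to a single line in the main Nagios config. This is only a sketch: the file path in the comment is an assumption based on the stock Debian nagios3 package and may differ on your install.

```
# Main config file; /etc/nagios3/nagios.cfg is assumed here (Debian nagios3 default).
# 0 = do not export macros as NAGIOS_* environment variables to checks.
enable_environment_macros=0
```

Nagios has to be restarted for this to take effect. Any plugin or event handler that reads NAGIOS_* environment variables will stop seeing them, so check for those first and pass the values as command-line macros instead.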