High load on a nagios server — How many service checks for a nagios server is too many

hardware high-load nagios

I have a nagios server running Ubuntu with a 2.0 GHz Intel processor, a RAID10 array, and 400 MB of RAM. It monitors a total of 42 services across 8 hosts, most of which are checked using the check_http plugin every 5 minutes, some every minute. Recently the load on the nagios server has been above 4, often as high as 6. The server also runs cacti, gathering statistics every minute for 6 hosts.

I wonder, how many services should hardware like this be able to handle? Is the load so high because I am pushing the limits of the hardware, or should this hardware be able to handle 42 service checks plus cacti? If the hardware is inadequate, should I look to add more RAM, more cores, or faster cores? What hardware / service checks are others running?

Best Answer

You need to figure out where your bottleneck is...

I run a nagios monitor that checks 400+ hosts with http, ping and ssh checks. (along with a lot of other passive checks and nscd)

This is on a 2xQuadCore server with 4 SAS disks in RAID10.

I suspect you're having IO contention, as writing to lots of rrds is very inefficient.

You need to figure out which process is taking up your resources. (cacti, nagios or something else)

For IO checking, I like iotop. Install iotop (the 9.04 package works on 8.04)

But otherwise top should also help you find your load hog.
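As a rough first check for the IO-contention theory, you can look at how much CPU time the box spends in iowait. The sketch below assumes a Linux host with /proc/stat; iowait_pct is a made-up helper name, and the figure it prints is an average since boot, so treat it as a hint rather than a measurement of the current load.

```shell
#!/bin/bash
# iowait_pct: hypothetical helper. Reads the aggregate "cpu" line of
# /proc/stat on stdin and prints the share of CPU time spent in iowait
# since boot, as a percentage.
iowait_pct() {
    awk '$1 == "cpu" {
        total = 0
        for (i = 2; i <= 9; i++) total += $i   # user, nice, system, idle, iowait, irq, softirq, steal
        printf "%.1f\n", 100 * $6 / total      # field 6 is iowait
    }'
}

# Typical use on the monitoring host:
#   iowait_pct < /proc/stat
# Per-process views mentioned above:
#   top          # the "wa" value in the CPU line is iowait
#   iotop -o     # only show processes that are actually doing IO (needs root)
```

If the iowait share is large and iotop shows rrdtool or the cacti poller at the top, that points at the disks rather than the CPU.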

Cacti once a minute is pretty aggressive. (I run mine at 5m intervals)

One approach I've heard of for rrd write contention is to put your rrd stores on a ramdisk/tmpfs. (be sure to rsync that every now and then to persistent storage)
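A minimal sketch of the tmpfs idea, in case it helps. The paths, the tmpfs size and the save_rrds helper are all assumptions for illustration; check where your cacti install actually keeps its rrd files before trying anything like this.

```shell
#!/bin/bash
# Assumed paths, adjust to your installation:
RRD_LIVE=/var/lib/cacti/rra        # would become the tmpfs mount point
RRD_SAVE=/var/lib/cacti/rra.disk   # copy kept on persistent storage

# save_rrds: hypothetical helper that mirrors directory $1 into directory $2.
# RSYNC can be overridden (for example RSYNC=echo) to preview the command.
save_rrds() {
    ${RSYNC:-rsync} -a --delete "$1/" "$2/"
}

# One-time setup as root (the size is a guess, check with du -sh first):
#   mount -t tmpfs -o size=256m tmpfs "$RRD_LIVE"
#   save_rrds "$RRD_SAVE" "$RRD_LIVE"     # restore the saved copy after boot
# Periodically (e.g. from cron) and before shutdown:
#   save_rrds "$RRD_LIVE" "$RRD_SAVE"
```

Anything written since the last sync is lost on a crash or power failure, so pick the sync interval with that in mind.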

Good luck.

Related Solutions

Nagios – How to Dynamically Set a New Test Interval for Nagios Checks

You can do it by using CHANGE_NORMAL_SVC_CHECK_INTERVAL and CHANGE_NORMAL_HOST_CHECK_INTERVAL.

Add an event handler for your service:

define service {
    host_name              ...
    service_description    ...
    check_command          ...
    contact_groups         ...
    event_handler          change_check_interval
}

The change_check_interval command is defined in commands.cfg:

define command {
    command_name    change_check_interval
    command_line    $USER1$/eventhandlers/change_check_interval.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$
}

The content of change_check_interval.sh:

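#!/bin/bash
# Event handler: shorten the check interval when the service goes CRITICAL.
# Arguments (see commands.cfg): $1 = service state, $2 = state type,
# $3 = attempt number, $4 = host address.

now=$(date +%s)
commandfile='/usr/local/nagios/var/rw/nagios.cmd'

case "$1" in
    OK)
        ;;
    WARNING)
        ;;
    UNKNOWN)
        ;;
    CRITICAL)
        # host1 and service1 are placeholders for the real host and service names.
        /bin/printf "[%lu] CHANGE_NORMAL_SVC_CHECK_INTERVAL;host1;service1;2\n" "$now" > "$commandfile"
        ;;
esac

exit 0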
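The script above hard-codes host1 and service1. If you want one handler for several services, a possible variation is to build the command from arguments instead. This is only a sketch: build_interval_cmd is a made-up helper, and it assumes you change the command_line to pass $HOSTNAME$ and '$SERVICEDESC$' (quoted, since descriptions may contain spaces) as extra arguments.

```shell
#!/bin/bash
# build_interval_cmd: hypothetical helper that formats the external command.
# Arguments: timestamp, host name, service description, new check interval.
build_interval_cmd() {
    printf '[%lu] CHANGE_NORMAL_SVC_CHECK_INTERVAL;%s;%s;%s\n' "$1" "$2" "$3" "$4"
}

# Inside the CRITICAL branch this could replace the hard-coded line, assuming
# the host name arrives as $4 and the service description as $5:
#   build_interval_cmd "$(date +%s)" "$4" "$5" 2 > "$commandfile"
```

For example, build_interval_cmd 1700000000 web1 HTTP 2 prints [1700000000] CHANGE_NORMAL_SVC_CHECK_INTERVAL;web1;HTTP;2.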

Make sure that external commands are enabled in nagios.cfg:

check_external_commands=1

How to check if a service (that listens on given port) is up & working

Keeping to the standard Nagios plugins found on, say, an Ubuntu repository, you can use the check_tcp plugin to send a string, and then check to see if it returns the expected response:

Usage: check_tcp -H host -p port [-w <warning time>] [-c <critical time>] [-s <send string>]
[-e <expect string>] [-q <quit string>] [-m <maximum bytes>] [-d <delay>]
[-t <timeout seconds>] [-r <refuse state>] [-M <mismatch state>] [-v] [-4|-6] [-j]
[-D <days to cert expiry>] [-S <use SSL>] [-E]

Since you can modify your service, you can do something like "Are you OK?" and look for "I'm OK". It depends on how involved you want to get with checking to see if the service is up and running.
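Here is roughly what such a send/expect call could look like from a shell. The plugin path is where the Ubuntu package usually puts it (an assumption, check your system), probe_service is a made-up wrapper name, and the host, port and strings are placeholders.

```shell
#!/bin/bash
# probe_service: hypothetical wrapper around check_tcp.
# Arguments: host, port, string to send, string expected in the reply.
# CHECK_TCP can be overridden (for example CHECK_TCP=echo) to preview the call.
probe_service() {
    "${CHECK_TCP:-/usr/lib/nagios/plugins/check_tcp}" -H "$1" -p "$2" -s "$3" -e "$4"
}

# Example call (placeholder host and port):
#   probe_service 192.0.2.10 9000 "Are you OK?" "I'm OK"
```

The wrapper only uses the -H, -p, -s and -e options from the usage text above; timeouts and thresholds can be added the same way.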

You can also use check_procs to see if the process for the service is there. This might be in conjunction with a check_tcp check, or as an alternative. Again, it depends on what you're doing, and how much you actually want to do. If you want to get very involved, you can write a custom Nagios check that will do all sorts of things to verify the functionality of the service and return custom state messages to the Nagios server.
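If you do go the custom-check route, the core of a plugin is just printing one status line and exiting with the right code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A bare-bones sketch follows; evaluate_reply is a made-up helper, and mapping an unexpected reply to WARNING is my choice for the example, not a rule.

```shell
#!/bin/bash
# evaluate_reply: hypothetical helper that turns the service's reply into a
# Nagios status line and return code.
# Arguments: expected reply, actual reply.
evaluate_reply() {
    local expected=$1 reply=$2
    if [ -z "$reply" ]; then
        echo "CRITICAL - no reply from service"
        return 2
    elif [ "$reply" = "$expected" ]; then
        echo "OK - service answered: $reply"
        return 0
    else
        echo "WARNING - unexpected reply: $reply"
        return 1
    fi
}

# Example wiring (host, port and strings are placeholders):
#   exec 3<>/dev/tcp/192.0.2.10/9000     # bash-only TCP client
#   printf 'Are you OK?\n' >&3
#   read -r -t 5 reply <&3
#   evaluate_reply "I'm OK" "$reply"; exit $?
```

From there you can add whatever functional checks make sense for the service, as long as the script keeps to the one-line output and exit code convention.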
