High load on a nagios server: how many service checks are too many?

hardware high-load nagios

I have a nagios server running Ubuntu with a 2.0 GHz Intel processor, a RAID10 array, and 400 MB of RAM. It monitors a total of 42 services across 8 hosts, most of which are checked using the check_http plugin every 5 minutes, some every minute. Recently the load on the nagios server has been above 4, often as high as 6. The server also runs cacti, gathering statistics every minute for 6 hosts.

I wonder, how many services should hardware like this be able to handle? Is the load so high because I am pushing the limits of the hardware, or should this hardware be able to handle 42 service checks plus cacti? If the hardware is inadequate, should I look to add more RAM, more cores, or faster cores? What hardware / service checks are others running?

Best Answer

You need to figure out where your bottleneck is...

I run a nagios monitor that checks 400+ hosts with http, ping and ssh checks, along with a lot of other passive checks and nscd.

This is on a 2xQuadCore server with 4 SAS disks in RAID10.
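For a rough sense of scale, here is a back-of-the-envelope sketch of the check rate implied by the numbers in the question. The split between 1-minute and 5-minute checks isn't given, so it only computes a lower and an upper bound; the function name is just for illustration.

```python
def checks_per_second(num_services: int, interval_seconds: float) -> float:
    """Average number of checks started per second at a fixed interval."""
    return num_services / interval_seconds


# Numbers from the question: 42 services, checked every 1 to 5 minutes.
# The exact mix of intervals is unknown, so compute both extremes.
lower = checks_per_second(42, 5 * 60)  # every service on a 5-minute interval
upper = checks_per_second(42, 60)      # every service on a 1-minute interval

print(f"between {lower:.2f} and {upper:.2f} checks per second")
# prints "between 0.14 and 0.70 checks per second"
```

Even the upper bound is under one check per second, which suggests looking for the bottleneck elsewhere before blaming the number of checks.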

I suspect you're having IO contention, as writing to lots of rrds is very inefficient.

You need to figure out which process is taking up your resources: cacti, nagios or something else.

For checking IO, I like iotop. Install it; the Ubuntu 9.04 package works on 8.04.

But otherwise top should also help you find your load hog.
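If you want a quick look at what is feeding the load average without installing anything, here is a minimal sketch, assuming a Linux /proc filesystem. On Linux the load average counts tasks that are runnable (state R) or in uninterruptible sleep (state D, usually waiting on disk), so a snapshot of those states hints at whether you are CPU-bound or IO-bound. The function names are made up for this example, and it only looks at each process's main thread at a single moment, so run it a few times.

```python
import os
from collections import Counter


def parse_stat(stat_line: str) -> tuple[int, str, str]:
    """Return (pid, command name, state) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    name = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, name, state


def load_contributors(proc_root: str = "/proc") -> list[tuple[int, str, str]]:
    """List processes that are runnable (R) or in uninterruptible sleep (D)."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux-style /proc, nothing to inspect
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, name, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-read, or an unexpected line
        if state in ("R", "D"):
            found.append((pid, name, state))
    return found


if __name__ == "__main__":
    procs = load_contributors()
    for pid, name, state in procs:
        print(f"{state} {pid:>7} {name}")
    print(Counter(state for _, _, state in procs))
```

Mostly D entries (for example rrdtool or the cacti poller) would point at IO contention; mostly R entries would point at CPU.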

Polling cacti once a minute is pretty aggressive; I run mine at 5-minute intervals.

One approach I've heard of for rrd write contention is to put your rrd stores on a ramdisk/tmpfs. Be sure to rsync them to persistent storage every now and then, since anything on a ramdisk is lost on reboot.
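The periodic copy back to disk could look something like the sketch below, meant to be run from cron. Both directory paths are hypothetical placeholders, not real cacti defaults, so substitute your own. It deliberately leaves out rsync's --delete option so that an empty ramdisk after a reboot can't wipe the saved copy; you would still need to copy the files back onto the ramdisk at boot before starting the poller.

```python
import subprocess

# Hypothetical locations, adjust to your setup.
RAM_DIR = "/mnt/rrd-ram"          # tmpfs mount holding the live rrd files
DISK_DIR = "/var/backups/rrd"     # persistent copy on the RAID array


def build_rsync_command(ram_dir: str, disk_dir: str) -> list[str]:
    """Build an rsync command that copies the contents of ram_dir into disk_dir."""
    # A trailing slash on the source makes rsync copy the directory's
    # contents rather than the directory itself.
    return ["rsync", "-a", ram_dir.rstrip("/") + "/", disk_dir.rstrip("/") + "/"]


def sync_to_disk(ram_dir: str = RAM_DIR, disk_dir: str = DISK_DIR) -> None:
    """Run rsync and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_rsync_command(ram_dir, disk_dir), check=True)


# Example call (commented out so importing this file has no side effects):
# sync_to_disk()
```

How often to run it is a trade-off: a longer gap means fewer writes to disk but more graph data lost if the box goes down.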

Good luck.