Nagios load spike every 7 hours

nagios

I have a Nagios XI server monitoring 631 services on 63 hosts. Every seven hours the load on the server spikes to around 20 and then gradually falls back to near zero.

There are no cron jobs running every 7 hours.

The server has 8 cores and 2 GB of RAM. RAM is not the issue: it still sits at about 1 GB free during the spikes, and upping it to 4 GB makes no difference. The server was also migrated to a new host a week or so ago with no changes.

We also have scheduled downtime on 17 of the monitored hosts so that they are only monitored 6am-6pm Mon-Fri; this seems to make no difference to the load spikes.

Most checks are done on Windows servers, using check_wmi_plus.

During load spikes, I tend to see 5-8 instances of check_wmi_plus.pl using 2-3% CPU, and a handful of httpd processes using about the same, but nothing stands out as using a lot of CPU. Those processes also roll over quite fast, so they are not hung or taking an unusually long time. The Service Check Execution Time in the Nagios XI Performance Monitor tends to peak at ~5.5s with averages around 1s.

Can anyone suggest a possible cause, or how I can further troubleshoot this?

Best Answer

A high load does NOT necessarily mean that you are using high levels of CPU; it only tells you the number of processes, at a snapshot in time, that are ready to run and receive CPU time, not how much CPU they actually use.
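You can see this distinction for yourself by logging the two numbers side by side. Here is a minimal sketch (using psutil, which is assumed to be installed; the one-sample-per-minute cadence is arbitrary):

```python
# Log the 1-minute load average alongside overall CPU usage, so you can see
# whether the load spikes actually coincide with high CPU.
import os
import time

import psutil

while True:
    load1, load5, load15 = os.getloadavg()       # count of processes ready to run (averaged)
    cpu_pct = psutil.cpu_percent(interval=1)     # % CPU actually used over the last second
    print(f"{time.strftime('%H:%M:%S')}  load1={load1:.2f}  cpu={cpu_pct:.1f}%")
    time.sleep(59)                               # roughly one sample per minute
```

On an 8-core box, a load of 20 with low CPU usage usually just means many processes were briefly queued at once, not that the CPUs were saturated.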

Nagios does spin off a lot of processes in quick succession, depending on how you have set up its monitoring schedules, and at times this causes a load spike: it launches many check processes as fast as possible, even though each one may require very little CPU or go almost immediately into a sleep/wait state.
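A rough back-of-envelope illustrates why. This is not Nagios's actual scheduler, and the 5-minute check interval is an assumption (the question doesn't state it):

```python
# Back-of-envelope: average check launch rate vs. what happens if checks bunch up.
services = 631
check_interval_s = 5 * 60                      # assumed normal check interval of 5 minutes
avg_per_sec = services / check_interval_s
print(f"average: {avg_per_sec:.1f} checks started per second")   # ~2.1

# If the schedule drifts so that, say, 30 checks come due in the same second,
# then to the extent those 30 short-lived processes are runnable at the same
# moment they all count toward the load average, even if each one only burns
# a few percent of one core for a second or two.
burst = 30
print(f"a burst of {burst} simultaneous checks can briefly add ~{burst} to the run queue")
```

A periodic beat like "every 7 hours" can simply be the point where several check intervals line up, so the bursts briefly get much larger than average.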

BTW, disabling NOTIFICATIONS in Nagios does not stop it from continuing to monitor a given host or service.
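As for further troubleshooting: capture what is actually on the run queue when a spike hits. One way is to log a process snapshot whenever the 1-minute load crosses a threshold. The sketch below uses psutil; the threshold and sample interval are arbitrary assumptions, not part of any Nagios tooling:

```python
# Whenever load1 exceeds a threshold, dump the processes that are runnable or
# in uninterruptible disk wait -- the states the load average counts.
import os
import time

import psutil

LOAD_THRESHOLD = 10        # arbitrary: only log details while load1 is above this
SAMPLE_INTERVAL = 30       # seconds between samples

while True:
    load1 = os.getloadavg()[0]
    if load1 >= LOAD_THRESHOLD:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"--- {stamp}  load1={load1:.1f} ---")
        for p in psutil.process_iter(["pid", "name", "status"]):
            if p.info["status"] in (psutil.STATUS_RUNNING, psutil.STATUS_DISK_SLEEP):
                print(f"{p.info['pid']:>7}  {p.info['status']:<12}  {p.info['name']}")
    time.sleep(SAMPLE_INTERVAL)
```

If the snapshots during the 7-hour spikes are dominated by check_wmi_plus.pl children, that points at check scheduling bunching up rather than any single misbehaving process.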
