Linux – Debug sudden load peaks

high-load linux

I need to debug sudden load peaks automatically. We already monitor with Nagios-like check scripts, but the load peaks are rare and short.

I am looking for a daemon which checks the load every N seconds and, if there is trouble, reports something like the output of ps aux --forest (and iotop --batch).

Graphs created with tools such as munin don't help here, since I need to identify the processes which cause the load.

Best Answer

Among the many options for local process monitoring (choose your poison) is monit. I use something like this in /etc/monit.d/system.conf on CentOS machines:

check system localhost
    if loadavg (1min) > 6 then alert
    if loadavg (5min) > 6 then alert
    if memory usage > 90% then alert
    if cpu usage (user) > 90% then alert
    if cpu usage (system) > 75% then alert
    if cpu usage (wait) > 75% then alert

You might want to be more aggressive with the checks and have the daemon run them more often, perhaps every 30 seconds, until you have identified the problem. In that case /etc/monit.conf would look something like this:

set daemon  30
set mailserver localhost
#set alert user@gmail.com but not on { instance }
set alert user@gmail.com
include /etc/monit.d/*
set httpd port 2812
        allow 127.0.0.1

If the default mail alert from monit does not provide enough information, you can have monit execute custom commands on alert conditions, like so:

check system localhost
    if loadavg (1min) > 6 then exec "/bin/bash -c '/usr/bin/top -n1 -b  | /bin/mail -s top-output userXXX@gmail.com'"
    if loadavg (5min) > 6 then exec "/bin/bash -c '/usr/bin/top -n1 -b  | /bin/mail -s top-output userXXX@gmail.com'"
    if cpu usage (user) > 90%  then exec "/bin/bash -c '/usr/bin/top -n1 -b  | /bin/mail -s top-output userXXX@gmail.com'"

(This obviously relies on the mail command being set up, but you can mail to local root instead and check it manually.)
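If mail is not an option, or you want the ps aux --forest and iotop output the question asks for, the exec approach above can point at a small script that writes snapshots to disk instead. The following is only a sketch under assumptions: the script name (/usr/local/bin/load-snapshot.sh), the snapshot directory, and the function names are all made up for illustration, and iotop usually needs root, so its section is skipped if the command is not installed.

```shell
#!/bin/bash
# Hypothetical helper, assumed to live at /usr/local/bin/load-snapshot.sh.
# Saves a timestamped process snapshot so short load peaks can be inspected later.

SNAPSHOT_DIR="${SNAPSHOT_DIR:-/var/log/load-snapshots}"   # assumed location

# Succeed (exit 0) only if load $1 is strictly greater than threshold $2.
load_exceeds() {
    awk -v load="$1" -v max="$2" 'BEGIN { exit !(load > max) }'
}

# Print the 1-minute load average (first field of /proc/loadavg).
current_load() {
    cut -d ' ' -f 1 /proc/loadavg
}

# Write one snapshot file and print its path.
take_snapshot() {
    mkdir -p "$SNAPSHOT_DIR" || return 1
    local out
    out="$SNAPSHOT_DIR/$(date +%Y%m%d-%H%M%S).txt"
    {
        echo "== loadavg =="
        cat /proc/loadavg
        echo "== ps aux --forest =="
        ps aux --forest
        if command -v iotop >/dev/null 2>&1; then
            echo "== iotop --batch =="
            iotop --batch --iter=1 --only 2>&1
        fi
    } > "$out"
    echo "$out"
}

# Standalone polling loop: $1 = load threshold, $2 = interval in seconds.
watch_load() {
    while true; do
        if load_exceeds "$(current_load)" "$1"; then
            take_snapshot
        fi
        sleep "$2"
    done
}

# With no argument the script does nothing.
case "${1:-}" in
    snapshot) take_snapshot ;;
    watch)    watch_load "${2:-6}" "${3:-30}" ;;
esac
```

With monit doing the polling, the rule would then be something like if loadavg (1min) > 6 then exec "/usr/local/bin/load-snapshot.sh snapshot". Without monit, running the script as "load-snapshot.sh watch 6 30" checks the load every 30 seconds on its own; the thresholds here just mirror the values used above.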
