Shell script to notify when CPU usage goes to 100%

monitoring sar unix

sar -u 1 | awk '{print $9}'

This gives me the "CPU Idle" value every second. I'd like to get an email when the value is "0" 10 times in a row.

What would be the appropriate way to do it?

I found a preliminary solution

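sar -u 1 | awk '{
    if (int($9) == 0) {        # idle is 0%: one more consecutive hit
        i = i + 1
        print i, $9
    } else {                   # any non-zero idle value resets the streak
        i = 0                  # (the original ">= 0" test reset it every time)
    }
    if (i >= 10) print "sending email"
}'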

But on the last line, where I print "sending email", I can't put a call to mutt, like this:

sar -u 1 | awk '{
    if (int($9) == 0) {
        i = i + 1
        print i, $9
    } else {
        i = 0
    }
    if (i >= 10) mutt -s "VPNC Problem" test@test.com < /home/semenov/strace.output
}'

The problem is that awk reports a syntax error at the mutt command. Any ideas?

Best Answer

The appropriate way to do it is to NOT do it.

CPU Utilization (either %used or %idle) is a bogus value to monitor - it can (and SHOULD) be 100% at various times during normal operation. Do you really want a bunch of alerts because you happened to get 5-10 web requests at the same time your monitoring system checked CPU utilization? I'm betting the answer is no.

Instead you should monitor Load Average (reported by uptime, among other tools), which is a measure of the number of processes that want to run right now (roughly the length of the run queue in OS scheduling terms).
It is usually reported as three values: the 1-minute load average ("now"), the 5-minute load average, and the 15-minute load average.
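
As a rough illustration of those three values, the sketch below pulls them out of a line of uptime output. The helper name parse_loadavg is made up for this example, the sample line is canned rather than real data, and the parsing assumes a locale that prints decimals with a dot.

```shell
#!/bin/sh
# parse_loadavg (hypothetical helper): read one line of `uptime` output on
# stdin and print the 1-, 5- and 15-minute load averages separated by spaces.
parse_loadavg() {
    # Drop everything up to "load average:" (Linux) or "load averages:"
    # (BSD/macOS), then turn the separating commas into spaces.
    sed -e 's/.*load averages*: *//' -e 's/,/ /g'
}

# Canned sample line; on a live system use: LC_ALL=C uptime | parse_loadavg
sample=' 10:15:01 up 3 days,  2:04,  1 user,  load average: 0.15, 0.10, 0.05'
set -- $(printf '%s\n' "$sample" | parse_loadavg)
echo "1min=$1 5min=$2 15min=$3"
```

On Linux the same three numbers are also the first three fields of /proc/loadavg, which avoids parsing uptime at all.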


Load averages below 1 indicate an "unloaded" system (lots of free CPU time, no programs waiting around to execute).
High load averages ("high" being relative to the number of CPUs you have and your system's interactive performance under load) are a cause for concern, and should be investigated.
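
Since "high" depends on the number of CPUs, one way to make the number comparable across machines is to divide it by the CPU count. This is only a sketch: load_per_cpu is a hypothetical helper, and the inputs below are made-up figures.

```shell
#!/bin/sh
# load_per_cpu LOAD NCPU (hypothetical helper): print LOAD divided by NCPU
# with two decimals, so machines of different sizes can be compared.
load_per_cpu() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN {
        if (ncpu + 0 < 1) ncpu = 1      # guard against a missing or zero CPU count
        printf "%.2f\n", load / ncpu
    }'
}

load_per_cpu 8.00 4     # 4 CPUs with a load of 8: two runnable processes per CPU
load_per_cpu 1.00 4     # 4 CPUs with a load of 1: mostly idle
```

On Linux and macOS, `getconf _NPROCESSORS_ONLN` usually reports the CPU count, though POSIX does not guarantee that variable exists.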

I typically use 10 as my threshold for load average alarms -- a value high enough that you shouldn't typically see it in production, but low enough that you should have time to respond to the situation once the alarm trips.


The script to monitor in either case is trivial:

# Get your value and stuff it into $value
# Pick an appropriate threshold and stuff it into $threshold
# Note: [ -gt ] compares integers only, so strip any fractional part first
if [ "$value" -gt "$threshold" ]; then  # (-gt or -lt as appropriate)
    echo "$(hostname) needs attention!" | \
         mail -s "$(hostname) monitoring alert" user@host
fi

The getting-and-stuffing part is left as an exercise for the reader.
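
For what it's worth, here is one possible take on that exercise for the load-average case. The helper names exceeds and first_loadavg are invented for this sketch, the uptime line is canned, and awk does the comparison because load averages are fractional while the shell's -gt only handles integers.

```shell
#!/bin/sh
# exceeds VALUE THRESHOLD (hypothetical helper): succeed when VALUE > THRESHOLD.
exceeds() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# first_loadavg (hypothetical helper): print the 1-minute load average
# from a line of `uptime` output read on stdin.
first_loadavg() {
    sed -e 's/.*load averages*: *//' -e 's/[, ].*//'
}

threshold=10
# Canned line for illustration; a real script would use: LC_ALL=C uptime | first_loadavg
value=$(printf '%s\n' 'up 3 days, 1 user, load average: 12.40, 9.80, 7.10' | first_loadavg)

if exceeds "$value" "$threshold"; then
    # In a real script, pipe this message into mail as in the snippet above.
    echo "load $value is above $threshold"
fi
```

Run from cron every minute or so, a script like this avoids keeping a long-lived pipeline alive just to watch one number.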
If you really want to Do It Right you should investigate some monitoring systems and SNMP...
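
As for the syntax error in the question itself: everything between the single quotes is awk code, so awk tries to parse the mutt command line as awk syntax. A shell command has to be handed to awk's system() function as a string instead. The sketch below shows that pattern under some assumptions: watch_idle, COL, LIMIT and ALERT_CMD are names invented here, and the canned input stands in for sar output.

```shell
#!/bin/sh
# watch_idle (hypothetical helper): read sar-style lines on stdin and run
# ALERT_CMD via awk's system() after LIMIT consecutive readings whose
# integer part is 0 in column COL.
watch_idle() {
    awk -v col="${COL:-9}" -v limit="${LIMIT:-10}" -v cmd="$ALERT_CMD" '
        $col ~ /^[0-9.]+$/ {                 # skip headers and blank lines
            if (int($col) == 0) { zeros++ } else { zeros = 0 }
            if (zeros >= limit) {
                system(cmd)                  # the shell, not awk, parses cmd
                zeros = 0                    # start a new streak after alerting
            }
        }'
}

# Demo with canned values in column 1 and a harmless command.
printf '0.00\n0.00\n0.00\n' |
    COL=1 LIMIT=3 ALERT_CMD='echo "would send mail now"' watch_idle
```

With real data the call might look like `sar -u 1 | ALERT_CMD='mutt -s "VPNC Problem" test@test.com < /home/semenov/strace.output' watch_idle`, keeping in mind that the column holding %idle depends on the sar version and time format.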