Linux – How to diagnose and fix frequent 100% CPU utilization from the kernel

Tags: amazon-ec2, central-processing-unit, kernel, linux

I have an Amazon EC2 micro instance running an old 2.6.16 kernel. It runs postfix, apache, and mysql. Under normal conditions its load average is around 0.05, and it runs this way roughly 95% of the time. However, a few times a day the CPU usage spikes to 100% and the system becomes nearly unusable. This usually lasts about 5 minutes, then the load returns to normal.

If I manage to take a look at htop while this happens (not easy, since the load is that severe), I see that no running task accounts for any significant CPU usage, which leads me to believe this must all be taking place in kernel-land.

How can I diagnose the cause of this load and, more importantly, fix it?

Best Answer

What is the percentage of "iowait" and "steal" CPU time during these periods?

Iowait is the share of time the CPU sat idle while waiting for outstanding IO requests to complete. Steal is CPU time that your kernel requested but was denied by the hypervisor.

EC2 t1.micro instances are heavily CPU- and IO-constrained. They can burst for very short periods, after which they're subject to severe CPU throttling. Next time this happens, pay attention to %wa and %st in the output of top. My bet is that one or both of these account for a high percentage of CPU time.
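Since the box is nearly unusable during a spike, it may be easier to log these two numbers in the background and read them afterwards. The following is a minimal sketch in Python 3, not a polished tool: it samples the aggregate "cpu" line of /proc/stat (the same counters top uses for %wa and %st) and prints a line whenever either value crosses a threshold. The function names, the 5-second interval and the 20% threshold are all assumptions picked for illustration.

```python
import time

# Column order of the aggregate "cpu" line in /proc/stat, as described in proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Return a dict of tick counters parsed from the aggregate 'cpu' line."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Older kernels expose fewer columns; treat the missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Convert two counter samples into percentages of the elapsed CPU time."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def read_cpu_sample(path="/proc/stat"):
    """Read the current counters; the aggregate line is the first line of the file."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


def monitor(interval=5.0, threshold=20.0):
    """Loop forever, printing a timestamped line when iowait or steal is high.

    The interval (seconds) and threshold (percent) are arbitrary defaults.
    """
    previous = read_cpu_sample()
    while True:
        time.sleep(interval)
        current = read_cpu_sample()
        pct = cpu_percentages(previous, current)
        if pct["iowait"] >= threshold or pct["steal"] >= threshold:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print("%s iowait=%.1f%% steal=%.1f%%" % (stamp, pct["iowait"], pct["steal"]))
        previous = current
```

Calling monitor() from a script started under nohup or screen, with its output redirected to a file, should leave a record of each spike. High iowait points at disk or network IO, while high steal points at the hypervisor throttling the instance.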

To mitigate, you'll need to find the source of the IO and/or CPU load or, alternatively, resize your instance to an m1.small.
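If iowait turns out to be the culprit, one way to narrow down the source is to look for processes in uninterruptible sleep (state "D" in ps and top), which usually means they are blocked on IO. This is only a rough sketch in Python 3 under that assumption. It reads /proc/<pid>/stat directly, and the names parse_stat and blocked_processes are made up for this example.

```python
import os


def parse_stat(text):
    """Split the contents of /proc/<pid>/stat into (pid, command, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    command = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, command, state


def blocked_processes(proc_root="/proc"):
    """Return (pid, command) pairs for processes currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "stat" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, command, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or the file was unreadable
        if state == "D":
            blocked.append((pid, command))
    return blocked
```

Running blocked_processes() a few times during a spike and seeing the same command show up repeatedly (mysqld, say) would be a hint about where the IO is coming from. It is a snapshot, so a single empty result doesn't rule anything out.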