Amazon EC2 – How to Fix High CPU% Stolen at Regular Intervals

amazon ec2, central-processing-unit, virtual-machines

I have an m1.small EC2 instance on AWS running some websites. I noticed that my CPU usage peaks at regular intervals, exactly every 30 minutes (0:06, 0:36, 1:06, …).

I've checked my cron jobs (I have many), but none of them runs every 30 minutes. Looking at top, I noticed that the peaks are about 1 minute long and consist almost entirely of "stolen CPU" (%st). I've read that this is CPU time stolen by the Amazon VM hypervisor, but I can't understand why it happens (I'm not running anything CPU-intensive when it occurs) or why it's exactly every 30 minutes.

Do you have any clue? Should I buy a bigger instance? I hope not, because the rest of the time CPU usage is very low and the load average never goes over 0.5…

Cacti CPU graph

Best Answer

Depending on the EC2 instance type and the underlying hardware, you may not be paying for access to all of the underlying CPU cycles. Amazon is not going to give you access to 100% of a modern, fast CPU if you have asked for an m1.small, which is promised to be equivalent to an old, slow CPU.

On EC2, steal doesn't depend on the activity of other virtual machine neighbors. It is simply a matter of EC2 making sure you are not getting more CPU cycles than you are paying for.

If your m1.small gets 50% of the underlying faster CPU, then for every bit of CPU you are using, you will see another equal percentage flagged as steal.
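To check this ratio on your own instance, a minimal Python sketch along these lines could sample the aggregate `cpu` line of `/proc/stat` twice and report what share of the interval went to each state. It assumes a Linux guest where the eighth counter on that line is steal time (as described in proc(5)); the function names are made up for illustration.

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
# Guest time is already included in "user", so it is left out of the total.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    if len(values) < len(FIELDS):
        raise ValueError("this kernel does not report a steal counter")
    return dict(zip(FIELDS, values))


def percentages(before, after):
    """Share of elapsed ticks spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def measure(interval=5.0, path="/proc/stat"):
    """Sample /proc/stat twice, `interval` seconds apart (hypothetical helper)."""
    with open(path) as f:
        first = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        second = parse_cpu_line(f.readline())
    return percentages(first, second)
```

Calling `measure()` during one of the peaks and comparing the `steal` value with `user` plus `system` would show whether the two really rise together, as they would on an instance capped at about half of the physical CPU.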

It would be nice if EC2 presented the CPU you actually have access to as "100%", instead of teasing you with the rest of the CPU and then telling you that you can't have it when you try to use it, but that's the way it works given the current VM and host setup.

m1.small instances are likely to show a high percentage of steal given the limited CPU they have access to for the price compared to the CPU speeds on the underlying hardware.

If you are concerned that this particular instance might have something broken on EC2's side, you could stop/start it to move it to new hardware (my article on this) and see if that makes a difference. Of course, if the steal percentage drops, it might just indicate that you have moved to hardware with a slower CPU.

As to the activity every 30 minutes, that is caused by software running on your server. It could be a system cron job or it could be triggered by a daemon (background process).
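One way to hunt for a cron job is to expand the minute field of every crontab entry and see which ones fire at the minutes where the peaks appear. The sketch below is a simplified parser, not a full cron implementation: it handles `*`, lists, ranges and `/step`, skips `@hourly`-style shortcuts and anything else it cannot parse, and its function names are hypothetical.

```python
def expand_minute_field(field):
    """Return the set of minutes (0-59) matched by a cron minute field."""
    minutes = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/", 1)
            step = int(step_text)
        if part == "*":
            start, end = 0, 59
        elif "-" in part:
            low, high = part.split("-", 1)
            start, end = int(low), int(high)
        else:
            start = end = int(part)
        minutes.update(range(start, end + 1, step))
    return minutes


def find_matching_jobs(crontab_text, target_minutes):
    """List crontab lines whose minute field covers every target minute."""
    matches = []
    for raw in crontab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        try:
            fired = expand_minute_field(line.split()[0])
        except ValueError:
            continue  # variable assignment, @shortcut, or unsupported syntax
        if set(target_minutes) <= fired:
            matches.append(line)
    return matches
```

You could feed it the output of `crontab -l` for each user, or the contents of `/etc/crontab` and the files in `/etc/cron.d`, for example `find_matching_jobs(open("/etc/crontab").read(), {6, 36})`. Since a job may start a few minutes before the spike shows up in a graph, it is worth trying neighbouring minutes too. Work launched by daemons will not appear in any crontab, so an empty result does not rule out software on the server.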