Why are my AWS instances suddenly becoming unresponsive and reporting high "stolen" CPU?

amazon-ec2, amazon-web-services, cpu-usage, scaling

The setup
I have a bunch of t2.small EC2 instances hosting the image processing library thumbor for simple on-the-fly image resizing. Originals are loaded from S3. In front of the instances I have an Elastic Load Balancer (ELB). I have New Relic server monitoring installed on the servers.

The problem
At random times, my servers suddenly start to experience extremely high average response times. If I look at the stats in New Relic, the only thing I see is that the servers' CPU spikes, consistently reported as "stolen" CPU.

My servers seem to have enough capacity, and there are NO extreme spikes in throughput at the same time.

I have noticed that if I stop/start the servers, the stolen CPU disappears and they run fine again, until the next time; it can be hours or days in between.

Why is this happening, and what can I do about it?

Screenshot: New Relic server monitoring reporting sudden high stolen CPU

Screenshot: ELB reporting high response times with no significant increase in throughput
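
For reference, steal time can also be sampled directly from /proc/stat on the instance, independent of New Relic; the snippet below is a minimal sketch, assuming a Linux guest:

```python
#!/usr/bin/env python3
"""Rough check of CPU steal time by sampling /proc/stat twice."""
import time

def read_cpu_times():
    # First line of /proc/stat: cpu user nice system idle iowait irq softirq steal ...
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]
    return [int(x) for x in fields]

before = read_cpu_times()
time.sleep(5)
after = read_cpu_times()

deltas = [a - b for a, b in zip(after, before)]
total = sum(deltas) or 1
steal = deltas[7] if len(deltas) > 7 else 0  # 8th field is "steal"
print(f"steal: {100.0 * steal / total:.1f}% of CPU time over the sample window")
```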

Best Answer

The t-series instances at Amazon use a quota system (CPU credits) for CPU usage. When you exhaust your quota, you start seeing your stolen percentage rise. There isn't much you can do about that; it's structural to the offering. You can confirm it by watching the instance's CPUCreditBalance metric in CloudWatch (see the sketch after the list below). Your options are:

  • Use less CPU overall.
  • Use a larger t-series instance.
  • Use one of the m-series or c-series instances, which don't have a quota.
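
To see where you stand against that quota, you can query the CPUCreditBalance metric from CloudWatch. A minimal sketch with boto3; the instance ID and region are placeholders you would replace with your own:

```python
"""Sketch: check the CPUCreditBalance of a t2 instance via CloudWatch."""
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # adjust region
now = datetime.datetime.now(datetime.timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder ID
    StartTime=now - datetime.timedelta(hours=6),
    EndTime=now,
    Period=300,  # 5-minute datapoints
    Statistics=["Average"],
)

# Print the credit balance over the last six hours, oldest first
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1))
```

If the balance sits near zero whenever the stolen CPU spikes, you have hit the quota; that would also explain why a stop/start helps temporarily, since a freshly started t2 instance begins with launch credits.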