Why are my AWS instances suddenly becoming unresponsive and reporting high "stolen" CPU?

amazon-ec2, amazon-web-services, cpu-usage, scaling

The setup
I have a bunch of t2.small EC2 instances hosting the image processing library thumbor for simple on-the-fly image resizing. Originals are loaded from S3. In front of the instances I have an Elastic Load Balancer (ELB). I have New Relic server monitoring installed on the servers.

The problem
At random times, my servers suddenly start to experience extremely high average response times. If I look at the stats in New Relic, the only thing I see is that the servers' CPU spikes, consistently reported as "stolen" CPU.

My servers seem to have enough capacity, and there are NO extreme spikes in throughput at the same time.

I have noticed that if I stop/start the servers, the stolen CPU disappears and they run fine again, until the next time; it can be hours or days in between.

Why is this happening, and what can I do about it?

Screenshot: New Relic server monitoring reporting sudden high stolen CPU

Screenshot: ELB reporting high response times with no significant increase in throughput
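
For reference, steal time can also be sampled directly from /proc/stat on the instance, independent of New Relic; the snippet below is a minimal sketch, assuming a Linux guest:

```python
#!/usr/bin/env python3
"""Rough check of CPU steal time by sampling /proc/stat twice."""
import time

def read_cpu_times():
    # First line of /proc/stat: cpu user nice system idle iowait irq softirq steal ...
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]
    return [int(x) for x in fields]

before = read_cpu_times()
time.sleep(5)
after = read_cpu_times()

deltas = [a - b for a, b in zip(after, before)]
total = sum(deltas) or 1
steal = deltas[7] if len(deltas) > 7 else 0  # 8th field is "steal"
print(f"steal: {100.0 * steal / total:.1f}% of CPU time over the sample window")
```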

Best Answer

The t-series instances at Amazon use a quota system (CPU credits) for CPU usage. When you exhaust your quota, you start seeing your stolen percentage rise. There isn't much you can do about that; it's structural to the offering. You can confirm it by watching the instance's CPUCreditBalance metric in CloudWatch (see the sketch after the list below). Your options are:

  • Use less CPU overall.
  • Use a larger t-series instance.
  • Use one of the m-series or c-series instances, which don't have a quota.
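
To see where you stand against that quota, you can query the CPUCreditBalance metric from CloudWatch. A minimal sketch with boto3; the instance ID and region are placeholders you would replace with your own:

```python
"""Sketch: check the CPUCreditBalance of a t2 instance via CloudWatch."""
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # adjust region
now = datetime.datetime.now(datetime.timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder ID
    StartTime=now - datetime.timedelta(hours=6),
    EndTime=now,
    Period=300,  # 5-minute datapoints
    Statistics=["Average"],
)

# Print the credit balance over the last six hours, oldest first
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1))
```

If the balance sits near zero whenever the stolen CPU spikes, you have hit the quota; that would also explain why a stop/start helps temporarily, since a freshly started t2 instance begins with launch credits.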