How to diagnose an EC2 instance that became unresponsive three times in a day

amazon ec2

My EC2 instance (t2.small) stopped accepting connections over SSH and to its other services, but the EC2 control panel showed the automatic status checks as still passing, even after several hours. I could not reboot it from the control panel, but I could stop it and start it again. It became unresponsive twice more that day.

After that, I configured cgroups to limit the CPU and memory usage of a mildly resource-hogging process, but that doesn't seem like the right answer: the process shouldn't have been able to bring the machine to a halt. (The instance has no swap, but the OOM killer should simply have killed a process if the instance ran out of memory.) "Get System Log" and "Get Instance Screenshot" showed nothing suspicious. The server runs some fairly trusted software, such as postfix and gitolite, plus an in-development server running under a regular user account. The CPU usage graph shows about 2.5% during that time, with occasional spikes to around 6%.
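For context, cgroup limits of that sort can be applied roughly like this. This is only a minimal sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup and the script runs as root; the group name, limit values, and PID are purely illustrative, and systemd-based setups would normally do this via slices instead:

    import os

    # Make sure the cpu and memory controllers are delegated to child groups
    # (often already enabled).
    with open("/sys/fs/cgroup/cgroup.subtree_control", "w") as f:
        f.write("+cpu +memory")

    # Illustrative group name.
    CG = "/sys/fs/cgroup/throttled"
    os.makedirs(CG, exist_ok=True)

    # Allow at most 25% of one CPU: 25000us of CPU time per 100000us period.
    with open(os.path.join(CG, "cpu.max"), "w") as f:
        f.write("25000 100000")

    # Cap memory at 256 MiB.
    with open(os.path.join(CG, "memory.max"), "w") as f:
        f.write(str(256 * 1024 * 1024))

    # Move the offending process (hypothetical PID) into the group.
    pid = 1234
    with open(os.path.join(CG, "cgroup.procs"), "w") as f:
        f.write(str(pid))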

What can I do to diagnose this and prevent it from happening again? All I can think of is a hardware issue, but I would have thought that was very unlikely.

Best Answer

I ran into a similar issue and eventually found the answer by looking at the CPU credit usage of my t-series instance. By default, t-series instances run on CPU credits, and once the credits are exhausted the instance is throttled down to its baseline performance, which in practice can leave it unresponsive until more credits accrue.
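One way to confirm this is to pull the CPUCreditBalance metric from CloudWatch and check whether it bottomed out around the times the instance went dark. A minimal sketch with boto3, where the region and instance ID are placeholders:

    import datetime
    import boto3

    INSTANCE_ID = "i-0123456789abcdef0"                              # placeholder instance ID
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

    now = datetime.datetime.utcnow()
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
        StartTime=now - datetime.timedelta(hours=24),
        EndTime=now,
        Period=300,                # 5-minute buckets
        Statistics=["Minimum"],
    )

    # A balance at or near zero lines up with the unresponsive periods.
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Minimum"])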

For me, unresponsive meant being unable to reach any of the services and websites the instance provided, including SSH connection attempts. Attempts to reboot the instance failed to revive it; only stopping it and starting it again would recover it.

I don’t know the semantics of the CPU credit system regarding power off/on events, but I assume that a power cycle must itself come with a credit award.

The solution for preventing a t2 or t3 instance from becoming unavailable in this way is one of the following:

  • enable T2 Unlimited (or T3 Unlimited), possibly at an extra cost
  • upgrade to a fixed-performance instance (m series); in my case this would have doubled the cost of the VM, as the upgrade path was from a t2.medium ($0.045/hr) to an m5d.large ($0.113/hr)
  • re-deploy to a new t-series instance, which comes with a fresh allocation of credits. The cost here is subtle: launching a fresh VM comes with some cost of its own, but I don't know exactly what it is.

T2/T3 Unlimited means that once a server has exhausted its CPU credits, additional charges apply for further CPU usage until more credits are awarded. Currently, excess usage is calculated over a 24-hour period.
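Unlimited mode can also be switched on for a running instance through the API. A minimal sketch with boto3, where the region and instance ID are again placeholders:

    import boto3

    INSTANCE_ID = "i-0123456789abcdef0"               # placeholder instance ID
    ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

    # Switch the instance's credit specification to unlimited mode.
    ec2.modify_instance_credit_specification(
        InstanceCreditSpecifications=[
            {"InstanceId": INSTANCE_ID, "CpuCredits": "unlimited"}
        ]
    )

    # Verify the change took effect.
    spec = ec2.describe_instance_credit_specifications(InstanceIds=[INSTANCE_ID])
    print(spec["InstanceCreditSpecifications"][0]["CpuCredits"])   # expect "unlimited"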

In our specific case, our t2 Jenkins server was becoming unresponsive on some days. It's likely that source code commits on those days came at a high enough frequency that CI ran more often and burned through the credits. We switched to the new t3 and enabled T3 Unlimited. It's only been a few days, but the problem has not resurfaced yet. Being a CI server, it is unlikely to run much outside of general working hours, and I strongly doubt the cost will ever actually be greater.

Hopefully this post mortem will help anyone hitting the same snag.