Why does Amazon EC2 status check succeed for unresponsive instance

Tags: amazon-ec2, amazon-cloudwatch, amazon-web-services

DANGER!

Do not run this command to 'test' it unless you are prepared for a crash and/or force-rebooting your system.

The steps I took:

  • I created a t1.micro instance on EC2 running Ubuntu 14.04 LTS.
  • I verified that both status checks passed.
  • I SSH'd into the instance.
  • I ran the fork bomb documented in Why did this command make my system lag so bad I had to reboot?.
    • My SSH session is shown below.
  • As you can see, the instance (quickly) ran out of memory, and my session terminated after a timeout.

I expected the instance status check to fail. However, both status checks continue to pass more than 20 minutes later. The instance is unresponsive to SSH and ping, although nmap reports that port 22 is open.

I was hoping to use the status check to determine if the instance was responsive and have its autoscaling group terminate and replace it, but it doesn't look like I'll be able to do that.

I have two questions:

  1. Why is the instance passing both status checks?
  2. Is there another solution (other than paying $18/month for a load balancer that isn't being used to balance load) to terminate instances that become unresponsive? Is there something I can do with CloudWatch alarms?
    • Ideally, I'd like to have the instance report its health periodically, and if it fails to do so for a certain amount of time, terminate it (and let my autoscaling group take care of the rest); a rough sketch of what I have in mind follows this list.
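Here's roughly what I'm imagining (untested; the alarm name, namespace, metric name, instance ID, and thresholds are placeholders, and the flag names are from the current AWS CLI): the instance pushes a custom heartbeat metric, and a CloudWatch alarm that treats missing data as breaching fires the built-in EC2 terminate action, after which the autoscaling group launches a replacement.

aws cloudwatch put-metric-alarm \
  --alarm-name "my-app-heartbeat-missing" \
  --namespace "Custom/Heartbeat" \
  --metric-name "Heartbeat" \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic SampleCount \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data breaching \
  --alarm-actions arn:aws:automate:us-east-1:ec2:terminate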

My SSH session:

Welcome to Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-57-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

  System information as of Thu Jul  9 18:50:39 UTC 2015

  System load: 0.0               Memory usage: 7%   Processes:       47
  Usage of /:  16.8% of 7.75GB   Swap usage:   0%   Users logged in: 0

  Graph this data and manage this system at:
    https://landscape.canonical.com/

  Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud


Last login: [[redacted]]
ubuntu@ip-172-31-18-225:~$ :(){ :|: & };:
[1] 1218
ubuntu@ip-172-31-18-225:~$ -bash: fork: Cannot allocate memory
-bash: fork: Cannot allocate memory
Connection to 52.2.62.141 closed.
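For reference, the one-liner above expands to this (same code, just spread out, with my comments added):

:() {        # define a function named ':'
    : | : &  # call ':' twice, piped together, in the background
}
:            # invoke it; the process count doubles every round until
             # fork() fails with "Cannot allocate memory"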

Edit:
So, my real goal is to close the gap between what the status checks actually verify and verifying that my application is running. If the status checks really do confirm that the kernel is running properly, it seems to me that I could use a kernel software watchdog (like the softdog kernel module) to close that gap; a rough sketch of that setup follows the questions below.

  • Do the status checks actually check that the kernel is running as it should?
  • If the status checks say the kernel is running, does that necessarily mean that all the kernel modules I've loaded are running properly?
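For what it's worth, this is roughly the softdog setup I have in mind (untested; the module parameter and the watchdog daemon's config options are from stock Ubuntu, and the 60-second margin is a guess). The idea is that if the box gets so wedged that the daemon can no longer write to /dev/watchdog, the kernel reboots it.

sudo modprobe softdog soft_margin=60   # reboot if /dev/watchdog isn't written to for 60s
sudo apt-get install watchdog          # userspace daemon that "pets" /dev/watchdog
# In /etc/watchdog.conf, point the daemon at the device and add a simple memory check:
#   watchdog-device = /dev/watchdog
#   min-memory      = 1
sudo service watchdog start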

Best Answer

Unresponsive != no heartbeats. The kernel is still running. AWS has no way of knowing that you've consumed all of your memory.

AWS CloudWatch monitoring is really just the bare minimum you should do. If you need more detailed monitoring, you'll need to roll your own.
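A minimal sketch of "rolling your own" along the lines you describe (assumptions: the AWS CLI is installed, the instance has an IAM role allowing cloudwatch:PutMetricData, and the namespace and metric name are placeholders that must match whatever alarm you create):

#!/bin/bash
# heartbeat.sh -- run from cron every minute; each datapoint is the "I'm alive" signal
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws cloudwatch put-metric-data \
  --region us-east-1 \
  --namespace "Custom/Heartbeat" \
  --metric-name "Heartbeat" \
  --dimensions InstanceId="$INSTANCE_ID" \
  --value 1

If the box wedges hard enough that cron stops running this, the datapoints stop, and an alarm configured to treat missing data as breaching can terminate the instance so the autoscaling group replaces it.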
