We have two Auto Scaling groups (one for on-demand and one for spot instances), both set to a static number of instances (min, max, and desired are all the same: 5 in our case). The instances in the on-demand group stay running, but the ones in the spot group are frequently terminated due to a system health check. The message shown for a terminated instance in the Scaling History tab of the EC2 Management Console is, for example:
"At 2014-05-07T18:06:45Z an instance was taken out of service in
response to a system health-check."
I don't know why our spot instances are failing a health check. Our bid price is high, and based on the spot pricing history I don't think the instances should have been terminated due to spot price. I've also adjusted the Availability Zones the instances are launched in, with no difference. I don't see any suspicious messages when I check the syslog of a recently terminated instance. We're using a private/custom AMI for both groups, but I see the same behavior when I switch to a more generic AMI (the "Ubuntu 12.04 LTS Precise EBS boot" image listed on alestic.com – ami-5db4a934). Again, our on-demand instances stay running and don't fail health checks. We're using the "EC2" health check type.
Here is the command we're using to create our launch configuration via the AWS CLI:
aws autoscaling create-launch-configuration \
--launch-configuration-name [name] \
--image-id ami-5db4a934 \
--key-name [our key] \
--security-groups [our SGs] \
--instance-type m3.xlarge \
--block-device-mappings '[ { "DeviceName": "/dev/sda1", "Ebs": { "VolumeSize": 8 } } ]' \
--spot-price "1.00"
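For a bit more context than the console shows, we've also pulled the scaling history via the CLI; a minimal sketch (the group name is a placeholder for ours):

```shell
# List recent scaling activities for the spot group. Each activity
# includes Description, Cause, and StatusMessage fields, which carry
# more detail than the abbreviated Scaling History tab in the console.
# [our-spot-group] is a placeholder for your Auto Scaling group name.
aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name [our-spot-group] \
    --max-items 20
```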
Does anyone know what this might be or how we can get more visibility into why the spot instances are failing health checks?
Best Answer
Update
Spot price contention is not the only possible cause for an Amazon EC2 Spot Instance being terminated by AWS; another notable one is capacity contention. This seems to happen in us-east-1 more often than elsewhere so far, and much more frequently in recent months for the new m3/c3/i3 instance type families (an understandable effect of ramping up capacity over time).

You can verify the actual cause of a spot request termination manually in the AWS Management Console, or e.g. via the AWS CLI's describe-spot-instance-requests. For advanced spot instance usage I'd recommend starting with Tracking Spot Requests with Bid Status Codes and correlating those codes with your instance terminations for the best operational insight. See the Life Cycle of a Spot Request and the Spot Bid Status Code Reference for more details, specifically the following reasons for spot termination by AWS:
instance-terminated-by-price
instance-terminated-no-capacity
instance-terminated-capacity-oversubscribed
instance-terminated-launch-group-constraint
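The status codes above can be read straight off your spot requests with the AWS CLI, for example:

```shell
# Show the state, status code, and status message for each spot
# instance request in the region. The status code reveals the actual
# termination reason (e.g. instance-terminated-no-capacity) that
# Auto Scaling only reports as a failed system health check.
aws ec2 describe-spot-instance-requests \
    --query 'SpotInstanceRequests[*].[SpotInstanceRequestId,State,Status.Code,Status.Message]' \
    --output table
```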
Initial Answer
This misleading message is simply the one reported when an Amazon EC2 Spot Instance has been terminated due to spot price contention; see e.g. the AWS team's response to Auto Scaling Message & Spot Instance Termination:
While it escapes me why AWS hasn't yet managed to come up with a better integration between Auto Scaling and Amazon EC2 in this regard, it makes more sense when you consider that these are in fact two separate services: if the 'external' spot market backend terminates an EC2 instance, it simply becomes 'unhealthy' from an Auto Scaling point of view. This is sort of documented in Obtaining Information About the Instances Launched by Auto Scaling: