Auto-heal an EC2 instance with an Auto Scaling Group

Tags: amazon-ec2, amazon-web-services, autoscaling, aws-cli

I'm trying to set up an auto-healing EC2 instance using an Auto Scaling Group and a user-data startup script. If the current server becomes unreachable, the instance should be terminated and a new one should take its place. This is easy enough, but one requirement is proving difficult.

I need the replacement server to have the same private IP as the previous server. My thought is to have a secondary private IP (this is within a VPC) assigned to the original server, and then re-assign it to the new server.

  1. I assume I can use the aws-cli, installed during the user-data startup script, to re-assign the private IP. But how do I know which server is being replaced so that I can re-assign the IP from it (for example, if the pool of servers grows in the future and 2 happen to go down at the same time)?

  2. If the original server is being terminated, am I going to be able to re-assign the private IP at all?

Best Answer

After a lot of research and trial and error, here's what we ended up doing:

  1. We put every server type into its own Auto Scaling Group with a user-data shell script to fully set up the servers.
  2. We utilized tags to keep track of resources on servers that we wanted to transfer (private IP, elastic IP, EBS, etc).
  3. On startup of a replacement instance, our user-data script uses the AWS CLI to find a terminated instance of the same type that still has the resource tags on it.
  4. We use the data in those tags and the AWS CLI to re-assign those resources to the replacement server and then remove the tags from the old server.
  5. Because this is all in a VPC, the private IP is still available to us: it gets released back to the VPC upon instance termination.
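
For illustration, steps 3 and 4 might look roughly like the sketch below. This is not our exact script: the tag names `Role` and `TransferPrivateIp` and the function names are made up for the example, and it only handles the private IP. It also does nothing to stop two replacements that boot at the same moment from picking the same terminated instance, so treat it as a starting point.

```shell
#!/bin/sh
# Sketch only. The tag names "Role" and "TransferPrivateIp" are hypothetical;
# substitute whatever tags you use to track transferable resources.

# Print "<instance-id><TAB><private-ip>" for one terminated instance of the
# given role that still carries the TransferPrivateIp tag (empty if none).
find_donor() {
  role="$1"
  aws ec2 describe-instances \
    --filters "Name=instance-state-name,Values=terminated" \
      "Name=tag:Role,Values=${role}" \
      "Name=tag-key,Values=TransferPrivateIp" \
    --query "Reservations[].Instances[].[InstanceId, Tags[?Key=='TransferPrivateIp'] | [0].Value]" \
    --output text | head -n 1
}

# Move the tagged private IP from a terminated instance to this one.
# Arguments: role, ID of the new instance, ID of its network interface.
claim_private_ip() {
  role="$1"
  new_instance_id="$2"
  eni_id="$3"

  donor="$(find_donor "$role")"
  if [ -z "$donor" ]; then
    echo "no terminated ${role} instance with a TransferPrivateIp tag" >&2
    return 1
  fi

  # Text output is tab-separated: instance ID first, tag value second.
  donor_id="$(printf '%s\n' "$donor" | cut -f1)"
  donor_ip="$(printf '%s\n' "$donor" | cut -f2)"

  # Attach the address as a secondary private IP on the new instance's ENI.
  aws ec2 assign-private-ip-addresses \
    --network-interface-id "$eni_id" \
    --private-ip-addresses "$donor_ip" || return 1

  # Remove the tag from the old instance so it is not picked up again.
  aws ec2 delete-tags --resources "$donor_id" \
    --tags Key=TransferPrivateIp || return 1

  # Assumption: tag the new instance so a future replacement can find the IP.
  aws ec2 create-tags --resources "$new_instance_id" \
    --tags "Key=TransferPrivateIp,Value=${donor_ip}" || return 1

  echo "$donor_ip"
}
```

In the user-data script this would be called as something like `claim_private_ip web "$INSTANCE_ID" "$ENI_ID"`, where the instance ID can be read from the instance metadata endpoint (`http://169.254.169.254/latest/meta-data/instance-id`) and the ENI ID from `describe-instances` for that instance. The instance role also needs permission for the four EC2 calls used here.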

We've had this running for a few days now and it seems to be working quite well, though so far we have only tested it by terminating instances ourselves. It remains to be seen how it behaves when an instance actually fails on its own.