Auto-heal an EC2 instance with an Auto Scaling Group

amazon-ec2 amazon-web-services autoscaling aws-cli

I'm trying to set up an auto-healing EC2 instance using an Auto Scaling Group and a user-data startup script. If the current server becomes unreachable, the instance should terminate and a new one should take its place. This is easy enough, but one requirement is proving difficult.

I need the replacement server to have the same private IP as the previous server. My thought is to have a secondary private IP (this is within a VPC) assigned to the original server, and then re-assign it to the new server.

I assume I can use the aws-cli, installed during the user-data startup script, to re-assign the private IP, but how do I know which server is being replaced so that I can re-assign the IP from it? For example, the pool of servers may be larger in the future, and two of them might go down at the same time.
If the original server is being terminated, am I going to be able to re-assign the private IP at all?

Best Answer

After a lot of research and trial/error, here's what we ended up doing:

We put every server type into its own Auto Scaling Group with a user-data shell script that fully sets up the servers.
We used tags to keep track of the resources on each server that we wanted to transfer (private IP, Elastic IP, EBS volumes, etc.).
On startup of a replacement instance, our user-data script uses the AWS CLI to find a terminated instance of the same type that still has the resource tags on it.
We use the data in those tags and the AWS CLI to re-assign those resources to the replacement server, and then remove the tags from the old server.
Because this is all in a VPC, the private IP is still available to us: it is released back to the VPC when the instance terminates.
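
As a rough illustration of the steps above (not the actual script from this setup), the private-IP part might look something like the following. The tag keys `ServerType` and `TransferPrivateIp` are hypothetical names, since the answer does not say which tags were used, and the sketch assumes the instance has an IAM role that allows these EC2 calls and a default region configured for the CLI.

```shell
#!/usr/bin/env bash
# Sketch only. The tag keys "ServerType" and "TransferPrivateIp" are
# hypothetical; the original answer does not name its tags.

# Print the ID of one terminated instance of the given type that still
# carries the transfer tag (the CLI prints "None" if nothing matches).
find_donor_instance() {
  local server_type="$1"
  aws ec2 describe-instances \
    --filters "Name=instance-state-name,Values=terminated" \
              "Name=tag:ServerType,Values=${server_type}" \
              "Name=tag-key,Values=TransferPrivateIp" \
    --query 'Reservations[].Instances[].InstanceId | [0]' \
    --output text
}

# Print the private IP recorded in the donor instance's tag.
read_tagged_ip() {
  local donor_id="$1"
  aws ec2 describe-tags \
    --filters "Name=resource-id,Values=${donor_id}" \
              "Name=key,Values=TransferPrivateIp" \
    --query 'Tags[0].Value' \
    --output text
}

# Assign the IP to the given network interface as a secondary address and,
# only if that succeeds, remove the tag from the donor so that another
# replacement instance does not pick the same one.
claim_private_ip() {
  local donor_id="$1" eni_id="$2" ip="$3"
  aws ec2 assign-private-ip-addresses \
    --network-interface-id "${eni_id}" \
    --private-ip-addresses "${ip}" \
    --allow-reassignment \
  && aws ec2 delete-tags --resources "${donor_id}" --tags "Key=TransferPrivateIp"
}
```

In a user-data script, the replacement instance would first need its own network interface ID (for example via the instance metadata service or `aws ec2 describe-instances`), then call `find_donor_instance`, `read_tagged_ip`, and `claim_private_ip` in turn. Note that finding a donor and removing its tag are separate calls, so two replacements starting at the same moment could in principle pick the same donor; the sketch does not guard against that.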

We've had this running for a few days now and it seems to be working quite well (though that remains to be seen when an instance actually fails for something other than us terminating it directly for testing).

Related Solutions

Java – Approach to auto-scaling with Amazon AWS

The problem is that your Tomcat servers (and most likely your workers) don't know about the new RabbitMQ server. You need to do one of two things in this scenario: (a) tell them about the new server, or (b) make it so that they don't care.

For (a) above, you could notify each Tomcat server and worker when your new RabbitMQ server starts, or put the info in some list that your other components reference.

However, in this scenario, assuming you have a queue on RabbitMQ #1, what happens to that queue if you start RabbitMQ #2? You'll actually have 2 queues in this case, not a single queue spanning 2 servers. Does your application handle this?

For (b) above, you can take a look at RabbitMQ clustering. My understanding is that with RabbitMQ clustering, nodes can come and go, and the clients shouldn't care.

AWS auto scaling setup bootstrap script and ssh access

The CentOS AMI does not include the cloud-init service by default (some Ubuntu and Debian AMIs do). You need to install it on your AMI and enable the service at boot:

chkconfig cloud-init on

Update the configuration file, /etc/cloud/cloud.cfg, as needed. Then create a new AMI from the modified instance. The easiest way I've found to test the bootstrap script is to start a micro instance from this AMI, specifying the --user-data-file option.
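
For a first test, a trivial user-data script that only leaves a marker behind can confirm that cloud-init is actually running scripts on the new AMI. This is a minimal sketch; the `MARKER` path is a hypothetical choice for the example.

```shell
#!/bin/bash
# Minimal user-data script for checking that cloud-init runs it on first boot.
# MARKER is a hypothetical path chosen for this example.
MARKER="${MARKER:-/tmp/user-data-ran}"
echo "user-data ran at $(date -u +%Y-%m-%dT%H:%M:%SZ)" > "${MARKER}"
```

After the instance boots, logging in and checking for that file shows whether the script ran; if it did not, the cloud-init log (typically /var/log/cloud-init.log) is the first place to look.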
