I've implemented three slightly different solutions for Nagios monitoring using Chef over the last 18 months. They're all based around Chef's template resource for generating configuration files using the ERB syntax and that bit has worked really well. You have a Ruby array or hash of hosts and services, and Nagios configuration files are generated. It's pretty easy to test and debug.
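The ERB-driven generation can be sketched in a few lines of plain Ruby. This is a minimal illustration, not code from a real cookbook; the host list and template text are invented:

```ruby
# Render Nagios host definitions from a Ruby array of hosts through ERB,
# the same mechanism Chef's template resource uses under the hood.
require 'erb'

hosts = [
  { name: 'web01', address: '10.0.0.11' },
  { name: 'db01',  address: '10.0.0.21' },
]

template = ERB.new(<<~'TPL')
  <% hosts.each do |host| %>
  define host {
      use        generic-host
      host_name  <%= host[:name] %>
      address    <%= host[:address] %>
  }
  <% end %>
TPL

config = template.result_with_hash(hosts: hosts)
puts config
```

Because the input is just a Ruby array or hash, you can unit test the rendered output without a Nagios or Chef server anywhere in sight, which is what makes this approach easy to debug.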
- Completely data bag-based configuration. In this case there are `nagios_hosts` and `nagios_services` data bags, and each host has a key that says which service checks get run, e.g. `check_load`, `check_disk`. This setup is quick to get going and works reasonably well, although if hosts are deleted or new ones added, someone has to be around to update the data bags. In practice it's easy to forget this, things get out of date, and that can lead to trouble.
- Chef attribute-based configuration. Here I used the Chef REST API to query one or more Chef servers, pull down lists of nodes, and assign service checks to them based on the roles they were assigned. Depending on Chef makes it difficult to monitor non-Chef systems, e.g. appliances, network devices, or nodes that don't run Chef for whatever reason. Chef also ends up sending a huge amount of JSON over the network for large numbers of nodes, and processing all that data puts load on the Chef server(s) as well as on the Nagios server when it generates its configuration files.
- Rails app generating Nagios configuration files. I eventually broke the Chef dependency by storing Nagios configuration information in a database and having a Rails app generate the configuration files. Each Nagios server makes a REST request and downloads its configuration files, which are generated using ERB and a MySQL database. It's quite a bit of work to get going, but so far it's working well for monitoring both Chef and non-Chef nodes.
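The core of the attribute-based approach (option 2) is a mapping from roles to service checks. Here is a hypothetical sketch; the role names, check names, and node hashes are invented, and real node data would come back as JSON from the Chef REST API:

```ruby
# Map Chef roles to the Nagios service checks they should receive.
ROLE_CHECKS = {
  'webserver' => %w[check_http check_load],
  'database'  => %w[check_mysql check_disk],
}.freeze

# A node gets the union of the checks for every role it holds.
def checks_for(node)
  node['roles'].flat_map { |role| ROLE_CHECKS.fetch(role, []) }.uniq
end

# Stand-ins for node records pulled from the Chef server.
nodes = [
  { 'name' => 'web01', 'roles' => ['webserver'] },
  { 'name' => 'db01',  'roles' => ['database', 'webserver'] },
]

nodes.each { |n| puts "#{n['name']}: #{checks_for(n).join(', ')}" }
```

In the Rails variant (option 3) the same idea applies, except the node and check rows come out of MySQL and each Nagios server's REST request selects only the rows assigned to it before rendering them through ERB.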
So after going through all of that, I would probably recommend something like option #2 for small deployments (tens to hundreds of nodes). I would try to keep it simple, though. I used Chef's attribute system to define and override thresholds for the service checks based on roles, and while it works, it's way too complicated and the cookbook has ended up becoming an unmaintainable mess.
Good luck!
Let's go over your questions.
> So basically I want my original instance to be running at all times. Then when it starts going over capacity I want the Auto Scaling Group to start launching instances and the Load Balancer to distribute the load across them. Is my thinking here sound?
I'd say yes, but I do have a couple of reservations. If I understand correctly, you've placed your "main" instance outside of the auto scaling group. Theoretically, that would ensure that auto scaling doesn't kill off your original instance. There are a couple of things I'd like to mention:
- You're not making full use of Auto Scaling. Auto Scaling not only enables your setup to scale, it can also enforce limits. If, for whatever reason, your "main" instance dies, your auto scaling policy won't come into action. If you keep your instance in an auto scaling group with a `min-size` of 1, Auto Scaling automatically replaces a failed instance.
- When auto scaling, it's often best practice to treat your instances as "disposable", because that's how you build resilient systems. Don't depend on one instance always being available.
- You can set the termination policy for your auto scaling group so that it always terminates the newest instances first. That would ensure your "main" instance will be kept (as long as it's healthy). My previous comment still applies though.
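The points above can be expressed in a CloudFormation fragment (resource names and sizes are placeholders): a `MinSize` of 1 means a failed "main" instance gets replaced, and the `NewestInstance` termination policy means scale-in removes the newest instances first.

```json
{
  "WebGroup": {
    "Type": "AWS::AutoScaling::AutoScalingGroup",
    "Properties": {
      "AvailabilityZones": { "Fn::GetAZs": "" },
      "LaunchConfigurationName": { "Ref": "WebLaunchConfig" },
      "LoadBalancerNames": [ { "Ref": "WebELB" } ],
      "MinSize": "1",
      "MaxSize": "4",
      "TerminationPolicies": [ "NewestInstance" ]
    }
  }
}
```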
> When I make code and data changes to my original instance, do I have to remake the image my Launch Configuration uses?
I'd say no, but that's more of a design issue. Your image should describe the state of your server, but it shouldn't be responsible for code distribution. Consider a situation where you have to update your application because of an urgent bug while your servers are under high load. Does updating your main server, creating an AMI, updating your launch configuration, and killing off your auto-scaled servers so they can be respawned with the latest AMI sound like an attractive solution? My answer to that would be no (again). Look into source code version control and deployment strategies (I'm a Rails developer 60% of the time and use `git` and `capistrano`, for instance).
There are situations where your approach would work as well, and there is a lot of middle ground here (I would also recommend looking into Chef and `userdata` scripts). I myself actually rarely use custom AMIs, thanks to Chef.
> What needs to be done with DNS names and IPs? I'm currently using Route 53, do I make that point to my Load Balancer and that's it?
Basically, yes. You can select the load balancer(s) that should be attached to new instances when creating your auto scaling group. I haven't used the GUI for Auto Scaling yet, but I'm quite sure it's in there somewhere. If not, the CLI still supports it. Point your Route 53 record to your ELB alias and that's basically it.
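For reference, the alias record looks roughly like this change batch for `aws route53 change-resource-record-sets` (the domain, DNS name, and zone ID are placeholders; note the `HostedZoneId` is the ELB's hosted zone, not your own):

```json
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.example.com.",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "ZELBEXAMPLE",
          "DNSName": "my-elb-1234567890.us-east-1.elb.amazonaws.com",
          "EvaluateTargetHealth": false
        }
      }
    }
  ]
}
```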
Response to additional questions (2014/02/23):
If you're developing using Rails, I can highly recommend Capistrano for deployments, which can take a specific version of your app in your preferred version control system (like git) and deploy it to a number of servers in a specific environment. There are a bunch of tutorials out there, but Ryan Bates' revised (and pro) Railscasts on the subject are very concise, although you need a subscription to his website to watch both of them. Most of the other tutorials will get you going as well though. Fire up a new server with your AMI and a launch script that pulls a temporary clone of your git repo and runs a local Capistrano command to get your app going. This ensures that, later on, you can also deploy new versions of your application using Capistrano with just one command to all running servers.
Capistrano also provides a couple of other benefits, including enabling you to execute specific tasks on all or just one of your servers and roll back a deployment, which is very hard to accomplish using just git.
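As a rough illustration of the setup described above (Capistrano v2-style settings; the application name, repository, and hostnames are all placeholders), a `config/deploy.rb` might start like:

```ruby
set :application, "myapp"
set :repository,  "git@github.com:example/myapp.git"
set :scm,         :git
set :deploy_to,   "/var/www/myapp"

role :web, "web01.example.com", "web02.example.com"
role :app, "web01.example.com", "web02.example.com"
role :db,  "db01.example.com", :primary => true
```

Once the roles list your servers, a single `cap deploy` pushes the chosen git revision to all of them, and `cap deploy:rollback` undoes it.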
Best Answer
We use a configuration management tool (Chef in our case) which writes out Nagios configuration from the node information.
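A rough sketch of such a recipe (the search query, template name, and service name are illustrative, not from our actual cookbook):

```ruby
# Collect the nodes to monitor via Chef search, then render the Nagios
# host definitions from an ERB template shipped in the cookbook.
monitored = search(:node, 'name:*')

template '/etc/nagios3/conf.d/chef_hosts.cfg' do
  source 'chef_hosts.cfg.erb'
  variables(nodes: monitored)
  notifies :reload, 'service[nagios3]'
end
```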