How to deploy a scalable, reliable haproxy cluster on Amazon EC2

amazon ec2, amazon-web-services, haproxy, heartbeat, load balancing

We need some more advanced functionality than ELB provides (mostly L7 inspection), but it's not obvious how to handle things like heartbeat and high availability with something like haproxy using EC2. There's a high likelihood we'd need 3 or more haproxy nodes in the cluster, so simple heartbeat between two nodes isn't going to work.

Seems like having a heartbeat'd layer in front of the haproxy nodes would be the way to go, possibly using IPVS, but handling the configuration changes as the EC2 cluster changes (either via intentional changes, like expansion, or unintentional, like losing an EC2 node) seems non-trivial.

Preferably the solution would span at least two Availability Zones.

In answer to Qs: No, sessions aren't sticky. And yes, we'll need SSL, but that could in theory be handled by another setup entirely – we're able to direct SSL traffic to a different location than non-SSL traffic.

Best Answer

OK, I've never built an AWS load balancing solution with traffic on the levels of SmugMug myself, but just thinking of theory and AWS's services, a couple of ideas come to mind.

The original question is missing a few things that tend to impact the load balancing design:

Sticky sessions or not? It is strongly preferable not to use sticky sessions, and to let all load balancers (LBs) use round robin (RR) or random backend selection. RR and random backend selection are simple, scalable, and provide even load distribution in all circumstances.
SSL or not? Whether SSL is in use, and for what percentage of requests, generally has an impact on the load balancing design. It is often preferable to terminate SSL as early as possible, to simplify certificate handling and keep the SSL CPU load away from the web application servers.
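As a small illustration of why round robin spreads load evenly when sessions are not sticky, here is a minimal Python sketch. The backend names are hypothetical, and real load balancers implement this internally; the snippet only shows the selection pattern.

```python
from collections import Counter
from itertools import cycle, islice
from typing import List


def round_robin(backends: List[str], n_requests: int) -> List[str]:
    """Return the backend chosen for each of n_requests, in order."""
    # cycle() repeats the backend list endlessly; islice() takes the
    # first n_requests picks from that endless sequence.
    return list(islice(cycle(backends), n_requests))


# Hypothetical backend names, used for illustration only.
picks = round_robin(["app1", "app2", "app3"], 9)
per_backend = Counter(picks)  # how many requests each backend received
```

Because no request depends on where the previous one went, any load balancer in the layer can serve any request.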

I'm answering from the perspective of how to keep the load balancing layer itself highly available. Keeping the application servers HA is just done with the health checks built into your L7 load balancers.

OK, a couple of ideas that should work:

1) "The AWS way":

First layer, at the very front, use ELB in L4 (TCP/IP) mode.
Second layer, use EC2 instances with your L7 load balancer of choice (nginx, HAProxy, Apache etc).

Benefits/idea: The L7 load balancers can be fairly simple EC2 instances, all cloned from the same AMI and using the same configuration. Thus Amazon's tools can handle all HA needs: ELB monitors the L7 load balancers. If an L7 LB dies or becomes unresponsive, ELB and CloudWatch (together with Auto Scaling) spawn a new instance automatically and bring it into the ELB pool.
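Since every L7 instance in this design runs the same configuration, that configuration could be generated from a single list of application servers. The following is only a sketch, assuming HAProxy as the L7 load balancer; the backend names, addresses, ports, timeouts and the `/health` path are all hypothetical placeholders, not values from the question.

```python
from typing import Dict


def render_haproxy_config(backends: Dict[str, str], health_path: str = "/health") -> str:
    """Render one HAProxy config that every cloned L7 instance can share."""
    lines = [
        "defaults",
        "    mode http",
        "    timeout connect 5s",
        "    timeout client 30s",
        "    timeout server 30s",
        "",
        "frontend www",
        "    bind *:80",
        "    default_backend app",
        "",
        "backend app",
        "    balance roundrobin",  # no sticky sessions, plain round robin
        f"    option httpchk GET {health_path}",  # L7 health check of app servers
    ]
    # Sort so every instance renders a byte-identical file from the same input.
    for name, address in sorted(backends.items()):
        lines.append(f"    server {name} {address} check")
    return "\n".join(lines) + "\n"


# Hypothetical application servers in two Availability Zones.
config_text = render_haproxy_config({
    "app1": "10.0.1.10:8080",
    "app2": "10.0.2.10:8080",
})
```

Baking the rendered file into the AMI, or rendering it at boot, keeps the instances interchangeable, which is what lets ELB treat them as a uniform pool.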

2) "The DNS round robin with monitoring way":

Use basic DNS round robin to get a coarse-grained load distribution out over a couple of IP addresses. Let's just say you publish 3 IP addresses for your site.
Each of these 3 IPs is an AWS Elastic IP address (EIP), bound to an EC2 instance running an L7 load balancer of your choice.
If an EC2 L7 LB dies, a compliant user agent (browser) should just use one of the other IPs instead.
Set up an external monitoring server. Monitor each of the 3 EIPs. If one becomes unresponsive, use AWS's command line tools and some scripting to move the EIP over to another EC2 instance.

Benefits/idea: Compliant user agents should automatically switch over to another IP address if one becomes unresponsive. Thus, in the case of a failure, only 1/3 of your users should be impacted, and most of these shouldn't notice anything since their UA silently fails over to another IP. And your external monitoring box will notice that an EIP is unresponsive, and rectify the situation within a couple of minutes.
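The external monitor described above could look roughly like the sketch below. It only decides which EIP should move to which standby instance; the actual reassignment (for example via the EC2 AssociateAddress API action or the AWS command line tools) is left to the caller. The IP addresses, instance IDs and the idea of a pool of standby instances are assumptions for illustration.

```python
import socket
from typing import Callable, Dict, List, Tuple


def tcp_alive(ip: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to ip:port succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def plan_failover(
    assignments: Dict[str, str],
    spares: List[str],
    probe: Callable[[str], bool] = tcp_alive,
) -> List[Tuple[str, str]]:
    """Return (eip, new_instance_id) moves for every unresponsive EIP.

    assignments maps each EIP to the instance currently holding it;
    spares lists standby instance IDs (a hypothetical pool).
    """
    available = list(spares)  # copy so the caller's list is not modified
    moves: List[Tuple[str, str]] = []
    for eip in assignments:
        if probe(eip):
            continue  # this EIP still answers, nothing to do
        if not available:
            break  # no standby instance left to take over
        moves.append((eip, available.pop(0)))
    return moves
```

A cron job or small loop on the monitoring box would call `plan_failover` every minute or so and then perform each returned move with the AWS tooling of choice.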

3) DNS RR to pairs of HA servers:

Basically this is Don's own suggestion of simple heartbeat between a pair of servers, but simplified for multiple IP addresses.

Using DNS RR, publish a number of IP addresses for the service. Following the example above, let's just say you publish 3 IPs.
Each of these IPs goes to a pair of EC2 servers, so 6 EC2 instances in total.
Each of these pairs uses Heartbeat or another HA solution together with AWS tools to keep 1 IP address live, in an active/passive configuration.
Each EC2 instance has your L7 load balancer of choice installed.

Benefits/idea: In AWS' completely virtualized environment it's actually not that easy to reason about L4 services and failover modes. By simplifying to one pair of identical servers keeping just 1 IP address alive, it gets simpler to reason about and test.
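To make the active/passive behaviour of one pair easier to reason about and test, the passive node's takeover decision can be modelled as a tiny state machine. This is not how Heartbeat itself is implemented; it is a simplified sketch, and the threshold of three missed heartbeats is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass
class PassiveNode:
    """Passive half of one pair; takes over after max_missed silent intervals."""

    max_missed: int = 3   # assumed threshold, tune for your environment
    missed: int = 0       # consecutive intervals without a heartbeat
    active: bool = False  # True once this node has taken over the IP

    def observe(self, heartbeat_received: bool) -> bool:
        """Record one interval; return True exactly when takeover should start."""
        if self.active:
            return False  # already holding the IP, nothing more to trigger
        if heartbeat_received:
            self.missed = 0  # peer is alive, reset the counter
            return False
        self.missed += 1
        if self.missed >= self.max_missed:
            self.active = True  # caller should now move the IP to this node
            return True
        return False
```

When `observe` returns `True`, the node would claim the pair's IP address using the AWS tools. Requiring several consecutive misses avoids a takeover on a single lost heartbeat.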

Conclusion: Again, I haven't actually tried any of this in production. Just from my gut feeling, option one with ELB in L4 mode, and self-managed EC2 instances as L7 LBs seems most aligned with the spirit of the AWS platform, and where Amazon is most likely to invest and expand later on. This would probably be my first choice.

Related Solutions

Amazon S3 – How to Get the Size of an Amazon S3 Bucket

The AWS CLI now supports the --query parameter, which takes a JMESPath expression.

This means you can sum the size values given by list-objects using sum(Contents[].Size) and count like length(Contents[]).

This can be run using the official AWS CLI as below; the feature was introduced in Feb 2014:

 aws s3api list-objects --bucket BUCKETNAME --output json --query "[sum(Contents[].Size), length(Contents[])]"
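For readers unfamiliar with JMESPath, the query above computes the same pair of values as the Python sketch below, applied to a dict shaped like the JSON that list-objects returns. The sample keys and sizes are made up for illustration.

```python
from typing import Any, Dict, List


def bucket_size_and_count(response: Dict[str, Any]) -> List[int]:
    """Mirror of [sum(Contents[].Size), length(Contents[])] in plain Python."""
    # An empty bucket has no "Contents" key, so fall back to an empty list.
    contents = response.get("Contents") or []
    total_bytes = sum(obj["Size"] for obj in contents)
    return [total_bytes, len(contents)]


# Made-up sample in the shape of a list-objects JSON response.
sample = {
    "Contents": [
        {"Key": "a.txt", "Size": 100},
        {"Key": "b.txt", "Size": 250},
    ]
}
size_and_count = bucket_size_and_count(sample)
```

The first element is the total size in bytes and the second is the number of objects.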

Nginx – How many reverse proxies (nginx, haproxy) is too many

From a purely performance perspective, let benchmarking make these decisions for you rather than assuming -- using a tool like httperf is invaluable when making architecture changes.

From an architectural philosophy perspective, I'm a little curious why you have both nginx and apache on the application servers. Nginx blazes at static content and efficiently handles most backend frameworks/technologies (Rails, PHP via FastCGI, etc), so I would drop the final Apache layer. Once again, this comes from a limited understanding of the technologies that you're using, so you may have a need for it that I'm not anticipating (but if that's the case, you could always drop nginx on the app servers and just use apache -- it's not THAT bad at static content when configured properly).

Currently, I use nginx -> haproxy on load balancing servers and nginx on the app servers with much success. As Willy Tarreau stated, nginx and haproxy are a very fast combination, so I wouldn't worry about the speed of having both on the front-end, but keep in mind that adding additional layers increases complexity as well as the number of points of failure.
