High availability and AWS VPN

amazon-web-services, high-availability, site-to-site-vpn, vpn

I have a question about achieving high availability over an AWS VPN.

Context:

I have a requirement to establish a site-to-site VPN connection between my AWS VPC and a large corporate network. This VPN link is required to support inbound HTTP connections to an application in my AWS VPC.
For various reasons, the corporate I am connecting to will not allocate me a non-conflicting private (RFC 1918) subnet to use. Instead, they require me to use NAT to expose my services over the VPN on a public IP of my choice (one that I will reserve).
I believe I have managed to set this up successfully (subject to testing) with the right combination of routing and NAT rules by following this guide (without a separate proxy); however, I can only direct incoming connections to a single instance IP.
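
For illustration, here is a minimal sketch of the kind of DNAT rule involved, assuming the VPN terminates on a Linux instance that performs the translation (with source/destination checking disabled on that instance); 203.0.113.10 (the reserved public IP presented over the VPN) and 10.0.1.25 (the application instance's private IP) are placeholder addresses:

    # Enable forwarding on the VPN/NAT instance
    sysctl -w net.ipv4.ip_forward=1

    # Rewrite the destination of HTTP traffic arriving for the exposed public IP
    iptables -t nat -A PREROUTING -d 203.0.113.10 -p tcp --dport 80 \
        -j DNAT --to-destination 10.0.1.25:80

    # Make sure replies return via this instance
    iptables -t nat -A POSTROUTING -d 10.0.1.25 -p tcp --dport 80 -j MASQUERADE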

Question:

Is there a way I can direct connections to a highly available load balancer instead? That would give me both better scalability and high availability. Things I have considered:

  • Using an AWS external ELB:
    • These do not have stable IP addresses, nor do they resolve within a narrow enough range. I would not be able to add such a range to the remote side's routing rules, as it could overlap with any other service hosted on AWS.
  • Using an AWS internal ELB:
    • These have a public DNS record and a predictable range; however, they resolve to private IP ranges and so cannot be used by the client system, which is only allowed to create static routes to non-RFC 1918 public IPs.
  • Implementing my own load balancer such as HAProxy:
    • This would address the scalability issue, but would still leave a single point of failure in the system (the HAProxy node itself).
    • Additionally, this is one more machine I have to maintain.

Does anyone know if there is a way to reference an ELB in the VPC routing tables? Or have any other suggestions on how to achieve this?

Thanks

Best Answer

The machine terminating the tunnel is a single point of failure already, isn't it? If so, running HAProxy right there seems like the thing to do (and I'm not just saying so because that's the way I do it, even though it is).

I can count my production outages caused by haproxy on one hand without using any fingers (or thumb). Asynchronous DNS resolution in version 1.6 (still in development as of this writing) will let you use an internal ELB as a back-end to haproxy, so you can pretty much set it and forget it, while relying on the existing ELB/EC2 integrations for your actual capacity scaling.
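
As a rough illustration, here is a minimal haproxy.cfg excerpt using the 1.6 runtime DNS resolution; the internal ELB DNS name, the VPC resolver address (10.0.0.2), and the frontend/backend names are placeholders for your own values:

    defaults
        mode http
        timeout connect 5s
        timeout client  30s
        timeout server  30s

    # Use the VPC DNS resolver so the internal ELB's changing IPs are re-resolved at runtime
    resolvers vpcdns
        nameserver vpc 10.0.0.2:53
        hold valid 10s

    frontend vpn_http
        bind :80                       # traffic for the exposed public IP is NATed to this host
        default_backend app

    backend app
        # Placeholder internal ELB DNS name, re-resolved via the "resolvers" section above
        server elb internal-my-app-123456789.eu-west-1.elb.amazonaws.com:80 resolvers vpcdns check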

C3, C4, M3, R3, and T2 instance types also support the relatively new automatic instance recovery feature, which stops your instance and restarts it on different hardware, keeping the same instance ID, Elastic IP, and EBS volumes, if the underlying host fails its system status checks.
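
Recovery is armed with a CloudWatch alarm on the system status check whose action is the EC2 recover ARN. A sketch, with a placeholder instance ID and region:

    # Placeholder instance ID and region; requires a supported instance type
    aws cloudwatch put-metric-alarm \
        --alarm-name recover-vpn-instance \
        --namespace AWS/EC2 \
        --metric-name StatusCheckFailed_System \
        --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
        --statistic Minimum \
        --period 60 \
        --evaluation-periods 5 \
        --threshold 0 \
        --comparison-operator GreaterThanThreshold \
        --alarm-actions arn:aws:automate:eu-west-1:ec2:recover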