NAT – Advice regarding resilient internet access from within an Amazon VPC

amazon-vpc; amazon-web-services; nat; networking

We've had our NAT server fail 3 times thus far: once due to an issue with its EIP and twice due to its host being CPU bound. While I understand these things happen, we can't afford to keep taking the outages. How can we implement a redundant, resilient NAT solution for our VPC? For example, is it possible to use multiple NAT servers?

My VPC (Amazon Virtual Private Cloud) consists of 2 subnets, 1 public and 1 private. Instances in the private subnet route through a NAT server in the public subnet. From what I've read, you can only have 1 NAT server per VPC.

Best Answer

Given your updated question, you are presumably using the official Amazon Linux AMIs configured to run as NAT instances ('ami-vpc-nat') and set up according to NAT Instances? This is not required, but it provides a sound baseline for the stability you are after. Regarding your question:

Fortunately, AWS has recently announced Elastic Network Interfaces in the Virtual Private Cloud, which allow you to Create a Low Budget High Availability Solution (please refer to the Elastic Network Interfaces user guide for details):

If one of your instances [...] fails, its network interface can be attached to a replacement or hot standby instance pre-configured for the same role in order to rapidly recover the service. For example, you can use an ENI as your primary or secondary network interface to a critical service such as a database instance or a NAT instance. If the instance fails, you (or more likely, the code running on your behalf) can attach the ENI to a hot standby instance. Because the interface maintains its private IP address, elastic IP address, and MAC address, network traffic will begin flowing to the standby instance as soon as you attach the ENI to the replacement instance. [...] [emphasis mine].
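The hand-over described in the quote can be sketched roughly as follows. This is a minimal illustration, not AWS's reference procedure: it assumes a boto3-style EC2 client is passed in as `ec2`, that the NAT traffic uses a secondary ENI (a primary interface cannot be detached), and that the standby instance has device index 1 free. The function name `fail_over_eni` and its parameters are hypothetical, and detecting the failure in the first place is out of scope here.

```python
import time


def fail_over_eni(ec2, eni_id, standby_instance_id, device_index=1,
                  poll_seconds=1.0, max_polls=30):
    """Move an ENI to the standby instance and return the new attachment ID.

    `ec2` is assumed to behave like a boto3 EC2 client
    (e.g. boto3.client("ec2")); only three of its calls are used.
    """
    eni = ec2.describe_network_interfaces(
        NetworkInterfaceIds=[eni_id])["NetworkInterfaces"][0]
    attachment = eni.get("Attachment")

    # Nothing to do if the ENI already sits on the standby instance.
    if attachment and attachment.get("InstanceId") == standby_instance_id:
        return attachment["AttachmentId"]

    if attachment:
        # Force the detach, since the failed instance may not cooperate.
        ec2.detach_network_interface(
            AttachmentId=attachment["AttachmentId"], Force=True)

        # Wait until the ENI reports "available" before re-attaching.
        for _ in range(max_polls):
            eni = ec2.describe_network_interfaces(
                NetworkInterfaceIds=[eni_id])["NetworkInterfaces"][0]
            if eni["Status"] == "available":
                break
            time.sleep(poll_seconds)
        else:
            raise TimeoutError(f"{eni_id} did not detach in time")

    # The ENI keeps its private IP, EIP, and MAC address across the move.
    result = ec2.attach_network_interface(
        NetworkInterfaceId=eni_id,
        InstanceId=standby_instance_id,
        DeviceIndex=device_index)
    return result["AttachmentId"]
```

A monitoring script on the standby instance (or elsewhere) could call something like `fail_over_eni(boto3.client("ec2"), "eni-...", "i-...")` once its own health check of the active NAT instance fails; the IDs are placeholders. Passing the client in as a parameter keeps the function easy to exercise against a stub before pointing it at a real account.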

So you should be able to achieve your goal with a modest amount of automation code. Depending on how much you value redundancy and resiliency, you have two options: