We've had our NAT server fail 3 times thus far: once due to an issue with its EIP and twice due to its host being CPU bound. While I understand these things happen, we can't afford to keep taking the outages. How can we implement a redundant, resilient NAT solution for our VPC? For example, is it possible to utilize multiple NAT servers?
My VPC (Amazon Virtual Private Cloud) consists of 2 subnets, 1 public and 1 private. Instances in the private subnet route through a NAT server in the public subnet. From what I've read, you can only have 1 NAT server per VPC.
Best Answer
Given your updated question, you are presumably using the official Amazon Linux AMIs configured to run as NAT instances ('ami-vpc-nat') and set up according to NAT Instances? This is not strictly required, but it provides a sound baseline for achieving the desired stability. Regarding your question:
Fortunately AWS has recently announced Elastic Network Interfaces in the Virtual Private Cloud, which allows you to Create a Low Budget High Availability Solution (please refer to the Elastic Network Interfaces user guide for details):
So you should be able to achieve your goal with a modest amount of automation code - depending on how much you value redundancy/resiliency you have two options:
You could implement this scenario via an Auto Scaling Policy that maintains your NAT instance scaling level at 1 by means of an appropriate Health Check.
Update: This wouldn't currently work as per sborsje's comment, thanks for the insight!
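To make the "modest amount of automation code" for the Elastic Network Interface approach more concrete, here is a minimal sketch of the failover step: moving the NAT's ENI from a failed instance to a standby one. It is only an illustration, not the setup from the linked guide. It assumes `ec2` is an object exposing the boto3 EC2 client methods `describe_network_interfaces`, `detach_network_interface` and `attach_network_interface`; the function name `move_eni`, the device index, and the polling parameters are my own placeholders.

```python
import time


def move_eni(ec2, eni_id, standby_instance_id,
             device_index=1, poll_seconds=2, max_polls=30):
    """Move a network interface to the standby NAT instance.

    `ec2` is assumed to behave like boto3.client("ec2").
    Returns the attachment ID of the ENI on the standby instance.
    """
    def describe():
        resp = ec2.describe_network_interfaces(NetworkInterfaceIds=[eni_id])
        return resp["NetworkInterfaces"][0]

    eni = describe()
    attachment = eni.get("Attachment")

    # Nothing to do if the ENI already sits on the standby instance.
    if attachment and attachment.get("InstanceId") == standby_instance_id:
        return attachment["AttachmentId"]

    if attachment:
        # Force the detach, since the failed instance may not respond.
        ec2.detach_network_interface(
            AttachmentId=attachment["AttachmentId"], Force=True)

        # Wait until the interface is reported as free again.
        for _ in range(max_polls):
            if describe()["Status"] == "available":
                break
            time.sleep(poll_seconds)
        else:
            raise TimeoutError("ENI %s did not become available" % eni_id)

    resp = ec2.attach_network_interface(
        NetworkInterfaceId=eni_id,
        InstanceId=standby_instance_id,
        DeviceIndex=device_index)
    return resp["AttachmentId"]
```

With boto3 installed, this could be called as `move_eni(boto3.client("ec2"), "eni-...", "i-...")`, where both IDs are placeholders for your own resources. The health check that decides when to trigger the move is left out here, and the standby instance still needs to be configured as a NAT instance itself (including a disabled source/destination check).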