EC2 VPC: intermittent outbound connection timeouts

amazon-ec2, amazon-elb, amazon-vpc, amazon-web-services

My production web service consists of:

  • Auto-scaling group
  • Network Load Balancer (ELB)
  • 2x EC2 instances as web servers

This configuration ran fine until yesterday, when one of the EC2 instances began experiencing RDS and ElastiCache timeouts. The other instance continues to run without issues.

During investigation, I noticed that outgoing connections in general sometimes experience large delays:

[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null

real    0m7.147s -- 7 seconds
user    0m0.007s
sys     0m0.000s
[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null

real    0m3.114s
user    0m0.007s
sys     0m0.000s
[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null

real    0m0.051s
user    0m0.006s
sys     0m0.000s
[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null

real    1m6.309s -- over a minute!
user    0m0.009s
sys     0m0.000s
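One way to see where the time goes (DNS lookup versus the TCP handshake) is curl's -w timing variables; a quick sketch against the same target:

[ec2-user@ip-10-0-5-9 logs]$ curl -so /dev/null \
    -w 'dns=%{time_namelookup} tcp=%{time_connect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
    http://www.google.com
# If 'tcp' dominates, the delay is in the SYN/SYN-ACK exchange (packet loss on
# the outbound path); if 'dns' dominates, the resolver path is the suspect.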

[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.7.196), 1 hops max, 60 byte packets
 1  * * *
[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.7.196), 1 hops max, 60 byte packets
 1  216.182.226.174  17.706 ms * *
[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.8.4), 1 hops max, 60 byte packets
 1  216.182.226.174  20.364 ms * *
[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.7.132), 1 hops max, 60 byte packets
 1  216.182.226.170  12.680 ms  12.671 ms *
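Single traceroute probes that show * * * can simply be hops that don't answer ICMP; a longer sample gives per-hop loss percentages. A sketch using mtr (installing it first if needed):

[ec2-user@ip-10-0-5-9 logs]$ sudo yum install -y mtr
[ec2-user@ip-10-0-5-9 logs]$ mtr -n -c 100 --report www.google.com
# 100 probes per hop; loss that persists through to the final hop (not just at
# intermediate hops, which often rate-limit ICMP) indicates real packet loss.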

Further analysis shows that if I manually detach the 'bad' instance from the auto-scaling group, removing it as a load balancer target, the problem instantly goes away. As soon as I add it back, the problem returns.
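For reference, this detach/reattach cycle can be done without terminating the instance by moving it to standby in the Auto Scaling group; a sketch with placeholder names (my-asg and the instance ID are examples):

[ec2-user@ip-10-0-5-9 logs]$ aws autoscaling enter-standby \
    --auto-scaling-group-name my-asg \
    --instance-ids i-0123456789abcdef0 \
    --should-decrement-desired-capacity
# ...and later, to put it back in service behind the load balancer:
[ec2-user@ip-10-0-5-9 logs]$ aws autoscaling exit-standby \
    --auto-scaling-group-name my-asg \
    --instance-ids i-0123456789abcdef0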

These nodes are m5.xlarge and appear to have excess capacity, so I don't believe it's a resource issue.

UPDATE: It seems related to load on the node. I put load back on last night and it seemed stable, but this morning, as load grew, outbound traffic (DB, etc.) started to fail. I'm really stuck; I don't understand how this outbound traffic is being impacted at all. The other identical node has no issues, even with 100% of the traffic versus 50%.

traceroute to 54.14.xx.xx (54.14.xx.xx), 1 hops max, 60 byte packets
 1  216.182.226.174  18.691 ms 216.182.226.166  18.341 ms 216.182.226.174  18.660 ms
traceroute to 54.14.xx.xx (54.14.xx.xx), 1 hops max, 60 byte packets
 1  * * *

What is the 216.182.226.166 IP? Is it related to the VPC IGW?

Node stats:

  • m5.xlarge
  • CPU ~ 7.5%
  • load average: 0.18, 0.29, 0.29
  • Network in: ~8 MB/minute
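Those OS-level stats won't show instance-level network allowances being hit (PPS, bandwidth, or connection-tracking limits enforced at the hypervisor). On Nitro instances like m5, recent ENA drivers expose drop counters for these; this may not be available on the 4.14.114 kernel mentioned below, so treat it as an assumption to verify:

[ec2-user@ip-10-0-5-9 logs]$ ethtool -S eth0 | grep -i exceeded
# Non-zero, growing values for counters such as pps_allowance_exceeded or
# conntrack_allowance_exceeded mean the instance is hitting a network
# allowance even though CPU and load average look idle.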

UPDATE: With 1 of the 2 nodes attached to the load balancer, things appear to run stable, with all traffic on one node. After I add the 2nd node to the load balancer, after some period of time (hours to days), one of the nodes starts to exhibit the outbound connection issues described above (connections timing out to the database, Google, etc.). In this state, the other node works fine. Replacing the 'bad' node, or removing and reinstating it in the load balancer, allows things to run fine for a while. These images use Amazon Linux 2 (4.14.114-103.97.amzn2.x86_64).
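Given that the failures build up over hours only while the node is taking traffic behind the load balancer, one hypothesis worth checking is connection/state-table growth on the instance itself; a quick sketch (the nf_conntrack files only exist if the conntrack module is loaded):

[ec2-user@ip-10-0-5-9 logs]$ ss -s                      # socket summary, incl. TIME-WAIT
[ec2-user@ip-10-0-5-9 logs]$ cat /proc/sys/net/netfilter/nf_conntrack_count 2>/dev/null
[ec2-user@ip-10-0-5-9 logs]$ cat /proc/sys/net/netfilter/nf_conntrack_max 2>/dev/null
[ec2-user@ip-10-0-5-9 logs]$ netstat -s | grep -iE 'retrans|listen'
# If nf_conntrack_count approaches nf_conntrack_max, new flows are silently
# dropped, which matches connections timing out to arbitrary hosts.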

Best Answer

It is possible you are using a NAT gateway or NAT instance to reach the internet. If not, you may need to share more detail about your architecture; you could also be using Direct Connect and routing internet traffic via your on-premises network.
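To confirm how the subnet reaches the internet, the route table's default route shows whether traffic goes through an internet gateway (igw-...) or a NAT gateway (nat-...); a sketch with a placeholder VPC ID:

aws ec2 describe-route-tables \
    --filters Name=vpc-id,Values=vpc-0123456789abcdef0 \
    --query 'RouteTables[].Routes[?DestinationCidrBlock==`0.0.0.0/0`]'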

Please read these resources on connection limits and ephemeral ports for inbound connections:

https://docs.aws.amazon.com/vpc/latest/userguide/vpc-recommended-nacl-rules.html
https://aws.amazon.com/premiumsupport/knowledge-center/resolve-connection-nat-instance/
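As a quick check against the ephemeral-port angle those links cover, compare the instance's local port range with how many ports are actually tied up; a sketch:

cat /proc/sys/net/ipv4/ip_local_port_range   # e.g. "32768 60999" => ~28k usable ports
ss -tan state time-wait | wc -l              # sockets held in TIME-WAIT
# If TIME-WAIT sockets approach the size of the port range, new outbound
# connections stall until ports are recycled.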
