Should I worry about hanging sockets when setting keep-alive timeout to Infinity?

amazon-ec2, amazon-web-services, http, node.js, socket

Some initial context to this question. Currently I have an application cluster deployed behind an ALB which maintains persistent keep-alive connections with the application. This application is under continuous heavy load and must have very high uptime. The ALB has been sending back 502 Bad Gateway status codes for this service. Digging deeper, after taking a pcap capture and a sysdig capture on the affected instances, we see the following (ordered by sequence of events):

19:51:26.881806  10.23.34.195    10.23.67.39 HTTP    1068    POST /api HTTP/1.1  (application/json)
19:51:26.881838  10.23.67.39 10.23.34.195    TCP 66  80→52026 [ACK] Seq=7201 Ack=19033 Win=67072 Len=0 TSval=240987 TSecr=1566420
19:51:27.018305861 0 node (2989) > writev fd=120(<4t>10.23.34.195:52026->172.17.0.2:3000) size=400
19:51:27.018326  10.23.67.39 10.23.34.195    HTTP    466 HTTP/1.1 200 OK  (application/json)
19:51:27.018341806 0 node (2989) < writev res=400 data=HTTP/1.1 200 OK..Content-Type: application/json; charset=
19:51:27.018601  10.23.34.195    10.23.67.39 TCP 66  52026→80 [ACK] Seq=19033 Ack=7601 Win=47360 Len=0 TSval=1566454 TSecr=241021
19:51:32.042525  10.23.34.195    10.23.67.39 HTTP    1066    POST /api HTTP/1.1  (application/json)
19:51:32.042538  10.23.67.39 10.23.34.195    TCP 66  80→52026 [ACK] Seq=7601 Ack=20033 Win=69120 Len=0 TSval=242277 TSecr=1567710
19:51:32.066469320 0 node (2989) > close fd=120(<4t>10.23.34.195:52026->172.17.0.2:3000)
19:51:32.066470002 0 node (2989) < close res=0
19:51:32.066487  10.23.67.39 10.23.34.195    TCP 66  80→52026 [RST, ACK] Seq=7601 Ack=20033 Win=69120 Len=0 TSval=242283 TSecr=1567710

As stated above, it appears that our Node.js application reaches 5 seconds of inactivity on a keep-alive connection (the default keep-alive timeout period), then receives a network request, then closes the socket, and finally responds to the queued network request with an RST.

So it appears that the 502s are due to a race condition in which a new request is received from the load balancer just before or during the TCP teardown.
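For reference, the 5-second window in the trace matches the default keep-alive timeout on Node's http.Server (5000 ms, exposed as server.keepAliveTimeout since Node 8.0.0). A minimal sketch to confirm what a server is actually using — port 3000 matches the capture above, and the handler is just a stand-in:

// Inspect the keep-alive timeout the HTTP server applies to idle
// persistent connections (defaults to 5000 ms in Node 8+).
const http = require('http');

const server = http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end('{"ok":true}');
});

server.listen(3000, () => {
  // Once this many milliseconds pass with no activity on a kept-alive
  // socket, Node closes it -- the close()/RST seen in the capture above.
  console.log('keepAliveTimeout (ms):', server.keepAliveTimeout);
});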

The most apparent solution to this problem would be to ensure that the load balancer is the source of truth when tearing down these connections, i.e. that the load balancer's idle timeout is less than the timeout on the application server. This works with AWS Classic Load Balancers but not with ALBs, because according to their docs:

You can set an idle timeout value for both Application Load Balancers and Classic Load Balancers. The default value is 60 seconds. With an Application Load Balancer, the idle timeout value applies only to front-end connections.

http://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/how-elastic-load-balancing-works.html
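Following that principle with an ALB, the only knob left is on the application side: raise the Node keep-alive timeout above the ALB's 60-second front-end idle timeout so the ALB always closes idle connections first. A rough sketch, assuming Node >= 8.0.0 and an ALB idle timeout left at its default — ALB_IDLE_TIMEOUT_MS is a placeholder constant and the handler is a stand-in for the real application:

// Keep the Node side more patient than the ALB, so the ALB (60 s
// front-end idle timeout by default) always tears down idle keep-alive
// connections first instead of racing the application.
const http = require('http');

const ALB_IDLE_TIMEOUT_MS = 60 * 1000; // assumed ALB setting -- match your own

const server = http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end('{"ok":true}');
});

// Idle sockets stay open slightly longer than the ALB keeps them.
server.keepAliveTimeout = ALB_IDLE_TIMEOUT_MS + 5000;

// On Node versions that expose headersTimeout (>= 11.3.0), it should exceed
// keepAliveTimeout, otherwise a similar race can reappear while waiting for
// the next request's headers.
if ('headersTimeout' in server) {
  server.headersTimeout = server.keepAliveTimeout + 1000;
}

server.listen(3000);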

Could someone speculate as to why AWS may have removed the backend idle timeout (I'm assuming it is infinity)? I could set the keep-alive timeout on the Node server to Infinity as well, but should I worry about leaking sockets? Are there any other server technologies that handle this problem more gracefully which I could apply to fix this issue (without using Classic Load Balancers)?
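For completeness, "Infinity" in practice means setting the keep-alive timeout to 0, which disables the idle-close behaviour entirely. A sketch of that, plus a cheap way to keep an eye on possible socket leaks — the periodic getConnections() logging is only an illustrative addition, not something the application currently does:

// Disable the server-side keep-alive timeout (the closest thing to
// "Infinity") and periodically log the open-connection count so a socket
// leak would show up as a steadily growing number.
const http = require('http');

const server = http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end('{"ok":true}');
});

// 0 disables the keep-alive timeout: idle sockets are never closed by Node,
// only by the peer (the ALB) or by process shutdown.
server.keepAliveTimeout = 0;

server.listen(3000, () => {
  setInterval(() => {
    server.getConnections((err, count) => {
      if (!err) console.log('open sockets:', count);
    });
  }, 30 * 1000).unref(); // unref so this timer never keeps the process alive
});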

Also, AWS support states that they will not respect a Keep-Alive header sent back from the service.

Best Answer

We cannot know why they don't have it, nor the AWS design and implementation decisions that cause this behaviour. Only the people who work on this feature at Amazon know, and they are most likely under NDA.

Your only chance to get a valid answer for this is to ask AWS.