AWS Application ELB sometimes responds 502 without calling the backend

amazon-elbamazon-web-services

We have the following setup: An AWS Application ELB has a target group of 4 EC2 instances. On each of the EC2 instances, there is an Apache2 which forwards to a Tomcat.

It works brilliantly. Except sometimes, when the ELB seemingly refuses to speak to a healthy instance and responds 502 instead. This is what it looks like in the ELB access log:

https 2018-03-25T09:36:13.797244Z app/XXXXXX 178.162.220.236:50254 172.31.24.81:1443 0.001 0.031 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76d8d-2f54c360354a57532ccc55ef" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:14.169090Z app/XXXXXX 178.162.220.236:50270 172.31.22.59:1443 0.001 0.096 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76d8e-748fc1e34bee608baa1efa81" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:18.785982Z app/XXXXXX 178.162.220.236:50330 172.31.19.151:1443 0.001 0.034 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76d92-34ac7ac52b69c34fdfdbce33" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:28.457341Z app/XXXXXX 178.162.220.236:50808 172.31.28.152:1443 0.001 0.075 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76d9c-66b8f0a5c6be9ff6c098dfe4" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:30.615879Z app/XXXXXX 178.162.220.236:50900 172.31.24.81:1443 0.001 0.032 0.000 204 204 314 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76d9e-d1d310c5d55ad09295bb08b9" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:30.955285Z app/XXXXXX 178.162.220.236:50914 172.31.19.151:1443 0.001 0.091 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76d9e-079416df92b2cd27865c8b93" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:36.843389Z app/XXXXXX 178.162.220.236:51090 172.31.28.152:1443 0.001 0.081 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76da4-917b032b1fb8c1c0dc914354" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:40.432293Z app/XXXXXX 178.162.220.236:51184 172.31.24.81:1443 0.002 0.076 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76da8-1ee52e1fa688cc81071fc8d1" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:41.398389Z app/XXXXXX 178.162.220.236:51208 172.31.19.151:1443 0.001 0.033 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76da9-2ee889d47ead389675818ea6" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:41.775607Z app/XXXXXX 178.162.220.236:51216 172.31.28.152:1443 0.000 0.000 -1 502 - 314 293 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76da9-fad6ea5fa8cc956c24d65206" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:46.954552Z app/XXXXXX 178.162.220.236:50912 172.31.22.59:1443 0.001 16.118 0.000 204 204 293 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76d9e-9c0827e71b334dd908000419" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:50.750077Z app/XXXXXX 178.162.220.236:51188 172.31.22.59:1443 0.001 10.096 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76da8-f99e3281d1e0e2a73c7c4ba9" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:52.336104Z app/XXXXXX 178.162.220.236:51510 172.31.24.81:1443 0.001 0.115 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76db4-2640adae88a8639f196613e9" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:55.563397Z app/XXXXXX 178.162.220.236:51604 172.31.22.59:1443 0.000 0.104 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76db7-ecff4da4c9414383ce128eae" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:58.824749Z app/XXXXXX 178.162.220.236:51682 172.31.19.151:1443 0.001 0.034 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76dba-425a808b5efc4c3cb4cb7f53" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:37:09.004924Z app/XXXXXX 178.162.220.236:51994 172.31.28.152:1443 0.001 0.135 0.000 204 204 655 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76dc4-5d8dce64e6fcb2aa069e9018" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0
https 2018-03-25T09:36:11.146813Z app/XXXXXX 178.162.220.236:50212 172.31.24.81:1443 0.001 0.031 0.000 204 204 374 210 "POST https://XXXXXX.com:443/api/event?apikey=XXXXXX HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-central-1:709725828154:targetgroup/XXXXXX "Root=1-5ab76d8b-ce30a6a810a89110465a9944" "XXXXXX.com" "arn:aws:acm:eu-central-1:709725828154:certificate/91f68487-7f5d-4ed5-80b9-4a9839ebe0f1" 0

Note the 502 as the elb_status_code in the 10th line. Also note the -1 as the target_status_code.

There is absolutely no trace of that request neither in the apache2 logs, nor the tomcat logs. All four instances remain healthy throughout. The same IP (172.31.28.152 in this case) correctly processes requests before and after.

The sending system is under our control. All the requests are uniform, so it is not some malformed request not being processed correctly. They all look the same.

Any ideas why this could happen? I assume AWS do not sell faulty load balancers…

Best Answer

The solution was in the comments. The ELB has a default connection KeepAlive of 60 seconds. Apache2 has a default connection KeepAlive of 5 seconds. If the 5 seconds are over, the Apache2 closes its connection and resets the connection with the ELB. However, if a request comes in at precisely the right time, the ELB will accept it, decide which host to forward it to, and in that moment, the Apache closes the connection. This will result in said 502 error code.

The solution is: When you have cascading proxies/LBs, either align their KeepAlive timeouts, or - preferrably - even make them a little longer the further down the line you get.

We set the ELB timeout to 60 seconds and the Apache2 timeout to 120 seconds. Problem gone.