AWS ELB Apache2 503 Service Unavailable: Back-end server is at capacity

503-error apache-2.2

We've been running a couple of websites on Amazon's AWS infrastructure for about two years now. As of about two days ago, the webserver started to go down once or twice a day, and the only error I can find is:

HTTP/1.1 503 Service Unavailable: Back-end server is at capacity

No alarms (CPU/Disk IO/DB Conn) are being triggered by CloudWatch. I tried going to the site via the elastic IP to skip the ELB and got this:

HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers. Retrying.

I don't see anything out of the ordinary in the Apache logs, and I verified that they are being rotated properly. I have no problems accessing the machine via SSH when it's "down", and looking at the process list I see 151 apache2 processes that appear normal to me. Restarting Apache temporarily fixes the problem. This machine operates as just a webserver behind an ELB. Any suggestions would be greatly appreciated.

CPU Utilization
Average: 7.45%, Minimum: 0.00%, Maximum: 25.82%

Memory Utilization
Average: 11.04%, Minimum: 8.76%, Maximum: 13.84%

Swap Utilization
Average: N/A, Minimum: N/A, Maximum: N/A

Disk Space Utilization for /dev/xvda1 mounted on /
Average: 62.18%, Minimum: 53.39%, Maximum: 65.49%

To clarify: I think the issue is with the individual EC2 instance and not the ELB. I just didn't want to rule the ELB out, even though I was also unable to reach the site via the elastic IP. I suspect the ELB is just returning the result of hitting the actual EC2 instance.

Update: 2014-08-26
I should have updated this sooner, but the "fix" was to take a snapshot of the "bad" instance and launch a new instance from the resulting AMI. It hasn't gone down since then. I did look at the health check while I was still experiencing issues, and I could get to the health check page (curl http://localhost/page.html) even when I was getting capacity errors from the load balancer. I'm not convinced it was a health check issue, but since no one, including Amazon, can provide a better answer, I'm marking it as the answer. Thank you.
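A plain curl only shows that the page is reachable; the ELB also counts a check as failed if the response takes longer than its response timeout. A minimal sketch of a probe that prints both the status code and the elapsed time is below. The function name probe_health_check and the 5-second default are hypothetical, not values from this load balancer.

```shell
# Print "<status code> <seconds elapsed>" for a health check URL.
# $1 = URL to fetch, $2 = timeout in seconds (hypothetical default: 5).
probe_health_check() {
    url=$1
    timeout=${2:-5}
    # -s silences progress output, -o /dev/null discards the body,
    # --max-time aborts the request once the timeout is exceeded.
    curl -s -o /dev/null --max-time "$timeout" \
        -w '%{http_code} %{time_total}\n' "$url"
}

# Example (run on the instance itself):
# probe_health_check http://localhost/page.html 5
```

Running this in a loop during a traffic spike would show whether the response time creeps toward the configured timeout even though the status code stays at 200.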

Update: 2015-05-06
I thought I'd come back here and say that I now firmly believe part of the issue was the health check settings. I don't want to rule out there being an issue with the AMI, because things definitely got better after the replacement AMI was launched. However, I found out that our health checks were different for each load balancer, and that the one having the most trouble had a really aggressive unhealthy threshold and response timeout. Our traffic tends to spike unpredictably, and I think the combination of aggressive health check settings and traffic spikes was a perfect storm. In diagnosing the issue I was focused on the fact that I could reach the health check endpoint at that moment, but it is possible the health check had already failed because of latency. We also had a high healthy threshold (for that particular ELB), so it would take a while for the instance to be seen as healthy again.
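To make the timing argument concrete: a classic ELB marks an instance unhealthy after a number of consecutive failed checks (the unhealthy threshold) and healthy again after a number of consecutive successful checks (the healthy threshold), with checks sent roughly once per interval. The sketch below is only a rough estimate of how long each transition takes; the function name and the example numbers are hypothetical, not the actual settings from this ELB.

```shell
# Rough estimate, in seconds, of how long an ELB needs to change an
# instance's state: check interval multiplied by the consecutive-check threshold.
# $1 = health check interval in seconds, $2 = threshold (number of checks).
elb_state_change_seconds() {
    echo $(( $1 * $2 ))
}

# Hypothetical settings: 30 s interval, unhealthy threshold 2, healthy threshold 10.
# elb_state_change_seconds 30 2    # prints 60  (about a minute to be pulled out)
# elb_state_change_seconds 30 10   # prints 300 (about five minutes to come back)
```

With numbers like these, a brief latency spike can take the instance out of rotation in about a minute, while recovery takes several minutes, which would look like an outage even though the page responds when fetched by hand.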

Best Answer

You will get "Back-end server is at capacity" when the ELB performs its health checks and receives a "page not found" (or other simple error) due to a misconfiguration (typically of NameVirtualHost).

Try grepping the log directory for the "ELB-HealthChecker" user agent (on Debian/Ubuntu the Apache logs are usually under /var/log/apache2/ rather than /var/log/httpd/), e.g.

grep ELB-HealthChecker  /var/log/httpd/*

This will typically show a 4xx or 5xx error, which is easily fixed. Blaming flooding, MaxClients, etc. is giving the problem way too much credit.
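If the grep output is long, a tally of status codes makes the failing checks easier to spot. This is a minimal sketch, assuming the standard Apache "common" or "combined" log format, where the status code is the 9th whitespace-separated field; the function name elb_health_status_counts is made up for illustration.

```shell
# Count ELB health check requests per HTTP status code.
# Arguments: one or more access log files.
# Assumes Apache common/combined log format (status code in field 9).
elb_health_status_counts() {
    grep -h 'ELB-HealthChecker' "$@" \
        | awk '{ counts[$9]++ } END { for (code in counts) print code, counts[code] }' \
        | sort
}

# Example (adjust the path to wherever your access logs live):
# elb_health_status_counts /var/log/httpd/*access*
```

Each output line is a status code followed by the number of health check requests that returned it, so anything other than 200 stands out immediately.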

FYI Amazon: Why not show the returned response from request? Even a status code would help.