AWS ELB – Stress test – Transient Error

amazon-elb amazon-web-services apache-2.2 jmeter

I'm stress testing our system. Currently we have 5 m1.large instances running behind an ELB in the east region. In the west region, there are 3 small instances (running JMeter) that I use to hammer the system.

While running a test that pushes the app instances to only about 80%-90% of their CPU limit (our choke point at the time), I'm seeing odd behavior: ELB reports that ALL 5 instances are "Out of service – Transient Error – Please check later", all instances stop getting requests, and after about 5-10 seconds everything goes back to normal. This happens every 30 seconds or so. BUT it doesn't happen every time I run the test: I just ran a half-hour stress test with the same settings and everything worked perfectly. What is going on?

By the way, my health check is:

Ping Target:    HTTP:80/index.html
Timeout:    60 seconds
Interval:    300 seconds
Unhealthy Threshold:    10
Healthy Threshold:    2

So there is no way it's failing that. I had never run into this until yesterday.

Best Answer

We also had a transient problem where instances failed health checks for no apparent reason. After working with Amazon support, it turned out there is an interaction between the ELB and the Apache KeepAliveTimeout. If the health check interval is longer than the timeout, the health checker can try to reuse a stale connection; the check then fails and the ELB tosses your instance out of service. They called our 60-second interval "unusually long." We're still experimenting with it, but try setting your interval low and matching it to the keepalive setting in Apache.
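To make the comparison concrete, here is a minimal Python sketch that reads `KeepAliveTimeout` out of an Apache config and flags a health check interval that is longer than it. The function names and the sample config are hypothetical, the fallback of 5 seconds assumes Apache's documented default when the directive is absent, and the check only encodes the rule of thumb described above, not anything the ELB itself exposes.

```python
import re


def keepalive_timeout_seconds(apache_conf: str, default: int = 5) -> int:
    """Return the last KeepAliveTimeout (in seconds) set in the config text.

    Falls back to `default` (assumed to be Apache's built-in 5 seconds)
    when the directive is not present. Commented-out lines are ignored.
    """
    matches = re.findall(
        r"^\s*KeepAliveTimeout\s+(\d+)\s*$",
        apache_conf,
        flags=re.IGNORECASE | re.MULTILINE,
    )
    return int(matches[-1]) if matches else default


def interval_exceeds_keepalive(interval_s: int, apache_conf: str) -> bool:
    """True when the health check interval is longer than the keepalive timeout."""
    return interval_s > keepalive_timeout_seconds(apache_conf)


if __name__ == "__main__":
    # Hypothetical config; the 300-second interval is the one from the question.
    sample_conf = "KeepAlive On\n# KeepAliveTimeout 15\nKeepAliveTimeout 60\n"
    if interval_exceeds_keepalive(300, sample_conf):
        print("Health check interval is longer than KeepAliveTimeout")
```

With the question's 300-second interval, any keepalive timeout shorter than that would be flagged, which matches the situation the answer describes.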
