I was actually unable to reproduce this on:
2011/08/20 20:08:43 [notice] 8925#0: nginx/0.8.53
2011/08/20 20:08:43 [notice] 8925#0: built by gcc 4.1.2 20080704 (Red Hat 4.1.2-48)
2011/08/20 20:08:43 [notice] 8925#0: OS: Linux 2.6.39.1-x86_64-linode19
I set this up in my nginx.conf:
proxy_connect_timeout 10;
proxy_send_timeout 15;
proxy_read_timeout 20;
I then set up two test servers: one that would just time out on the SYN, and one that would accept connections but never respond:
upstream dev_edge {
    server 127.0.0.1:2280 max_fails=0 fail_timeout=0s; # accepts but never responds
    server 10.4.1.1:22 max_fails=0 fail_timeout=0s; # SYN timeout
}
Then I sent in one test connection:
[m4@ben conf]$ telnet localhost 2480
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET / HTTP/1.1
Host: localhost
HTTP/1.1 504 Gateway Time-out
Server: nginx
Date: Sun, 21 Aug 2011 03:12:03 GMT
Content-Type: text/html
Content-Length: 176
Connection: keep-alive
Then I watched the error_log, which showed this:
2011/08/20 20:11:43 [error] 8927#0: *1 upstream timed out (110: Connection timed out) while connecting to upstream, client: 127.0.0.1, server: ben.dev.b0.lt, request: "GET / HTTP/1.1", upstream: "http://10.4.1.1:22/", host: "localhost"
then:
2011/08/20 20:12:03 [error] 8927#0: *1 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: ben.dev.b0.lt, request: "GET / HTTP/1.1", upstream: "http://127.0.0.1:2280/", host: "localhost"
And then the access.log, which has the expected 30s of upstream timeouts (10+20):
504:32.931:10.003, 20.008:.:176 1 127.0.0.1 localhost - [20/Aug/2011:20:12:03 -0700] "GET / HTTP/1.1" "-" "-" "-" dev_edge 10.4.1.1:22, 127.0.0.1:2280 -
Here is the log format I'm using, which includes the individual upstream timeouts:
log_format edge '$status:$request_time:$upstream_response_time:$pipe:$body_bytes_sent $connection $remote_addr $host $remote_user [$time_local] "$request" "$http_referer" "$http_user_agent" "$http_x_forwarded_for" $edge $upstream_addr $upstream_cache_status';
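When nginx tries more than one upstream for a request, it writes one time per attempt into $upstream_response_time, comma-separated (colon-separated for internal redirects), with "-" where no time was recorded. A rough parser for that field (my own helper, not part of nginx):

```python
def parse_upstream_times(field):
    """Split an nginx $upstream_response_time value into floats.
    Attempts on different upstreams are comma-separated, internal
    redirects colon-separated, and '-' means no time was recorded."""
    times = []
    for part in field.replace(":", ",").split(","):
        part = part.strip()
        if part and part != "-":
            times.append(float(part))
    return times

# The 504 above recorded "10.003, 20.008": a 10s connect timeout on one
# upstream plus a 20s read timeout on the other.
print(round(sum(parse_upstream_times("10.003, 20.008")), 3))  # → 30.011
```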
Key points:
- Don't bother with upstream blocks for failover, if pinging one server will bring another one up - there's no way to tell nginx (at least, not the FOSS version) that the first server is up again. nginx will try the servers in order on the first request, but not on follow-up requests, despite any backup, weight, or fail_timeout settings.
- You must enable recursive_error_pages when implementing failover using error_page and named locations.
- Enable proxy_intercept_errors to handle error codes sent from the upstream server.
- The = syntax (e.g. error_page 502 = @handle_502;) is required to correctly handle error codes in the named location. If = is not used, nginx will use the error code from the previous block.
Here is a summary:
server {
    listen ...;
    server_name $DOMAINS;
    recursive_error_pages on;

    # First, try "Upstream A"
    location / {
        error_page 418 = @backend;
        return 418;
    }

    # Define "Upstream A"
    location @backend {
        proxy_pass http://$IP:81;
        proxy_set_header X-Real-IP $remote_addr;
        # Add your proxy_* options here
    }

    # On error, go to "Upstream B"
    error_page 502 = @handle_502;

    # Fallback static error page, in case "Upstream B" fails
    root /home/nginx/www;
    location = /_static_error.html {
        internal;
    }

    # Define "Upstream B"
    location @handle_502 { # What to do when the backend server is not up
        proxy_pass ...;
        # Add your proxy_* options here
        proxy_intercept_errors on; # Look at the error codes returned from "Upstream B"
        error_page 502 /_static_error.html; # Fall back to the error page if "Upstream B" is down
        error_page 451 = @backend; # Try "Upstream A" again
    }
}
Original answer / research log follows:
Here's a better workaround I found, which is an improvement since it doesn't require a client redirect:
upstream aba {
    server $BACKEND-IP;
    server 127.0.0.1:82 backup;
    server $BACKEND-IP backup;
}
...
location / {
    proxy_pass http://aba;
    proxy_next_upstream error http_502;
}
Then, just get the control server to return 502 on "success" and hope that code is never returned by backends.
Update: nginx keeps marking the first entry in the upstream block as down, so it does not try the servers in order on successive requests. I've tried adding weight=1000000000 fail_timeout=1 to the first entry with no effect. So far I have not found any solution which does not involve a client redirect.
Edit: One more thing I wish I knew - to get the error status from the error_page handler, use this syntax: error_page 502 = @handle_502; - that equals sign will cause nginx to get the error status from the handler.
Edit: And I got it working! In addition to the error_page fix above, all that was needed was enabling recursive_error_pages!
Best Answer
There's no option for proxy_next_upstream to implement the behavior you describe. Your application should not return an HTTP 200 if it couldn't actually process the request. Have the application return a more appropriate error, such as 500 or 503.