I have been beating my head against this HAProxy issue for entirely to long, and I am hoping someone here has seen this before.
Here is some information:
- Our HAP instances are in AWS across three regions
- The only traffic these instances are receiving are from our clients
- Each of our clients has an HAProxy install that forwards requests from their end users to us on 80 and 443 to 1025 and 1026 respectively.
- These requests are forwarded over TCP using proxy protocol to our HAP instances.
- Our HAP instances then SSL term the request and forward them off to our backend on port 80.
- Our routing is all done inside of Route53 with health checks.
- These health checks will mark as failed for the instance if the instance doesn't fails to reply back in time, three times in 30 seconds. The checks go off around 4 times a second.
Now for the problem: Every so often these servers are hanging on SSL handshakes, causing a the handshake to reset (found this in a tcpdump during a failure) and thus causing the health checks to hang long enough to cause a failure. This happened about 450 times over the weekend across 6 instances.
Memory and CPU aren't spikey enough to cause for any alarm, even during the hanging of the handshake.
Here is the config for the HAP Instances:
# HAProxy Config
#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
log 127.0.0.1 local2
pidfile /var/run/haproxy.pid
maxconn 30000
user haproxy
group haproxy
daemon
ssl-default-bind-options no-sslv3 no-tls-tickets
tune.ssl.default-dh-param 2048
# turn on stats unix socket
# stats socket /var/lib/haproxy/stats`
#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
mode http
log global
option httplog
retries 3
timeout http-request 5s
timeout queue 1m
timeout connect 31s
timeout client 31s
timeout server 31s
maxconn 15000
# Stats
stats enable
stats uri /haproxy?stats
stats realm Strictly\ Private
stats auth $StatsUser:$StatsPass
#---------------------------------------------------------------------
# main frontend which proxys to the backends
#---------------------------------------------------------------------
frontend shared_incoming
maxconn 15000
timeout http-request 5s
# Bind ports of incoming traffic
bind *:1025 accept-proxy # http
bind *:1026 accept-proxy ssl crt /path/to/default/ssl/cert.pem ssl crt /path/to/cert/folder/ # https
bind *:1027 # Health checking port
acl gs_texthtml url_reg \/gstext\.html ## allow gs to do meta tag verififcation
acl gs_user_agent hdr_sub(User-Agent) -i globalsign ## allow gs to do meta tag verififcation
# Add headers
http-request set-header $Proxy-Header-Ip %[src]
http-request set-header $Proxy-Header-Proto http if !{ ssl_fc }
http-request set-header $Proxy-Header-Proto https if { ssl_fc }
# Route traffic based on domain
use_backend gs_verify if gs_texthtml or gs_user_agent ## allow gs meta tag verification
use_backend %[req.hdr(host),lower,map_dom(/path/to/map/file.map,unknown_domain)]
# Drop unrecognized traffic
default_backend unknown_domain
#---------------------------------------------------------------------
# Backends
#---------------------------------------------------------------------
backend server0 ## added to allow gs ssl meta tag verification
reqrep ^GET\ /.*\ (HTTP/.*) GET\ /GlobalSignVerification\ \1
server server0_http server0.domain.com:80/GlobalSignVerification/
backend server1
server server1_http server1.domain.net:80
backend server2
server server2_http server2.domain.net:80
backend server3
server server3_http server3.domain.net:80
backend server4
server server4_http server4.domain.net:80
backend server5
server server5_http server5.domain.net:80
backend server6
server server6_http server6.domain.net:80
backend server7
server server7_http server7.domain.net:80
backend server8
server server8_http server8.domain.net:80
backend server9
server server9_http server9.domain.net:80
backend unknown_domain
timeout connect 4s
timeout server 4s
errorfile 503 /etc/haproxy-shared/errors/404.html
Best Answer
If SSL is involved I'd take a look at your entropy pool -- perhaps you've exhausted it. Keep in eye on "cat /proc/sys/kernel/random/entropy_avail" and see if it drops down to 0ish when you see the problem.
(Ref: https://major.io/2007/07/01/check-available-entropy-in-linux/)
If so you might look at installing rngd to add to the pool above what the kernel already does.