I have been testing a server cluster locally for quite a while with no problems. I recently set the cluster up for a live test, noticed problems, and believe the HAProxy in my cluster may be running into trouble.
First I will go over a bit of the structure of the cluster; maybe there is a problem with how I have it set up, or maybe I need multiple proxies.
The HAProxy is balancing two server clusters, which I will call SC1 and SC2. SC1 is the main cluster: anything arriving on port 80 at the HAProxy is sent to SC1. SC1 processes the request and then sends another request to SC2 through the proxy on port 8080. I wouldn't think this would be a problem, but my server logs often say SC1 cannot connect to SC2, which I believe is because my HAProxy is being overloaded.
The reason I think the HAProxy is being overloaded is that my stats page often takes more than a second to respond. Because of this I decided to look at the HAProxy logs, and I noticed an abnormality that I believe may be linked to my problems. Every minute or so (sometimes more, sometimes less), I get the following message:
Oct 8 15:58:52 haproxy rsyslogd-2177: imuxsock begins to drop messages from pid 3922 due to rate-limiting
Oct 8 15:58:52 haproxy kernel: [66958.500434] net_ratelimit: 2997 callbacks suppressed
Oct 8 15:58:52 haproxy kernel: [66958.500436] nf_conntrack: table full, dropping packet
I was wondering what the repercussions of this are. Would this just cause dropped packets, or could it cause delays as well? How can I fix this problem? I am running Ubuntu 12.04 LTS Server.
Here are my sysctl modifications:
fs.file-max = 1000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
Here is my config file:
global
log /dev/log local0 info
log /dev/log local0 notice
maxconn 50000
user u1
group g1
#debug
defaults
log global
mode http
option httplog
option dontlognull
option forwardfor
retries 3
option redispatch
option http-server-close
maxconn 50000
contimeout 10000
clitimeout 50000
srvtimeout 50000
balance roundrobin
listen sc1 255.255.255.1:80
maxconn 20000
server sc1-1 10.101.13.68:80 maxconn 10000
server sc1-2 10.101.13.66:80 maxconn 10000
listen sc1-1_Update 255.255.255.1:8181
maxconn 20000
server sc1-1 10.101.13.66:80 maxconn 20000
listen sc1-2_Update 255.255.255.1:8282
maxconn 20000
server sc1-2 10.101.13.68:80 maxconn 20000
listen sc2 255.255.255.1:8080
maxconn 30000
server sc2-1 10.101.13.74:80 maxconn 10000
server sc2-2 10.101.13.78:80 maxconn 10000
server sc2-3 10.101.13.82:80 maxconn 10000
listen sc2-1_Update 255.255.255.1:8383
maxconn 30000
server sc2-2 10.101.13.78:80 maxconn 15000
server sc2-3 10.101.13.82:80 maxconn 15000
listen sc2-2_Update 255.255.255.1:8484
maxconn 30000
server sc2-1 10.101.13.74:80 maxconn 15000
server sc2-3 10.101.13.82:80 maxconn 15000
listen sc2-3_Update 255.255.255.1:8585
maxconn 30000
server sc2-1 10.101.13.74:80 maxconn 15000
server sc2-2 10.101.13.78:80 maxconn 15000
listen stats :8888
mode http
stats enable
stats hide-version
stats uri /
stats auth user:pass
The sc1 and sc2 listeners are the main clusters. I use the others when I have to update my servers (for example, forwarding port 80 to 8181 on the HAProxy to update server sc1-1).
Any help with this issue would be greatly appreciated.
Thank you
Best Answer
It looks like your connection tracking table is filling up. Removing any iptables rules that use connection tracking would solve the problem.
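As a quick check (exact rule names depend on your firewall setup), you can see how close the table is to its limit and which rules keep conntrack busy:

```shell
# Current number of tracked connections vs. the configured maximum
sysctl net.netfilter.nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max

# Rules that match on connection state are what populate the conntrack table
iptables -L -v -n | grep -iE 'state|ctstate'
```

If the count sits near the maximum, you will see the "table full, dropping packet" messages from your log.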
If that is not an option and you have RAM available you can increase the table size:
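For example (131072 is an illustrative value, not a recommendation; size it to your traffic and RAM, since each entry costs a few hundred bytes of kernel memory):

```shell
# Raise the conntrack table limit at runtime (requires root)
sysctl -w net.netfilter.nf_conntrack_max=131072
```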
You should probably increase the hashsize as well:
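The hash size is a module parameter rather than a sysctl; a common rule of thumb is conntrack_max / 8. Assuming the illustrative 131072 above, that would be something like:

```shell
# Resize the conntrack hash table (requires root; nf_conntrack must be loaded)
echo 16384 > /sys/module/nf_conntrack/parameters/hashsize
```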
Those numbers are just double the default settings on my desktop, I'm not sure what exactly you would need. You'll also want to add that to sysctl.conf.
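To make the table size survive a reboot, the sysctl can go in /etc/sysctl.conf (the hash size, being a module parameter, has to be set elsewhere, e.g. a modprobe options file); again using the illustrative value:

```shell
# /etc/sysctl.conf
net.netfilter.nf_conntrack_max = 131072
```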
I would be really careful using net.ipv4.tcp_tw_recycle; it can cause serious problems for clients behind NAT.