Overwhelmed by “TCP: time wait bucket table overflow” errors: what can I do to mitigate this?

Tags: conntrack, sysctl, tcp, time-wait

I've got a legacy system running Debian 7 (Proxmox) hosting OpenVZ containers, and I'm seeing a troublesome problem: the system is being overwhelmed by open connections to the OpenVZ container running the Apache frontend.

When this happens, the log on the server fills with thousands of "TCP: time wait bucket table overflow (CT233)" errors, coupled with slow responses from the webserver. Is there anything I can do to mitigate this problem?

After googling around, I've made some tweaks to various conntrack settings, but I've been reluctant to do anything too radical without a better understanding of the repercussions (or, indeed, of whether the changes were likely to help at all).

To get an idea of what the situation is, here is the output of "sysctl -a | grep conntrack" when this was happening today:

net.netfilter.nf_conntrack_generic_timeout = 480
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 345600
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
net.netfilter.nf_conntrack_tcp_loose = 1
net.netfilter.nf_conntrack_tcp_be_liberal = 0
net.netfilter.nf_conntrack_tcp_max_retrans = 3
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 180
net.netfilter.nf_conntrack_icmp_timeout = 30
net.netfilter.nf_conntrack_acct = 0
net.netfilter.nf_conntrack_events = 1
net.netfilter.nf_conntrack_events_retry_timeout = 15
net.netfilter.nf_conntrack_max = 131072
net.netfilter.nf_conntrack_count = 128397
net.netfilter.nf_conntrack_buckets = 32768
net.netfilter.nf_conntrack_checksum = 1
net.netfilter.nf_conntrack_log_invalid = 0
net.netfilter.nf_conntrack_expect_max = 256
net.nf_conntrack_max = 131072
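The count and max lines above are the ones to watch: 128397 of 131072 entries means the conntrack table is nearly full, at which point new flows start being dropped. A small sketch for computing that fill ratio (`conntrack_pct` is my own helper name, not a standard tool):

```shell
# Compute conntrack table usage as a percentage. With the values from
# the sysctl dump above (128397 used of 131072 max) the table is ~98%
# full, so the host is on the edge of dropping new connections.
conntrack_pct() {
  awk -v used="$1" -v max="$2" 'BEGIN { printf "%.1f%%\n", 100 * used / max }'
}

# On the affected host (requires the nf_conntrack module to be loaded):
#   conntrack_pct "$(sysctl -n net.netfilter.nf_conntrack_count)" \
#                 "$(sysctl -n net.netfilter.nf_conntrack_max)"
conntrack_pct 128397 131072   # prints 98.0%
```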

This includes a few changes that I made today: I doubled nf_conntrack_buckets from 16384 to 32768, I shrank conntrack_generic_timeout from 600s to 480s, and I shrank conntrack_tcp_timeout_established from 5d to 4d.

The vast majority of the open connections at any given time are in TIME_WAIT.
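One way to confirm that claim is to tally connections by state from `netstat -ant` output (`count_states` is a helper name of my own; on newer systems `ss -tan` gives equivalent output):

```shell
# Tally TCP connections by state from `netstat -ant` output on stdin.
# Skips the two header lines netstat prints, then counts the 6th
# column (the state) and sorts by frequency, most common first.
count_states() {
  awk 'NR > 2 { print $6 }' | sort | uniq -c | sort -rn
}

# Usage on the affected host:
#   netstat -ant | count_states
# A host in the state described above would show TIME_WAIT at the top
# with a count in the tens of thousands.
```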

I'm hoping someone with more knowledge of TCP/kernel tuning than I have can recommend something.

Thanks!

Best Answer

I ended up doubling two other variables: "net.ipv4.tcp_max_tw_buckets" and "net.ipv4.tcp_max_tw_buckets_ub" (the latter is the OpenVZ per-beancounter variant of the limit). Since making those changes, the "time wait bucket table overflow" errors have not recurred. I'm going to keep an eye on it over the next week or so, however, to see whether this has actually fixed the issue.
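To make the change survive a reboot, the doubled values can go in /etc/sysctl.conf (or a file under /etc/sysctl.d/) and be applied with `sysctl -p`. The numbers below are illustrative placeholders, not the actual values from this host; double whatever your own baseline is:

```
# /etc/sysctl.conf additions (example values; use 2x your current ones,
# which you can read with: sysctl -n net.ipv4.tcp_max_tw_buckets)
net.ipv4.tcp_max_tw_buckets = 262144
# OpenVZ-only per-beancounter variant; ignore on non-OpenVZ kernels:
net.ipv4.tcp_max_tw_buckets_ub = 32768
```

Note this treats the symptom: raising the cap lets the kernel hold more TIME_WAIT sockets instead of logging the overflow, at the cost of some extra kernel memory per bucket.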