Nginx – Why is the web server dropping connections with a TCP reset at high load?

high-load, nginx, reset, tcp

I have a small VPS setup with nginx. I want to squeeze as much performance as possible out of it, so I've been experimenting with optimization and load testing.

I'm using Blitz.io for load testing, sending GET requests for a small static text file, and I'm running into an odd issue where the server appears to send TCP resets once the number of simultaneous connections reaches roughly 2000. I know that's a large number, but according to htop the server still has plenty of CPU time and memory to spare, so I'd like to find the source of this issue and see if I can push it even further.
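
For anyone who wants to reproduce something similar without Blitz.io, a roughly equivalent run can be done with a tool like wrk. This is just a sketch; the URL is a placeholder and the connection count mirrors the numbers above:

    # 4 threads, 2000 concurrent connections, 60-second run,
    # fetching a small static file (placeholder URL)
    wrk -t4 -c2000 -d60s http://my-vps.example.com/test.txt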

I am running Ubuntu 14.04 LTS (64-bit) on a 2GB Linode VPS.

I don't have enough reputation to post the graph directly, so here's a link to the Blitz.io graph:

[Blitz.io load test graph]

Here's what I've done to try to track down the source of the issue:

  • The nginx config value worker_rlimit_nofile is set to 8192.
  • nofile is set to 64000 for both hard and soft limits, for both the root and www-data users (nginx runs as www-data), in /etc/security/limits.conf.
  • There are no indications that anything is going wrong in /var/log/nginx.d/error.log (typically, if you're hitting file descriptor limits, nginx prints error messages saying so).

  • I have ufw set up, but no rate-limiting rules. The ufw log indicates nothing is being blocked, and I have tried disabling ufw entirely with the same result.

  • There are no indicative errors in /var/log/kern.log.
  • There are no indicative errors in /var/log/syslog.
  • I have added the following values to /etc/sysctl.conf and loaded them with sysctl -p, with no effect (see the sketch after this list for how I'd verify the related settings):

    net.ipv4.tcp_max_syn_backlog = 1024
    net.core.somaxconn = 1024
    net.core.netdev_max_backlog = 2000
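
For completeness, here's a sketch of how I'd double-check that these limits actually took effect; the backlog=1024 value is illustrative, not something from my current config:

    # Confirm the file-descriptor limit the running workers actually
    # inherited (limits.conf only applies to PAM sessions)
    for pid in $(pgrep -u www-data nginx); do
        grep 'open files' /proc/$pid/limits
    done

    # net.core.somaxconn caps the accept queue, but nginx's own listen
    # backlog defaults to 511 on Linux unless raised explicitly, e.g.
    # in the server block:  listen 80 backlog=1024;
    sysctl net.core.somaxconn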
    

Any ideas?

EDIT: I ran a new test, ramping up to 3000 connections, on a very small file (only 3 bytes). Here's the Blitz.io graph:

[Blitz.io graph]

Again, according to Blitz all of these errors are "TCP Connection reset" errors.

Here's the Linode bandwidth graph. Keep in mind this is a 5-minute average, so it's somewhat low-pass filtered (instantaneous bandwidth is probably much higher); but still, this is nothing:

[Linode bandwidth graph]

CPU:

[Linode CPU graph]

I/O:

[Linode I/O graph]

Here's htop near the end of the test:

[htop screenshot]

I also captured some of the traffic using tcpdump during a different (but similar-looking) test, starting the capture when the errors started coming in:

    sudo tcpdump -nSi eth0 -w /tmp/loadtest.pcap -s0 port 80

Here's the file if anyone wants to take a look at it (~20MB): https://drive.google.com/file/d/0B1NXWZBKQN6ETmg2SEFOZUsxV28/view?usp=sharing
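
To pull just the resets out of the capture and see which side is sending them, a tcpdump read filter like this should work (same file path as the capture command above):

    # Print only packets with the RST flag set; the source address
    # column shows which end is generating the resets
    tcpdump -nr /tmp/loadtest.pcap 'tcp[tcpflags] & tcp-rst != 0'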

Here's a bandwidth graph from Wireshark:

[Wireshark I/O graph]
(Line is all packets, blue bars are TCP errors)

From my interpretation of the capture (and I'm no expert), the TCP RST flags look like they're coming from the load-testing source, not from the server. So, assuming nothing is wrong on the load-testing service's side, is it safe to assume this is the result of some sort of network management or DDoS mitigation between the load-testing service and my server?

Thanks!

Best Answer

To raise the maximum number of open files (if that is what's causing your issue), you need to add fs.file-max = 64000 to /etc/sysctl.conf.
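
A minimal sketch of applying and verifying that setting (64000 simply mirrors the nofile limit from the question):

    # System-wide cap on open file handles
    echo 'fs.file-max = 64000' | sudo tee -a /etc/sysctl.conf

    # Apply without rebooting, then confirm the new value
    sudo sysctl -p
    sysctl fs.file-max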
