Our web servers serving static content occasionally show a strange latency of about 3 seconds. A typical ApacheBench run (more than 10000 requests, keepalive off; concurrency 1 or 40 makes no difference) looks like this:
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        2   10  152.8      3   3015
Processing:     2    8   34.7      3    663
Waiting:        2    8   34.7      3    663
Total:          4   19  157.2      6   3222

Percentage of the requests served within a certain time (ms)
  50%      6
  66%      7
  75%      7
  80%      7
  90%      9
  95%     11
  98%    223
  99%    225
 100%   3222 (longest request)
I have tried many things:
– Apache2 2.2.9 with worker or prefork MPM, no difference (with KeepAliveTimeout 10-15)
– Nginx 0.6.32
– various tcp parameters (net.core.somaxconn=3000, net.ipv4.tcp_sack=0, net.ipv4.tcp_dsack=0)
– putting the files/DocumentRoot on tmpfs
– shorewall on or off (i.e. empty iptables or not)
– AllowOverride None is on for /, so no .htaccess checks (verified with strace)
– the problem persists whether the webservers are accessed directly or through a Foundry load balancer
Kernel is 2.6.32 (Debian Lenny backports), but it occurred with 2.6.26 also. IPv6 is enabled, but not used.
Does the issue look familiar to anyone? Help/suggestions are much appreciated. It sounds a bit like a SYN,ACK packet getting lost or ignored.
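One way to test the lost SYN,ACK suspicion would be to time only the TCP handshake, separately from the HTTP request. The ApacheBench figures already point that way (Connect peaks at 3015 ms while Processing stays under 700 ms), and a delay of about 3 seconds would be consistent with the initial SYN retransmission timeout used by 2.6 kernels. The following is a minimal Python sketch, not a replacement for ab; the function names, the 1000 ms threshold, and the host in the usage note are assumptions made for illustration.

```python
import socket
import time


def time_connects(host, port, attempts=1000, timeout=10.0):
    """Open and close `attempts` TCP connections; return connect times in ms."""
    durations_ms = []
    for _ in range(attempts):
        start = time.monotonic()
        # create_connection() returns once the three-way handshake is done,
        # so the elapsed time covers SYN -> SYN,ACK -> ACK only.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        durations_ms.append((time.monotonic() - start) * 1000.0)
    return durations_ms


def slow_connects(durations_ms, threshold_ms=1000.0):
    """Return (attempt index, duration in ms) for connects at or above the threshold."""
    return [(i, d) for i, d in enumerate(durations_ms) if d >= threshold_ms]
```

Calling slow_connects(time_connects("www.example.com", 80)) with the real server name in place of the placeholder lists the attempts that stalled. If the slow attempts cluster just above 3000 ms, a retransmitted handshake packet becomes the more likely explanation.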
Best Answer
Capture this event with tcpdump/Wireshark/tshark. Then open the capture in Wireshark, go to Statistics->TCP stream graph->Time-sequence graph (Stevens).
This gets you a graph of sequence numbers versus time. If there is a 3 second gap in a connection, you should be able to spot it: the x-axis shows no dots for 3 seconds between two dense groupings of dots. Click on the last dot on the left side of the gap. This takes you to the frame just before the gap starts, which is usually the packet that shows the problem. You might see a zero-window packet, a missing packet, out-of-order delivery, duplicates, etc.
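With a large capture, the same gap search can be done on exported timestamps instead of by eye. The sketch below assumes the packet times of one connection were written out one epoch value per line, for example with tshark -r capture.pcap -Y "tcp.stream eq 0" -T fields -e frame.time_epoch (the file name and stream number are placeholders). The helper names and the 2.5 second threshold are illustrative choices, not part of any tool.

```python
def parse_epoch_lines(lines):
    """Turn lines of epoch timestamps (one per line) into floats, skipping blanks."""
    return [float(line) for line in lines if line.strip()]


def find_gaps(timestamps, min_gap=2.5):
    """Return (index, gap in seconds) for every packet followed by a long silence.

    `index` is the position in `timestamps` of the last packet before the gap,
    i.e. the equivalent of the last dot on the left side of the gap in the graph.
    """
    gaps = []
    for i in range(len(timestamps) - 1):
        delta = timestamps[i + 1] - timestamps[i]
        if delta >= min_gap:
            gaps.append((i, delta))
    return gaps
```

The returned index is a 0-based position in the exported list, not a Wireshark frame number, so adding -e frame.number to the export and carrying that column along would be needed to jump straight to the frame.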