I'm trying to configure Varnish in front of Drupal, and am just running some quick Apache Bench tests to see what sort of improvement I'm getting.
The specific ab command I'm running is: ab -n 15000 -c 300 -k -H 'Accept-Encoding: gzip,deflate' www.mysite.tld/some-page
Once I set concurrency higher than 300 or so, I start to see a really long tail in the results:
Percentage of the requests served within a certain time (ms)
50% 2
66% 4
75% 4
80% 5
90% 7
95% 19
98% 45
99% 246
100% 5016 (longest request)
This long tail gets worse and worse as the concurrency goes up, but always seems to occur after 95% of connections (give or take).
For the same ab command but with -c set to 500:
50% 1
66% 1
75% 1
80% 2
90% 5
95% 22
98% 365
99% 5060
100% 5061 (longest request)
For c=1000:
50% 1
66% 1
75% 2
80% 2
90% 5
95% 39
98% 5061
99% 27867
100% 27885 (longest request)
I've changed these Varnish settings:
thread_pool_add_delay 1 [milliseconds]
thread_pool_max 5000 [threads]
thread_pool_min 500 [threads]
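For reference, one way to apply these at runtime is via the management CLI, assuming varnishd was started with an admin interface (-T). Parameter names and units vary slightly between Varnish versions; the values below are just the ones from my setup:

```shell
# Runtime tuning via the management CLI (thread_pool_add_delay is in
# milliseconds in Varnish 2/3):
varnishadm param.set thread_pool_min 500
varnishadm param.set thread_pool_max 5000
varnishadm param.set thread_pool_add_delay 1

# Or persistently, as varnishd startup flags:
# varnishd ... -p thread_pool_min=500 -p thread_pool_max=5000 -p thread_pool_add_delay=1
```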
Both the Varnish machine and the machine I'm using for testing are on the same subnet, and are in fact both virtual machines on the same vSphere host. Using siege shows a similar pattern.
Why is there this long tail? How can I get rid of it?
EDIT
I see these messages in dmesg while benchmarking:
[1469125.946204] TCP: Possible SYN flooding on port 80. Sending cookies.
[1469203.735802] TCP: Possible SYN flooding on port 80. Sending cookies.
[1469276.171367] TCP: Possible SYN flooding on port 80. Sending cookies.
And I'll often see interesting things with sockets:
$ netstat -ant | grep 80 | awk '{print $6}' | sort | uniq -c | sort -n
1 TIME_WAIT
2 LISTEN
257 ESTABLISHED
437 CLOSE_WAIT
While benchmarking, it usually flies through requests 0–13500, then goes very, very slow, only sometimes making it to 15000. If it doesn't make it, the connections go to CLOSE_WAIT, presumably because Apache Bench gives up on them while the server side never actually closes the sockets. And while it's waiting, there's usually a whole bunch of sockets just sitting in ESTABLISHED.
Is there something going on where the kernel is sending back SYN cookies, which somehow makes my clients time out and retry?
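As I understand it, that dmesg message usually means the kernel's SYN queue overflowed rather than a genuine attack. A quick sanity check is to compare the queue limits against the benchmark's concurrency (the 4096 below is purely illustrative, not a recommendation):

```shell
# Current SYN queue depth and listen() backlog cap:
sysctl net.ipv4.tcp_max_syn_backlog
sysctl net.core.somaxconn

# Temporarily raising them (illustrative values):
# sysctl -w net.ipv4.tcp_max_syn_backlog=4096
# sysctl -w net.core.somaxconn=4096
```

If these limits are well below the ab concurrency, bursts of connection attempts will overflow the queue and trigger cookies regardless of available CPU.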
EDIT 2
Although it's probably a terrible idea in practice, I did try disabling SYN cookies:
sysctl -w net.ipv4.tcp_syncookies=0
This provides a significantly better result – for c=1000, n=25000, the results look virtually the same as c=500, n=15000 above.
However, I've also come to realize that if I'm serving 3k requests/sec, then I'll actually be very close to the limit of my outbound network pipe. So I'll probably set my net parameters back to their defaults and not worry too much about the benchmark results. But I remain very curious about a) why SYN cookies would cause this slowdown when there's plenty of CPU on both machines, and b) why, even with SYN cookies disabled, I still see a similar result, just at higher c and n values.
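To sanity-check the pipe math (the ~40 KiB gzipped response size here is an assumption for illustration, not a measured figure):

```shell
# 3000 req/s at an assumed ~40 KiB per gzipped response, in Gbit/s:
awk 'BEGIN { rps = 3000; bytes = 40 * 1024; printf "%.2f Gbit/s\n", rps * bytes * 8 / 1e9 }'
```

That comes out around 0.98 Gbit/s, i.e. close to saturating a 1 Gbit NIC, which is why the raw throughput ceiling matters more than further kernel tuning here.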
Best Answer
I cannot see a problem with your benchmark result:
It means that 90% of all requests (or 13,500 of 15,000 requests) will be served within 7 ms.
I assume that the remaining 10% are requests that were passed to the back-end (first-time hits or updates). If a request has to be passed to the back-end, you are no longer benchmarking Varnish's performance but that of the back-end (e.g. Apache and mod_php).
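One way to confirm that theory is to watch the hit/miss counters during a run. Something like the following should work; note the counter names differ across Varnish versions (older releases use plain cache_hit/cache_miss without the MAIN. prefix):

```shell
# Snapshot of hit/miss counters (Varnish 4+ naming assumed):
varnishstat -1 -f MAIN.cache_hit -f MAIN.cache_miss
```

If the miss count during the benchmark roughly matches the number of slow requests, the long tail is the back-end's latency, not Varnish's.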