I am trying to troubleshoot some odd, intermittent connection failures with Apache. I noticed the issue when users complained that parts of the web application we're hosting weren't working. Debugging revealed that AJAX requests were not returning the XML or JSON data the JavaScript application was expecting. The application is served over SSL.
When I tested it myself, I saw intermittent failures: Firebug showed either that the response length was zero or that the connection had failed completely. The server-side application logs showed no problems; even when Firebug reported an empty response, the application log showed that data had been sent.
On a hunch I fired up ApacheBench (ab) and was surprised to find some connection failures:
[jnet@Stan ~]$ ab -v 1 -n 1000 -c 10 $url
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking workingman.smart-safe-secure.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software: Apache/2.2.3
Server Hostname: workingman.smart-safe-secure.com
Server Port: 443
SSL/TLS Protocol: TLSv1/SSLv3,DHE-RSA-AES256-SHA,1024,256
Document Path: /
Document Length: 659 bytes
Concurrency Level: 10
Time taken for tests: 104.086 seconds
Complete requests: 1000
Failed requests: 2
(Connect: 2, Receive: 0, Length: 0, Exceptions: 0)
Write errors: 0
Total transferred: 945000 bytes
HTML transferred: 659000 bytes
Requests per second: 9.61 [#/sec] (mean)
Time per request: 1040.855 [ms] (mean)
Time per request: 104.086 [ms] (mean, across all concurrent requests)
Transfer rate: 8.87 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      356  844 215.7    840    2268
Processing:    68  194 138.9    128    1483
Waiting:       67  178 122.0    116    1426
Total:        494 1039 241.8    993    2623
Percentage of the requests served within a certain time (ms)
50% 993
66% 1039
75% 1101
80% 1162
90% 1407
95% 1492
98% 1626
99% 1718
100% 2623 (longest request)
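When comparing runs from different locations, it can help to pull the failure counts out of ab's summary rather than eyeballing them. A minimal Python sketch, assuming the summary format shown above (the "Complete requests" and "Failed requests" labels plus the "(Connect: ...)" breakdown line); the function name parse_ab_failures is hypothetical:

```python
import re


def parse_ab_failures(output: str) -> dict:
    """Pull request and failure counts out of ApacheBench's summary text."""

    def grab(label: str) -> int:
        # Matches lines such as "Failed requests:        2"
        m = re.search(rf"^{re.escape(label)}:\s+(\d+)", output, re.MULTILINE)
        if m is None:
            raise ValueError(f"{label!r} not found in ab output")
        return int(m.group(1))

    result = {"complete": grab("Complete requests"), "failed": grab("Failed requests")}

    # Breakdown line, e.g. "(Connect: 2, Receive: 0, Length: 0, Exceptions: 0)".
    # It may be missing (for example when there were no failures), so default to zeros.
    m = re.search(
        r"\(Connect:\s*(\d+),\s*Receive:\s*(\d+),\s*Length:\s*(\d+),\s*Exceptions:\s*(\d+)\)",
        output,
    )
    counts = m.groups() if m else ("0", "0", "0", "0")
    for key, value in zip(("connect", "receive", "length", "exceptions"), counts):
        result[key] = int(value)

    # Failures per completed request: a simple ratio for comparing runs.
    result["failure_rate"] = result["failed"] / result["complete"] if result["complete"] else 0.0
    return result
```

For the run above this gives failed = 2, connect = 2 and a failure_rate of 2/1000 = 0.002. Saving each run's output to a file and feeding it through the parser makes it easy to compare runs from the workstation, from the same data center, and over plain HTTP versus HTTPS.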
These requests were for a static HTML page, so my PHP application doesn't seem to be the issue here. Running the tests over plain HTTP (non-SSL) produced no failures at all. I am at a loss as to what could be happening and not even sure how to troubleshoot from here. I will gladly post my httpd.conf configuration; just let me know which parts would help. The server is Apache/2.2.3 (CentOS) with mpm_worker and mod_fastcgi…
UPDATE: I just had my first ab test return 2 connection failures over plain HTTP for the same HTML page, so it looks like SSL isn't the problem after all.
UPDATE 2: It's possible this is some sort of network issue, because I am not able to replicate it using ab on a server in the same data center, nor am I able to replicate it using ab on localhost. However, pinging the server in question from my workstation shows 0% packet loss, so I am unsure what steps to take next.
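Since the failures show up from the workstation but not from the same data center or localhost, one way to narrow things down is to take Apache out of the picture and time bare TCP connects to the same port from each location. A rough Python sketch; probe_connect is a hypothetical helper, it measures only the TCP handshake (no SSL, no HTTP request), and it assumes an IPv4 address, resolving the hostname once up front:

```python
import socket
import statistics
import time


def probe_connect(host: str, port: int, attempts: int = 100, timeout: float = 5.0) -> dict:
    """Time bare TCP connects to host:port; no TLS handshake and no HTTP request."""
    ip = socket.gethostbyname(host)  # resolve once (IPv4) so DNS time is not counted
    times_ms = []
    failures = {}
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            # Returns once the three-way handshake has completed.
            sock = socket.create_connection((ip, port), timeout=timeout)
        except OSError as exc:  # refused, reset, timed out, unreachable, ...
            name = type(exc).__name__
            failures[name] = failures.get(name, 0) + 1
            continue
        times_ms.append((time.perf_counter() - start) * 1000.0)
        sock.close()

    summary = {"attempts": attempts, "ok": len(times_ms), "failures": failures}
    if times_ms:
        summary["min_ms"] = min(times_ms)
        summary["median_ms"] = statistics.median(times_ms)
        summary["max_ms"] = max(times_ms)
    return summary
```

Running something like print(probe_connect("workingman.smart-safe-secure.com", 443)) from the workstation and from the other server in the data center, then comparing failure counts and the min/median/max spread, would show whether the connect problems exist below Apache. Note that with a typical ssh -L port forward, the connection to port 443 is opened by sshd on the server itself, so only the already-established SSH session crosses the WAN.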
UPDATE 3: In case it helps, if I run ab over an SSH tunnel, I get no failures, so maybe this is a networking issue rather than an Apache issue.
Best Answer
Since it works fine when the requests come from the same data center or go through an SSH tunnel, I think there could be some kind of traffic shaping between your remote site and the data center.
For example, ICMP and SSH (and possibly other protocols) may be given higher priority than HTTP, so if the WAN link becomes overloaded, the router can drop HTTP traffic first. SSH is generally prioritized because it needs high interactivity, while FTP usually gets the lowest priority because it is bulk file transfer.
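To make the shaping idea concrete, here is a toy Python model of a congested link that sends packets in strict priority order and, when its buffer is full, drops the least important packet first. The traffic classes, priority values, buffer size and drain rate are all made-up assumptions for illustration, not a description of any real router's QoS policy:

```python
from collections import Counter

# Hypothetical traffic classes; lower number = higher priority.
PRIORITY = {"icmp": 0, "ssh": 0, "http": 1, "ftp": 2}


def simulate(ticks, capacity: int, drain_per_tick: int) -> Counter:
    """Toy congested link: priority push-out on a full buffer, strict-priority sending.

    ticks is a list of per-tick arrivals, each a list of class names.
    Returns how many packets of each class were dropped.
    """
    if capacity < 1:
        raise ValueError("capacity must be >= 1")
    buffer = []  # packets waiting for the congested WAN link
    dropped = Counter()
    for arrivals in ticks:
        for pkt in arrivals:
            if len(buffer) < capacity:
                buffer.append(pkt)
                continue
            # Buffer full: push out the least important queued packet if the
            # new one outranks it; otherwise drop the new packet.
            worst = max(buffer, key=PRIORITY.__getitem__)
            if PRIORITY[pkt] < PRIORITY[worst]:
                buffer.remove(worst)
                buffer.append(pkt)
                dropped[worst] += 1
            else:
                dropped[pkt] += 1
        # Strict priority: the link sends the most important packets first.
        buffer.sort(key=PRIORITY.__getitem__)
        del buffer[:drain_per_tick]
    return dropped
```

With a sustained overload such as simulate([["ssh", "http", "icmp", "http"]] * 5, capacity=4, drain_per_tick=2), all 8 drops land on HTTP while SSH and ICMP get through untouched, which would be consistent with ping reporting 0% loss while HTTP connections fail.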
Ask your network team whether any traffic shaping or QoS is in place.
Another hint is the connect time, which ranges from 356 ms to 2268 ms. 356 ms is already quite slow; I would guess it is lower when you tunnel through SSH. Such a large gap between the minimum and maximum suggests that some packets are probably being dropped (due to QoS or shaping) and have to be retransmitted, which makes those connections slower.
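The retransmission argument can be put into numbers. When a SYN is lost, the sender waits one retransmission timeout (RTO) before resending, and the timeout doubles after each further loss. RFC 6298 recommends an initial RTO of 1 second (the older RFC 2988 value was 3 seconds), but the actual value depends on the operating system, so treat it as an assumption. A small Python helper, with the hypothetical name added_connect_delay:

```python
def added_connect_delay(lost_syns: int, initial_rto_s: float = 1.0) -> float:
    """Extra seconds a TCP connect waits when its first `lost_syns` SYNs are lost.

    Assumes exponential backoff: wait initial_rto_s after the first loss,
    then double the timeout after each further loss.
    """
    if lost_syns < 0:
        raise ValueError("lost_syns must be >= 0")
    # initial + 2*initial + 4*initial + ... (lost_syns terms) = initial * (2**lost_syns - 1)
    return initial_rto_s * (2 ** lost_syns - 1)
```

With a 1 s initial RTO, one lost SYN adds about 1 s and two add about 3 s. The 2268 ms maximum connect time is 1428 ms above the 840 ms median, which is in the range that a single lost and retransmitted packet would add.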