SSL – Troubleshooting Apache failed connections

apache-2.2 connection ssl

I am trying to troubleshoot some odd, intermittent connection failures with Apache. I noticed the issue when users complained that parts of the web application we're hosting weren't working. Debugging revealed that AJAX requests were not returning the XML or JSON data the JavaScript application was expecting. The application is served over SSL.

When I tested myself, I would see intermittent failures, and Firebug would show that either the response length was zero, or the connection seemed to fail completely. Application logs on the server showed no problems, including when Firebug reported the response was empty — the application log on the server showed data had been sent.

On a hunch I fired up apachebench (ab) and was surprised to find some connection failures:

[jnet@Stan ~]$ ab -v 1 -n 1000 -c 10 $url
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking workingman.smart-safe-secure.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:        Apache/2.2.3
Server Hostname:        workingman.smart-safe-secure.com
Server Port:            443
SSL/TLS Protocol:       TLSv1/SSLv3,DHE-RSA-AES256-SHA,1024,256

Document Path:          /
Document Length:        659 bytes

Concurrency Level:      10
Time taken for tests:   104.086 seconds
Complete requests:      1000
Failed requests:        2
   (Connect: 2, Receive: 0, Length: 0, Exceptions: 0)
Write errors:           0
Total transferred:      945000 bytes
HTML transferred:       659000 bytes
Requests per second:    9.61 [#/sec] (mean)
Time per request:       1040.855 [ms] (mean)
Time per request:       104.086 [ms] (mean, across all concurrent requests)
Transfer rate:          8.87 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      356  844 215.7    840    2268
Processing:    68  194 138.9    128    1483
Waiting:       67  178 122.0    116    1426
Total:        494 1039 241.8    993    2623

Percentage of the requests served within a certain time (ms)
  50%    993
  66%   1039
  75%   1101
  80%   1162
  90%   1407
  95%   1492
  98%   1626
  99%   1718
 100%   2623 (longest request)

These requests were for a static HTML page, so my PHP application doesn't seem to be the issue here. Running the tests over normal HTTP (non-ssl) produced no failures at all. I am at a loss as to what could be happening… not even sure how to troubleshoot from here. I will gladly post httpd.conf configuration, just let me know what parts would help. Server is Apache/2.2.3 (CentOS), with mpm_worker and mod_fastcgi…

UPDATE: I just had my first ab test return 2 connection failures over normal HTTP, for the same HTML page. So it looks like SSL isn't the problem after all…

UPDATE 2: It's possible this is some sort of network issue, because I am not able to replicate this using ab on a server in the same data center, nor am I able to replicate this using ab on localhost. However pinging the server in question from my workstation shows 0% packet loss… So I am unsure of what steps to take next.

UPDATE 3: In case it helps, if I run ab to benchmark over an SSH tunnel, I get no failures… so maybe this is a networking issue instead of an Apache issue…

Best Answer

You say it works fine when the requests are made from within the same data center or through an SSH tunnel. That makes me think there is some kind of traffic shaping between your remote site and the data center.
For example, ICMP and SSH (among others) may be given higher priority than HTTP, so if the WAN link becomes overloaded the router can drop HTTP traffic. Generally SSH is prioritized because it needs high interactivity, while FTP gets the lowest priority because it is bulk file transfer.
Ask your network team whether any shaping or QoS is in place.
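
The 0% packet loss you saw with ping does not rule this out, even setting aside any prioritization of ICMP. As a rough sketch, assuming losses are independent and the loss rate is of the same order as the 2 failures in 1000 requests that ab reported (both are assumptions, not measurements), the chance that a ping run shows no loss at all can be estimated like this. The name `prob_no_loss` is made up for illustration.

```python
def prob_no_loss(loss_rate, packets):
    """Probability that `packets` independent probes all get through.

    Assumes each probe is lost independently with probability `loss_rate`,
    which is a simplification: real loss under congestion tends to be bursty.
    """
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if packets < 0:
        raise ValueError("packets must not be negative")
    return (1.0 - loss_rate) ** packets


# Assumed loss rate, borrowed from the ab result (2 failed out of 1000).
ASSUMED_LOSS_RATE = 2 / 1000

for packets in (10, 100, 1000):
    chance = prob_no_loss(ASSUMED_LOSS_RATE, packets)
    print(f"{packets:5d} pings: chance of seeing 0% loss = {chance:.3f}")
```

With these assumed numbers a short ping run will almost always report 0% loss, so a much longer run is needed before ping says anything about a problem this rare.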

Another thing that points in this direction is that the connect times range from 356 ms to 2268 ms. 356 ms is already quite slow; I would guess that it is lower through the SSH tunnel. Such a large gap between min and max suggests that some packets are probably being dropped (due to QoS/shaping) and have to be retransmitted, which makes the connect time slower.
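
To check the connect-time spread without ab, something like the following could be run from the workstation and again from a host in the data center. This is only a minimal sketch: it times the bare TCP connect (no SSL handshake, so the numbers will not match ab's "Connect" column for HTTPS), the function names are made up for illustration, and the "3x the median" outlier threshold is an arbitrary assumption.

```python
import socket
import statistics
import time


def measure_connect(host, port, timeout=5.0):
    """Return the TCP connect time in milliseconds, or None if it failed."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return None


def summarize(samples, outlier_factor=3.0):
    """Summarize connect times; None entries are counted as failures."""
    ok = [s for s in samples if s is not None]
    summary = {
        "count": len(samples),
        "failures": len(samples) - len(ok),
        "min": None,
        "median": None,
        "max": None,
        "outliers": 0,
    }
    if ok:
        median = statistics.median(ok)
        summary["min"] = min(ok)
        summary["median"] = median
        summary["max"] = max(ok)
        # Connects far above the median hint at a dropped and retransmitted packet.
        summary["outliers"] = sum(1 for s in ok if s > outlier_factor * median)
    return summary


def run(host, port, attempts=100):
    """Measure `attempts` sequential connects and return the summary."""
    return summarize([measure_connect(host, port) for _ in range(attempts)])
```

Calling `run("your.server.example", 443)` (hostname is a placeholder) from both locations and comparing the `failures` and `outliers` counts should show whether the slow and failed connects only happen across the WAN link.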