Linux – Apache performance degrades dramatically above ~256 simultaneous requests

apache-2.2, linux, mpm-prefork, performance

I'm running a relatively low-traffic site that experiences a large spike in visitors once a week following a site update. During this spike, site performance is extremely poor compared to the rest of the week. The actual load on the servers remains very low, reliably under 10% CPU and under 30% RAM (the hardware should be complete overkill for what we're actually doing), but for some reason Apache seems to be unable to cope with the quantity of requests. We are running Apache 2.2.3 on RHEL 5.7, kernel 2.6.18-274.7.1.el5, x86_64.

Attempting to reproduce this behavior during off hours with ab, I'm finding a major drop-off in performance when exceeding approximately 256 users. Running the test with the smallest possible use case I could come up with (a 223-byte static text file being retrieved), performance is consistently normal at 245 simultaneous requests:

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       15   25   5.8     24      37
Processing:    15   65  22.9     76      96
Waiting:       15   64  23.0     76      96
Total:         30   90  27.4    100     125

Percentage of the requests served within a certain time (ms)
  50%    100
  66%    108
  75%    111
  80%    113
  90%    118
  95%    120
  98%    122
  99%    123
 100%    125 (longest request)
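
For reference, the ab invocations behind these numbers were along these lines; the hostname, path, and total request count below are placeholders rather than the real values:

# -c sets the concurrency level and -n the total number of requests;
# the URL stands in for the real 223-byte static text file.
ab -n 10000 -c 245 http://www.example.com/static/test.txt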

But as soon as I ratchet up to 265 simultaneous requests, a subset of them starts taking an absurd amount of time to complete:

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       13  195 692.6     26    3028
Processing:    15   65  21.3     72     100
Waiting:       15   65  21.3     71      99
Total:         32  260 681.7    101    3058

Percentage of the requests served within a certain time (ms)
  50%    101
  66%    108
  75%    112
  80%    116
  90%    121
  95%   3028
  98%   3040
  99%   3044
 100%   3058 (longest request)

These results are very consistent across multiple runs. Since there is other traffic going to that box, I'm not sure precisely where the hard cutoff would be, if there is one, but it seems to be suspiciously close to 256.

Naturally, I assumed that this was being caused by the process limit in prefork, so I went ahead and adjusted the configuration to double the number of available server processes and to prevent the pool from growing and shrinking unnecessarily:

<IfModule prefork.c>
StartServers     512
MinSpareServers  512
MaxSpareServers  512
ServerLimit      512
MaxClients       512
MaxRequestsPerChild  5000
</IfModule>
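
One gotcha worth noting: ServerLimit is only read at startup, so a graceful restart will not pick up the new value; a config check followed by a full stop/start is needed, something like:

# Validate the new configuration, then do a full stop/start so the larger
# ServerLimit (and scoreboard) takes effect.
apachectl configtest && service httpd restart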

mod_status confirms that I am now running with 512 available workers:

8 requests currently being processed, 504 idle workers
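
That line came from mod_status; assuming the handler is mapped at the usual /server-status location and reachable from localhost, the worker counts can also be checked in machine-readable form with something like:

# ?auto prints the status as plain text, including BusyWorkers/IdleWorkers.
curl -s http://localhost/server-status?auto | egrep 'BusyWorkers|IdleWorkers'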

However, attempting 265 simultaneous requests still yields results almost identical to before:

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       25  211 714.7     31    3034
Processing:    17   94  28.6    103     138
Waiting:       17   93  28.5    103     138
Total:         57  306 700.8    138    3071

Percentage of the requests served within a certain time (ms)
  50%    138
  66%    145
  75%    150
  80%    161
  90%    167
  95%   3066
  98%   3068
  99%   3068
 100%   3071 (longest request)

After scouring the documentation (and Stack Exchange), I'm at a loss for further configuration settings that might address this bottleneck. Is there something I'm missing? Should I start looking for answers outside of Apache? Has anyone else seen this behavior? Any help would be greatly appreciated.

EDIT:

As per Ladadadada's advice, I ran strace against Apache. I tried with -tt and -T a few times and couldn't find anything out of the ordinary. I then tried running strace -c against all currently running Apache processes and got this:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 22.09    0.317836           5     62128      4833 open
 19.91    0.286388           4     65374      1896 lstat
 13.06    0.187854           0    407433           pread
 10.70    0.153862           6     27076           semop
  7.88    0.113343           3     38598           poll
  6.86    0.098694           1    100954     14380 read

(…abridged)

If I'm reading this right (and bear with me, as I don't use strace very often), none of the system calls can account for the amount of time these requests are taking. It almost looks like the bottleneck is occurring before the requests even get to the worker processes.

EDIT 2:

As several people suggested, I ran the test again on the web server itself (previously the test was run from a neutral internet location). The results were surprising:

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   11   6.6     12      21
Processing:     5  247 971.0     10    4204
Waiting:        3  245 971.3      7    4204
Total:         16  259 973.3     21    4225

Percentage of the requests served within a certain time (ms)
  50%     21
  66%     23
  75%     24
  80%     24
  90%     26
  95%   4225
  98%   4225
  99%   4225
 100%   4225 (longest request)

The bottom-line times are similar to the internet-based test, but appear to be consistently a bit worse when run locally. More interestingly, the profile has changed dramatically. Whereas before the bulk of the long-running requests' time was spent in "connect", now the bottleneck appears to be in either processing or waiting. I'm left to suspect that this may actually be a separate issue that was previously being masked by network limitations.

Running the test again from another machine on the same local network as the Apache host, I'm seeing much more reasonable results:

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        1    2   0.8      2       4
Processing:    13  118  99.8    205     222
Waiting:       13  118  99.7    204     222
Total:         15  121  99.7    207     225

Percentage of the requests served within a certain time (ms)
  50%    207
  66%    219
  75%    220
  80%    221
  90%    222
  95%    224
  98%    224
  99%    225
 100%    225 (longest request)

These two tests together raise a number of questions, but separately from that, there is now a compelling case for some kind of severe network bottleneck under a certain amount of load. I think the next step will be to investigate the network layer separately.
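
As a first low-effort check in that direction, it's probably worth watching the kernel's listen-queue counters on the Apache host while the test runs; on RHEL 5 something along these lines should work (the exact counter wording varies a bit between kernel versions):

# Overflowed listen queues / dropped SYNs climbing during an ab run would
# mean connections are being dropped before Apache ever sees them.
netstat -s | egrep -i 'listen|overflow'

# The kernel-side cap on a listening socket's backlog.
sysctl net.core.somaxconn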

Best Answer

What I would do in this situation is run

strace -f -p <PID> -tt -T -s 500 -o trace.txt

on one of your Apache processes during the ab test until you capture one of the slow responses. Then have a look through trace.txt.

The -tt and -T options give you timestamps of the start and duration of each system call to help identify the slow ones.

You might find a single slow system call such as open() or stat(), or you might find a quick call followed directly by one or more poll() calls. If you find one that's operating on a file or network connection (quite likely), look backwards through the trace until you find that file or connection handle. The earlier calls on that same handle should give you an idea of what the poll() was waiting for.
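
A quick way to pull the slow calls out of a large trace is to lean on the duration that -T appends to every line, for example:

# Print syscalls that took one second or longer, based on the trailing
# duration added by -T, e.g. "<3.000123>".
grep -E '<[1-9][0-9]*\.[0-9]+>$' trace.txt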


Good idea looking at the -c option. Did you ensure that the Apache child you were tracing served at least one of the slow requests during that time? (I'm not even sure how you would do this apart from running strace simultaneously on all children.)
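
One way to do that, for what it's worth, is to attach a single strace to every child at once; this sketch assumes the children are named httpd (as on RHEL) and relies on strace accepting multiple -p options:

# Attach to all running httpd processes; with -c a combined summary is
# printed when the trace is interrupted with Ctrl-C.
strace -c -f $(pgrep httpd | sed 's/^/-p /')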

Unfortunately, strace doesn't give us the complete picture of what a running program is doing; it only tracks system calls. A lot can happen inside a program that doesn't require asking the kernel for anything. To figure out whether this is happening, you can look at the timestamps of the start of each system call. If you see significant gaps, that's where the time is going. This isn't easily greppable, and there are always small gaps between the system calls anyway.
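
If it helps, one rough way to hunt for those gaps (a sketch that assumes the trace was made with -f and -tt, so each line starts with the PID and then the wall-clock timestamp):

# Flag gaps of more than half a second between consecutive syscall start
# times within the same process.
awk '{
  pid = $1; split($2, t, ":");            # "PID HH:MM:SS.microseconds ..."
  now = t[1]*3600 + t[2]*60 + t[3];
  if (pid in prev && now - prev[pid] > 0.5)
    printf "PID %s: %.3fs gap before: %s\n", pid, now - prev[pid], $0;
  prev[pid] = now;
}' trace.txt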

Since you said the CPU usage stays low, it's probably not excessive work happening in between system calls, but it's worth checking.


Looking more closely at the output from ab:

The sudden jump in the response times (it looks like there are no response times anywhere between 150 ms and 3000 ms) suggests that there is a specific timeout happening somewhere that gets triggered above around 256 simultaneous connections. A smoother degradation would be expected if you were running out of RAM, CPU cycles, or normal I/O.

Secondly, the slow ab results show that the 3000 ms were spent in the connect phase. Nearly all of the connections took around 30 ms, but 5% took 3000 ms. This suggests that the network is the problem.

Where are you running ab from? Can you try it from the same network as the Apache machine?

For more data, try running tcpdump at both ends of the connection (preferably with ntp running at both ends so you can sync the two captures up) and look for any TCP retransmissions. Wireshark is particularly good for analysing the dumps because it highlights TCP retransmissions in a different colour, making them easy to find.
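
Something along these lines on each end would do it (the interface name is an assumption about your setup):

# Capture full packets on port 80 to a file for later analysis in Wireshark.
tcpdump -i eth0 -s 0 -w ab-run.pcap 'tcp port 80'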

It might also be worth looking at the logs of any network devices you have access to. I recently ran into a problem with one of our firewalls where it could handle the bandwidth in terms of kb/s but couldn't handle the number of packets per second it was receiving; it topped out at 140,000 packets per second. Some quick maths on your ab run leads me to believe you would have been seeing around 13,000 packets per second (ignoring the 5% of slow requests). Maybe this is the bottleneck you have reached. The fact that it happens around 256 might be purely a coincidence.
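
(Roughly: 265 concurrent requests finishing in around 100 ms each works out to about 2,650 requests per second, and a short non-keepalive HTTP exchange is a handful of packets in each direction, which puts you in that 13,000 packets-per-second ballpark.)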