Linux – Apache performance issue, after “1000 total children” Apache no longer responds to HTTP requests. Not MaxClients issue

apache-2.2 linux performance tuning

I'm hoping someone can point me in the right direction. I've spent the last week trying to figure out where the issue is, without success. I've also posted to the Apache users mailing list, but wanted to bounce it off here as well.

We are running Apache 2.2.3 with mod_php on CentOS 5.8.

At the same time every day, when traffic is heavy, Apache stops responding to any HTTP requests.

It sounded like a standard case of MaxClients being reached, but that doesn't seem to be what is happening.

Also, when I log into the machine during this time, the load average is under 1 and there is still plenty of RAM available.

Reviewing /var/log/httpd/error_log I've noticed the following patterns:

[Mon Apr 30 07:00:34 2012] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 32 children, there are 0 idle, and 905 total children
[Mon Apr 30 07:00:35 2012] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 32 children, there are 0 idle, and 937 total children
[Mon Apr 30 07:00:36 2012] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 32 children, there are 0 idle, and 969 total children
[Mon Apr 30 07:00:37 2012] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 32 children, there are 35 idle, and 1001 total children

[Mon Apr 30 07:00:42 2012] [debug] mpm_common.c(663): (70007)The timeout specified has expired: connect to listener on [::]:80
[Mon Apr 30 07:00:49 2012] [debug] mpm_common.c(663): (70007)The timeout specified has expired: connect to listener on [::]:80
[Mon Apr 30 07:00:56 2012] [debug] mpm_common.c(663): (70007)The timeout specified has expired: connect to listener on [::]:80
[Mon Apr 30 07:01:03 2012] [debug] mpm_common.c(663): (70007)The timeout specified has expired: connect to listener on [::]:80

A few times a day, right after reaching 1000 total children, Apache stops responding and has to be restarted before it works again.

I've reviewed the error_log from a few weeks back and it's the same pattern, the server hits 1000 total children and then immediately spits out the
[debug] mpm_common.c(663): (70007)The timeout specified has expired: connect to listener on [::]:80 error message and stops responding.

Yet the load on the server is quite low.
Even a request for a simple index.html file times out.

Here is the relevant section from the config:

Timeout 45
KeepAlive On
MaxKeepAliveRequests 10000
KeepAliveTimeout 3

<IfModule prefork.c>
StartServers      80
MinSpareServers   50
MaxSpareServers  120
ServerLimit     3500
MaxClients      3500
MaxRequestsPerChild  2000
</IfModule>

Does anyone know why Apache stops processing requests once it reaches the magic number of 1000 children?

Or how to make sense of the (70007)The timeout specified has expired: connect to listener on [::]:80 message?

What "timeout specified" is it referring to?

I've double-checked Max Open Files. It was previously at 1024 and is now at 16384, but the problem is still the same.

Best Answer

It is a long shot, but I have had problems like this one. I don't remember the exact error message, but the cause has always been a buggy PHP program that created recursive requests (i.e. the program requests a URL, which in turn requests the same URL again, and so on). I have seen this, for example, with ErrorDocument settings, where the document that should have handled the error was buggy or nonexistent and therefore triggered an error itself.

You can easily check whether this is the problem by looking at your access.log: you should see a large number of requests from your server's own IP address, all within a very short time. The recursion continues until you hit the MaxClients setting or until your system runs out of resources. The only remedy is to fix the PHP program in question.
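As a rough way to run that check, here is a minimal Python sketch that counts requests per second coming from the server's own addresses. It assumes the access log uses the Common or Combined Log Format (client address first, timestamp in square brackets). The function names, the sample addresses and the threshold of 20 are made up for illustration and are not part of Apache.

```python
import re
from collections import Counter

# Client address and bracketed timestamp at the start of a
# Common/Combined Log Format line (assumed log format).
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')


def count_requests_per_second(lines, server_ips):
    """Count requests per (client IP, timestamp) for clients in server_ips."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(1) in server_ips:
            counts[(match.group(1), match.group(2))] += 1
    return counts


def find_bursts(lines, server_ips, threshold=20):
    """Return ((ip, timestamp), count) pairs with at least `threshold` hits.

    The threshold is an arbitrary illustrative value; tune it to your traffic.
    """
    counts = count_requests_per_second(lines, server_ips)
    return sorted((key, n) for key, n in counts.items() if n >= threshold)
```

To use it, open the access log (on CentOS the default is usually /var/log/httpd/access_log), pass the file object as `lines` and give a set containing the server's own addresses, for example `find_bursts(open(path), {"127.0.0.1", "203.0.113.10"})`. Any seconds it reports just before Apache stops responding would be consistent with a request loop, though they do not prove one.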