Linux – Apache web server intermittent stalls

apache-2.2, centos, linux, php

Our SOAP web server is written in PHP, runs on Apache on CentOS, and makes heavy use of MySQL. There is heavy demand on the server – most requests are very small and involve only two or three MySQL queries, but there are an awful lot of them – potentially a couple of hundred per second at peak times. The data transferred with each request is usually less than 1 KB, often only a few bytes.

The hardware this is running on is pretty decent (18 cores with 32 GB of RAM), and it generally copes really well. CPU usage never really goes above 30%, and physical RAM consumption never goes above 50%. However, every so often, the server appears to stall and Apache chokes up. This can last for around a minute before it loosens up again and normal service resumes.

I've analysed this in quite some depth to see what is going on during the stalls. Apache is maxed out on its connections, pretty much all of which are in the 'reading' state. CPU usage drops to pretty much nothing, memory usage doesn't change, and network and disk IO both plummet, so it looks like the system is just completely idle.

After doing a lot of Googling, I was led to believe this could be to do with some timeout settings – network connections not being freed up quickly enough, and Apache running out. This would explain why Apache resumes normal operation after a while: it waits for them all to time out, then carries on. Running 'netstat -an' would support this, as I do see a lot of connections in TIME_WAIT. However, I've reduced all sorts of timeout settings in the Apache configuration, and also various net settings in sysctl.conf, but nothing appears to resolve the issue.
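
For anyone wanting to watch the connection states described above, here is a minimal sketch that tallies TCP states from 'netstat -an' style text. The function name and the sample rows (including the addresses) are made up for illustration, and it assumes the usual Linux column layout with the state in the sixth column.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections per state in `netstat -an` style text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed layout: proto, recv-q, send-q, local, foreign, state.
        # Header and UDP rows are skipped by the "tcp" prefix check.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Made-up sample rows, not output from the server in question.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:443            10.0.0.9:51234          TIME_WAIT
tcp        0      0 10.0.0.5:443            10.0.0.9:51235          TIME_WAIT
tcp        0      0 10.0.0.5:443            10.0.0.7:40222          ESTABLISHED
udp        0      0 0.0.0.0:68              0.0.0.0:*
"""

for state, total in count_tcp_states(SAMPLE).most_common():
    print(state, total)
```

On a live machine the same function could be fed the captured output of 'netstat -an' every few seconds, which makes it easier to see which states grow just before and during a stall.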

There is nothing at all in Apache's error logs. I've tried using 'ab' to stress test Apache – it appears to cause the intermittent stall to happen slightly sooner, but that's all I can really gauge from it. The max connections for Apache and MySQL are both set to high values – actual concurrent connections never come close except during the stall when the Apache connections max out.

I'm not really sure what else to try. Any ideas or pointers on things I might be missing here?

–edit–

A couple of extra observations. As the stall is occurring, I notice the number of connections in the ESTABLISHED state rise considerably, then the number in CLOSE_WAIT follows a few seconds later.

Also, when the stall occurs, the counts of 'times the listen queue of a socket overflowed' and 'SYNs to LISTEN sockets ignored' both increase quite rapidly. During the intervals between stalls, these numbers do not change at all.
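
Those two messages come from 'netstat -s'. As far as I know they correspond to the ListenOverflows and ListenDrops counters in the TcpExt section of /proc/net/netstat, so the rate of change can be measured by diffing two snapshots. The sketch below assumes that mapping; the function names and the truncated sample snapshots are made up for illustration.

```python
def parse_tcpext(proc_net_netstat: str) -> dict:
    """Return the TcpExt counters from /proc/net/netstat style text."""
    lines = proc_net_netstat.splitlines()
    # The file holds pairs of lines: a row of counter names, then a row of values.
    for names_row, values_row in zip(lines, lines[1:]):
        if names_row.startswith("TcpExt:") and values_row.startswith("TcpExt:"):
            names = names_row.split()[1:]
            values = values_row.split()[1:]
            return dict(zip(names, (int(v) for v in values)))
    return {}


def listen_queue_delta(before: str, after: str) -> dict:
    """Change in the two listen-queue counters between two snapshots."""
    old, new = parse_tcpext(before), parse_tcpext(after)
    return {
        name: new.get(name, 0) - old.get(name, 0)
        for name in ("ListenOverflows", "ListenDrops")
    }


# Made-up, heavily truncated snapshots (a real file has many more columns).
BEFORE = (
    "TcpExt: SyncookiesSent ListenOverflows ListenDrops\n"
    "TcpExt: 0 120 120\n"
    "IpExt: InOctets OutOctets\n"
    "IpExt: 1000 2000\n"
)
AFTER = (
    "TcpExt: SyncookiesSent ListenOverflows ListenDrops\n"
    "TcpExt: 0 450 470\n"
    "IpExt: InOctets OutOctets\n"
    "IpExt: 1500 2600\n"
)

print(listen_queue_delta(BEFORE, AFTER))
```

On a real system the two snapshots would be the contents of /proc/net/netstat read a few seconds apart. A listen queue generally overflows when the application is not accepting new connections quickly enough, so a jump that appears only during a stall fits with Apache's workers being blocked.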

I'm not sure whether these numbers are a cause or a consequence of the stall. Any further help would be much appreciated.

Best Answer

I've now resolved this so I'm posting the solution in case others experience the same issue.

I neglected to mention that all of our web traffic goes over HTTPS, and that appears to be the cause. During a stall I used strace and pstack to see what one of the idle Apache processes was doing. It was stuck waiting on a mutex for the SSL session cache.

Looking at the Apache config, I noticed we had SSLSessionCache enabled with a timeout of 5 minutes. Disabling it fixed the problem.
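
For anyone checking their own setup, here is a minimal sketch that lists the session-cache-related mod_ssl directives found in a config file's text. The helper name is hypothetical, and the sample config is an assumption based on typical Apache 2.2 defaults, not the actual config from this server; the only detail taken from above is the 300-second (5 minute) timeout.

```python
import re

# mod_ssl directives related to the session cache in Apache 2.2.
DIRECTIVES = ("SSLSessionCache", "SSLSessionCacheTimeout", "SSLMutex")

# Match an uncommented directive at the start of a line, followed by its value.
_PATTERN = re.compile(
    r"^\s*(%s)\s+(.*\S)\s*$" % "|".join(DIRECTIVES), re.IGNORECASE
)


def find_ssl_cache_directives(config_text: str) -> list:
    """Return (line_number, directive, value) for each active directive."""
    found = []
    for number, line in enumerate(config_text.splitlines(), 1):
        match = _PATTERN.match(line)
        if match:
            found.append((number, match.group(1), match.group(2)))
    return found


# Illustrative sample only; paths and values are assumptions.
SAMPLE = """\
# SSLSessionCache none
SSLSessionCache shmcb:/var/cache/mod_ssl/scache(512000)
SSLSessionCacheTimeout 300
SSLMutex default
"""

for number, directive, value in find_ssl_cache_directives(SAMPLE):
    print(number, directive, value)
```

Commented-out lines are skipped, so the output shows only what is actually in effect in that file. Included files would need to be scanned separately.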

My guess is that the session cache was filling up, then Apache was waiting for older sessions to time out before continuing.
