Random lag in Apache before serving files


I've been trying to troubleshoot some random lag that happens in fewer than 3% of the requests I send to my server. I wrote a testing script that runs on a separate server, fetches pages with PHP's file_get_contents, and records timestamps at each step to time how long things take. I've tested both static HTML and PHP files on the serving end and see the same random lag.

My issue is that, at random, a static page that should take 0.1 seconds to retrieve takes 3 seconds, and sometimes 5. On rare occasions it takes 2 or 4 seconds.

I've investigated the possible causes listed below, but I can't identify the problem. Using timestamps, I've determined that the lag happens before Apache even serves the file: the timestamp in the Apache log matches the start of the file and is very close to the time my test script received the file back.
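One way to check where the delay falls is to time the TCP connection setup separately from the HTTP exchange. The Python sketch below is only an illustration, not the PHP test script described above. The function name `time_phases` and its parameters are made up for this example, and it assumes plain HTTP without TLS. A large first value would suggest the delay occurs while the connection is being established, which the operating system handles before Apache accepts the request.

```python
import socket
import time


def time_phases(host, port=80, path="/", timeout=10.0):
    """Return (connect_seconds, response_seconds) for one plain HTTP GET."""
    t0 = time.perf_counter()
    # Phase 1: TCP handshake only; no HTTP data has been sent yet.
    sock = socket.create_connection((host, port), timeout=timeout)
    t1 = time.perf_counter()
    try:
        request = (
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n\r\n"
        )
        sock.sendall(request.encode("ascii"))
        # Phase 2: read until the server closes the connection.
        while sock.recv(65536):
            pass
    finally:
        sock.close()
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1
```

Calling `time_phases("www.example.com")` (a placeholder host) returns the two durations in seconds, so slow requests can be sorted by the phase in which the time was lost.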

My test script sends 5 simultaneous requests to the test page and then launches 5 more once that batch has finished.
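For readers who want to reproduce this kind of test, here is a rough Python sketch of the batching pattern. It is not the original PHP script; the names `timed_fetch`, `run_batches`, and `slow_fraction`, and the 1-second threshold, are assumptions chosen for illustration.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def timed_fetch(url, timeout=30.0):
    """Fetch one URL and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def run_batches(url, batches=10, concurrency=5, fetch=timed_fetch):
    """Send `concurrency` requests at once, wait for all of them, then repeat."""
    durations = []
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(batches):
            # Consuming pool.map blocks until every request in the batch is done.
            durations.extend(pool.map(fetch, [url] * concurrency))
    return durations


def slow_fraction(durations, threshold=1.0):
    """Share of requests that took longer than `threshold` seconds."""
    if not durations:
        return 0.0
    return sum(d > threshold for d in durations) / len(durations)
```

A run would look like `slow_fraction(run_batches("http://www.example.com/test.html"))`, where the URL is a placeholder. A request that fails outright raises an exception in this sketch, so counting dropped pages would need extra error handling.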

The server that lags has no abnormal load while the script is running. CPU stays under 10% and at least 1 GB of memory is free at all times. Increasing the number of simultaneous connections increases the percentage of requests that lag. During an extended test, a page is occasionally dropped completely: out of 1 million pages tested, only 10 were dropped.

The only firewall I can't disable is the one in the router. It's a fairly nice router with some built-in DoS attack prevention. Could the number of requests hitting it from the same IP be triggering its DoS protection, so that the router holds requests up?

Any advice or things to look at would be appreciated. Listed below are my conclusions and comments.

1) DNS lookups are not happening. I watched with netstat to confirm this.
2) Server resources are not used up.
3) The testing script is on a different server and network than the lagging server.
4) Firewalls should not be an issue. The same result happens with the firewall disabled.
5) I've fooled around with the Apache configuration, increasing max clients and other settings. Nothing has gotten rid of the issue. If anyone has experienced this issue and has a configuration they know helps, I'd be happy to try it.
6) I've tested keep-alive on and off with no apparent difference.
7) I've tested both Apache MPMs and the same lag occurs.
8) The %D value in the Apache log is no higher for a page that takes 3 seconds to load than for one that takes 0.1 seconds.

I find it really hard to believe that no-one on the internet has experienced this issue, but I've been searching for 3 days straight and cannot find a solution. Any help would be greatly appreciated. Thank you.

Best Answer

In addition to your test script, try running a benchmarking tool such as ApacheBench (ab) or Siege. Run both the benchmark and your test script on the server itself as well. If it is a router or network issue, the tests run locally on the server should show no lag. You can vary the concurrency setting in ab or Siege to see whether the issue happens more often at higher concurrency. Test with a variety of files and file types to see whether that makes any difference.
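ab and Siege are the better tools for this, but the comparison itself can be sketched in a few lines of Python. Everything here is a hypothetical illustration: `sweep` is an invented name, `fetch` is any function you supply that requests a target and returns the elapsed seconds, and the concurrency levels and 1-second threshold are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def sweep(fetch, targets, levels=(1, 5, 10, 20), requests_per_level=100, threshold=1.0):
    """Return {(target, concurrency): fraction of requests slower than threshold}."""
    if requests_per_level < 1:
        raise ValueError("requests_per_level must be at least 1")
    results = {}
    for target in targets:
        for level in levels:
            # At most `level` requests are in flight at the same time.
            with ThreadPoolExecutor(max_workers=level) as pool:
                durations = list(pool.map(fetch, [target] * requests_per_level))
            slow = sum(d > threshold for d in durations)
            results[(target, level)] = slow / len(durations)
    return results
```

Passing one target that goes through the router and one that points at localhost on the server gives a table of lag rates. If only the remote rows show slow requests, and the rate climbs with concurrency, that points toward the network path rather than Apache.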

The key to most troubleshooting is replicating the issue and then narrowing down the cause. Since you describe it as "random", I would focus on replicating the issue as reliably as possible. If you cannot replicate the issue, you won't truly know whether you have actually solved it in the end.

It may in fact be the case that "no-one on the internet has experienced this issue". There is a virtually infinite number of hardware, software, and application configurations, which makes issues like this difficult to diagnose. People may also use a different description for the same problem: what you call "random Apache lag" someone else may call a "remote Ethernet performance issue".