How to trace a high-CPU Apache process to a particular virtualhost

apache-2.2, cpu-usage, virtualhost

Looking at the output of top, I notice that one or two Apache processes intermittently consume a lot of CPU, anywhere between 50% and 90%.

The spikes in CPU usage come and go quite quickly, every 10 seconds or so.

Various other Apache processes are running as well, and these consume between 2% and 4%.

I've researched all the various ways of trying to track down which virtualhost/website is responsible for these processes. However, because they come and go quickly I can't find a reliable way of doing this.

I've tried lsof and also looking at the output of server-status but because the processes don't last long the process ID gets re-used and it's not possible to tie it down to the virtualhost that's causing the issue.

For example, if I run lsof on the process ID in question, it lists a dozen different virtualhost log files which have shared that process ID in the last few seconds. I'm convinced there is one virtualhost at fault but I can't figure out which one.

I've also checked the MySQL slow query log and this doesn't reveal anything of interest.

Best Answer

My recommendation: add response time to your logs.

It's not perfect, as there's no guarantee that the spike-causing requests take longer to serve than others, but it is likely, and gives you a starting point for investigation.

To do this, you'll want to define a new LogFormat, and a CustomLog that uses it, which includes the %D format string (the time taken to serve the request, in microseconds). See the Apache mod_log_config documentation.
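Once %D is in the logs, a small script can total the time per virtualhost. The following is only a minimal sketch: it assumes a hypothetical format of `%v %h %l %u %t "%r" %>s %b %D` (the vhost name first, %D last), and the function names are made up for illustration. Adjust the regex to whatever LogFormat you actually define.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) log format, not taken from the answer above:
#   LogFormat "%v %h %l %u %t \"%r\" %>s %b %D" vhost_timed
# %v is the virtualhost's ServerName, %D the request time in microseconds.
LINE_RE = re.compile(
    r'^(?P<vhost>\S+) \S+ \S+ \S+ \[[^\]]+\] '
    r'"(?P<request>(?:[^"\\]|\\.)*)" '
    r'\d{3} \S+ (?P<micros>\d+)$'
)


def summarize(lines):
    """Return {vhost: (request_count, total_micros, max_micros)}."""
    stats = defaultdict(lambda: [0, 0, 0])
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        micros = int(match.group("micros"))
        entry = stats[match.group("vhost")]
        entry[0] += 1
        entry[1] += micros
        entry[2] = max(entry[2], micros)
    return {vhost: tuple(values) for vhost, values in stats.items()}


def rank_by_total_time(stats):
    """Sort vhosts by total time spent serving requests, largest first."""
    return sorted(stats.items(), key=lambda item: item[1][1], reverse=True)
```

You could feed it an open log file, e.g. `rank_by_total_time(summarize(open(path)))`, where `path` points at the log written by your new CustomLog. Keep in mind that %D is wall-clock time, not CPU time, so a vhost at the top of the list is a lead to follow up rather than proof.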

Another option, which is probably a bit too low-level but could give you an idea of the nature of the load, is to strace the Apache parent process with -f to follow children and -c to print a summary of the time, call count and errors for each system call, e.g. strace -f -c -p <apache parent pid>
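If you save the -c summary to a file (strace writes it to stderr, or to the file given with -o), a few lines of Python can rank the system calls by time. This is a rough sketch that assumes the usual column layout of the summary table (% time, seconds, usecs/call, calls, errors, syscall); the function name is hypothetical, and the layout may differ slightly between strace versions.

```python
def parse_strace_summary(text):
    """Parse the table printed by `strace -c`, sorted by seconds descending."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 5 fields (no errors) or 6 fields (with errors).
        # This skips the header, attach messages, and the "total" row.
        if len(fields) not in (5, 6) or fields[-1] == "total":
            continue
        try:
            percent = float(fields[0])
            seconds = float(fields[1])
            usecs_per_call = int(fields[2])
            calls = int(fields[3])
            errors = int(fields[4]) if len(fields) == 6 else 0
        except ValueError:
            continue  # separator lines of dashes and other non-data rows
        rows.append({
            "syscall": fields[-1],
            "percent": percent,
            "seconds": seconds,
            "usecs_per_call": usecs_per_call,
            "calls": calls,
            "errors": errors,
        })
    return sorted(rows, key=lambda row: row["seconds"], reverse=True)
```

The first entry of the returned list is the system call that accounted for the most time in the summary.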

Once you know which system calls are taking the most time, you can trace them directly. For example, if the server is spending a lot of time in write(), you could run strace -f -e trace=write -p <apache parent pid> and look at those calls in more detail.