Why is response time exploding when request frequency drops

apache-2.4 high-load ubuntu-14.04


Correction: the response time (%D) is in microseconds (µs), not milliseconds!

This doesn't change anything about the strangeness of the pattern, but it does mean it is far less severe in practice.


Why is response time inversely correlated to request frequency?

Shouldn't the server respond faster when it is less busy handling requests?

Any suggestion how to make Apache "take advantage" of less load?

[Graph: requests per minute and average response time over time – response time rises during the overnight low-traffic periods]

This pattern is periodic: it shows up whenever the request frequency drops below about 200 requests per minute, which happens every night (due to natural user activity) from late night to early morning.


The requests are very simple POSTs sending a JSON payload of fewer than 1000 characters. The JSON is simply appended to a text file – that's it. The reply is just "-".

The data shown in the graphs was logged with Apache itself:

# %D = time to serve the request in microseconds, %k = keepalive requests on the connection,
# %I / %O = bytes received / sent including headers (both need mod_logio)
LogFormat "%{%Y-%m-%d+%H:%M:%S}t %k %D %I %O" performance
CustomLog "/var/log/apache2/performance.log" performance
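
For reference, per-minute request counts and average response times can be pulled straight out of this log – a sketch in awk, assuming the field order of the LogFormat above and that nothing else writes to the file:

# Requests per minute and mean %D (µs); substr($1,1,16) keeps "YYYY-mm-dd+HH:MM"
awk '{ m = substr($1, 1, 16); n[m]++; t[m] += $3 }
     END { for (m in n) printf "%s  %d req  %.0f us avg\n", m, n[m], t[m]/n[m] }' \
    /var/log/apache2/performance.log | sort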

Best Answer

This is common behavior in data centers. The times when your response time is slow correspond to what is commonly called the batch window: a period when user activity is expected to be low and batch processes can be run. Backups are also done during this period. These activities can strain the resources of servers and networks, causing performance issues like the ones you are seeing.

There are a few resources that can cause issues (quick live checks for each are sketched after this list):

  • High CPU load. This can cause Apache to wait for a time slice to process the request.
  • High memory usage. This can flush buffers that enable Apache to serve resources without reading them from disk. It can also cause paging/swapping of Apache workers.
  • High disk activity. This can cause disk I/O activity to be queued with corresponding delays in serving content.
  • High network activity. This can cause packets to be queued for transmission, increase retries and otherwise degrade service.
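
A live snapshot of all four during the slow window can be taken with the stock tools – a sketch, with arbitrary interval and sample count (iostat is part of the sysstat package):

vmstat 5 12       # run queue, swap in/out, block I/O and CPU idle/iowait – 12 samples, 5 s apart
iostat -x 5 12    # per-device utilization, queue size and average wait times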

I use sar to investigate issues like this. atsar can be used to gather sar data into daily data files, which can then be examined to see what the system behavior is like during the daytime, when performance is normal, and overnight, when performance is variable.
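
On Ubuntu 14.04 the sysstat collector writes its daily files under /var/log/sysstat/, so the slow overnight window can be inspected after the fact – a sketch, with the file name (saDD) and the time range as placeholders:

sar -u -s 02:00:00 -e 06:00:00 -f /var/log/sysstat/sa25      # CPU utilization and %iowait
sar -r -s 02:00:00 -e 06:00:00 -f /var/log/sysstat/sa25      # memory and cache usage
sar -B -s 02:00:00 -e 06:00:00 -f /var/log/sysstat/sa25      # paging activity
sar -b -s 02:00:00 -e 06:00:00 -f /var/log/sysstat/sa25      # disk I/O (tps, reads/writes)
sar -n DEV -s 02:00:00 -e 06:00:00 -f /var/log/sysstat/sa25  # network throughput per interface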

If you are monitoring the system with munin or some other system that gathers and graphs resource utilization, you may find some indicators there. I still find sar more precise.

There are tools like nice and ionice that can be applied to batch processes to minimize their impact. However, they are only effective for CPU and I/O issues; they are unlikely to resolve problems with memory or network activity.
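
For example, a nightly job could be started at the lowest CPU and I/O priority (the script path is illustrative):

nice -n 19 ionice -c2 -n7 /usr/local/bin/nightly-batch.sh   # lowest CPU nice, lowest best-effort I/O priority
# ionice -c3 (idle class) goes further: I/O is only issued when the disk is otherwise idle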

Moving backup activity to a separate network can reduce network contention. Some backup software can be configured to limit the bandwidth it will use, which could resolve or reduce the contention.
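
As an illustration, if the backup happens to be rsync-based, its --bwlimit option caps throughput; most other backup tools have a similar knob (paths and the limit are placeholders):

rsync -a --bwlimit=10000 /srv/data/ backup-host:/backups/data/   # roughly a 10 MB/s ceiling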

Depending on how the batch processes are triggered, you may be able to limit the number of them running in parallel. This may actually improve the performance of the batch processes themselves, as they are likely suffering from the same resource contention.
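
A simple way to do this for cron-driven jobs is to serialize them behind a single lock with flock, so only one batch process runs at a time (the job names are illustrative):

0 2 * * *  flock /var/lock/batch.lock /usr/local/bin/backup.sh
30 3 * * * flock /var/lock/batch.lock /usr/local/bin/reindex.sh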
