PHP – Server load spikes several times a day; load average for the past month is 5 times the average for the rest of the year

apache-2.2 debian lamp php

The Munin notifications I set up for our (Debian) LAMP cluster have been warning me continuously that the load on our production machine is at dangerous levels. Over the year the load has typically run between 2 and 8, but in the past month, and only the past month, it has been spiking to 10, 18, and occasionally even 50-60. The spikes last only 5-10 minutes at a time and occur about every 2-3 hours. They do not affect performance, but only because I have a script that diverts traffic from our server to a mirror CDN when the load goes above 10. I've looked for cron jobs that correlate with this timeframe, but I can't see anything that would cause this. Site traffic is also normal (we receive about 200K visits per day). The MySQL database this web server relies on seems to be performing normally: the load on that server is low and performance is good.

I'm also trying to think of anything I've changed around the time this problem began, and I really cannot think of anything.

This is probably not much to go on. Maybe there is a clue in the top print-out (below) that I'm not seeing.

How do I proceed to find the cause?


Typical top when the load is NOT spiking:

top - 11:13:09 up 472 days, 25 min,  1 user,  load average: 6.08, 4.29, 3.80
Tasks: 105 total,   1 running, 104 sleeping,   0 stopped,   0 zombie
Cpu(s): 41.2%us,  5.8%sy,  0.0%ni, 49.5%id,  2.7%wa,  0.1%hi,  0.7%si,  0.0%st
Mem:   3369592k total,  2166980k used,  1202612k free,   559504k buffers
Swap:  2650684k total,     1892k used,  2648792k free,  1129116k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
32046 apache    15   0 36300  12m 9828 S   20  0.4   0:01.97 apache2
32679 apache    15   0 36568  13m  10m S   19  0.4   0:01.69 apache2
31441 apache    15   0 36616  13m  10m S   19  0.4   0:04.13 apache2
31477 apache    15   0 36596  13m 9.8m S   15  0.4   0:01.99 apache2
31993 apache    15   0 36876  16m  12m S   12  0.5   0:02.01 apache2
31782 apache    15   0 36836  14m  10m S    8  0.4   0:02.17 apache2
32198 apache    15   0 36536  13m  10m S    7  0.4   0:01.59 apache2
  880 apache    15   0 36508 9708 6236 S    7  0.3   0:00.42 apache2
31945 apache    17   0 36876  16m  13m S    5  0.5   0:03.17 apache2
32197 apache    16   0 36636  10m 7504 S    5  0.3   0:02.70 apache2
32326 apache    15   0 37024  11m 7632 S    5  0.3   0:02.15 apache2
32565 apache    15   0 37280  13m 9.8m S    5  0.4   0:03.75 apache2
32676 apache    15   0 36896  16m  12m S    4  0.5   0:00.95 apache2
32678 apache    15   0 36536  12m 9692 S    4  0.4   0:02.27 apache2
  974 apache    16   0 37064 9888 6016 D    4  0.3   0:00.13 apache2
32150 apache    16   0 36832  13m  10m S    3  0.4   0:01.74 apache2
31780 apache    16   0 36848  11m 7660 S    3  0.3   0:02.87 apache2

And here is a top when we are spiking:

top - 15:25:22 up 474 days,  4:37,  1 user,  load average: 78.73, 50.20, 24.79
Tasks: 250 total,   4 running, 244 sleeping,   0 stopped,   2 zombie
Cpu(s): 36.5%us,  4.7%sy,  0.0%ni, 56.4%id,  2.0%wa,  0.1%hi,  0.3%si,  0.0%st
Mem:   3369592k total,  2099904k used,  1269688k free,   553840k buffers
Swap:  2650684k total,     5104k used,  2645580k free,   729252k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27716 apache    15   0 43612  20m 9.8m S   20  0.6   0:01.95 apache2
16782 apache    16   0 39460  19m  13m R   19  0.6   0:04.61 apache2
19701 apache    15   0 39232  16m  10m S   17  0.5   0:03.18 apache2
19677 apache    16   0 39208  15m 9956 R   12  0.5   0:05.03 apache2
16760 apache    15   0 36620  16m  13m S    8  0.5   0:06.35 apache2
19798 apache    15   0 36564  13m 9988 S    6  0.4   0:02.76 apache2
20325 apache    15   0 36616  13m 9704 S    6  0.4   0:02.11 apache2
19699 apache    15   0 36860  15m  12m S    5  0.5   0:03.10 apache2
15109 apache    15   0 36624  16m  13m S    4  0.5   0:05.97 apache2
15101 apache    15   0 36592  13m  10m S    3  0.4   0:08.96 apache2
15112 apache    15   0 36612  16m  13m S    3  0.5   0:07.57 apache2
20204 apache    15   0 44612  21m 9.9m S    3  0.6   0:03.55 apache2
19624 apache    15   0 36588  13m  10m S    3  0.4   0:02.00 apache2
20151 apache    15   0 36616  16m  13m S    3  0.5   0:02.14 apache2
26252 apache    15   0 37072  13m   9m S    3  0.4   0:01.09 apache2
19805 apache    15   0 36472  16m  12m S    2  0.5   0:03.68 apache2
20163 apache    15   0 36640  13m  10m S    2  0.4   0:02.50 apache2
27260 apache    18   0 44292  20m 9328 S    2  0.6   0:02.08 apache2
29149 apache    15   0 36172  11m 8744 S    2  0.4   0:00.69 apache2
20315 apache    15   0 36360  15m  12m S    2  0.5   0:02.06 apache2
29148 apache    16   0 36184 8872 5644 S    2  0.3   0:00.08 apache2

Best Answer

Loadavg doesn't tell you much, really, about whether your system is underperforming; it's a very general metric that describes how busy your system is, where busy is roughly the average number of processes that are either running or waiting to run on a CPU (on Linux, processes in uninterruptible sleep, usually waiting on disk I/O, are counted as well). On an eight-core system where the workload consists of high-volume, short-lived processes (like, say, a web server), a loadavg over 50 might not even get my attention.
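To put a raw loadavg figure in context, it can help to divide it by the number of cores and to see which process states are contributing. The following is only a rough sketch, assuming a Linux box with `/proc` mounted; the function names (`parse_loadavg`, `load_per_core`, `count_states`) are made up for illustration and are not part of any tool mentioned in the question.

```python
import os


def parse_loadavg(text):
    """Parse /proc/loadavg text into (1min, 5min, 15min, runnable, total)."""
    fields = text.split()
    one, five, fifteen = (float(x) for x in fields[:3])
    # Fourth field looks like "4/250": runnable entities / total entities.
    runnable, total = (int(x) for x in fields[3].split("/"))
    return one, five, fifteen, runnable, total


def load_per_core(load, cores):
    """Load normalised by core count; values well above 1.0 suggest queuing."""
    return load / cores


def count_states(stat_lines):
    """Count process states (R, S, D, Z, ...) from /proc/<pid>/stat contents."""
    counts = {}
    for line in stat_lines:
        # The command name sits in parentheses and may contain spaces,
        # so take the first field after the LAST ')'.
        state = line.rsplit(")", 1)[1].split()[0]
        counts[state] = counts.get(state, 0) + 1
    return counts


def main():
    with open("/proc/loadavg") as f:
        one, five, fifteen, runnable, total = parse_loadavg(f.read())
    cores = os.cpu_count() or 1
    stat_lines = []
    for pid in os.listdir("/proc"):
        if pid.isdigit():
            try:
                with open("/proc/%s/stat" % pid) as f:
                    stat_lines.append(f.read())
            except OSError:
                pass  # process exited between listdir() and open()
    print("load1=%.2f per_core=%.2f" % (one, load_per_core(one, cores)))
    print("states:", count_states(stat_lines))


if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    main()
```

Run during a spike, a large `D` count would point towards I/O waits, while a large `R` count would point towards CPU contention.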

Can you correlate these spikes with your apache logs to see whether response times suffer during the spike periods? Are you just serving more requests during the spikes? Do you keep stats on things like iowait and user vs system cpu, and do they correlate? The other poster who mentioned swapping is correct: swapping can cause processes to pile up as memory access slows down to disk speeds, which can lead to higher loadavg as processes hang around.
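One way to check the Apache side is to bucket the access log by minute and compare request counts and average response times against the spike times Munin reports. This is a minimal sketch, not a definitive tool: it assumes the `LogFormat` has been extended so that `%D` (time taken to serve the request, in microseconds) is the last field on each line, which is not the default, and `summarize` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Assumed line shape: ... [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.1" 200 2326 ... 152340
# where the trailing integer is %D (microseconds). This is an assumption
# about the log format, not something stated in the question.
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\].*\s(?P<usec>\d+)\s*$')


def summarize(lines):
    """Return {minute: (request_count, mean_response_ms)} for the given log lines."""
    buckets = defaultdict(list)
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip lines that do not match the assumed format
        minute = m.group("ts")[:17]  # e.g. "10/Oct/2000:13:55"
        buckets[minute].append(int(m.group("usec")))
    return {
        minute: (len(times), sum(times) / len(times) / 1000.0)
        for minute, times in buckets.items()
    }
```

Passing an open log file to `summarize` and printing the result in key order gives a per-minute table: if request counts jump during a spike, it is a traffic question; if counts are flat but response times climb, something on the server is slowing each request down.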

These are all things to investigate; more data, and data kept historically, can help you solve this problem. Hope this helps; good luck!
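If nothing is recording CPU breakdowns yet, even a small sampler can fill the gap until proper graphs are in place. The sketch below assumes Linux, reads the aggregate `cpu` line from `/proc/stat` twice, and reports what share of the interval went to user, system, iowait and so on; the function names and the CSV layout are illustrative choices, not an established convention.

```python
import time

FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line of /proc/stat into a dict of jiffy counters."""
    parts = line.split()
    values = [int(x) for x in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Percentage of elapsed CPU time spent in each state between two samples."""
    deltas = {k: after[k] - before[k] for k in before}
    total = sum(deltas.values())
    if total <= 0:
        return {k: 0.0 for k in deltas}
    return {k: 100.0 * d / total for k, d in deltas.items()}


def read_cpu():
    with open("/proc/stat") as f:
        return parse_cpu_line(f.readline())  # first line is the aggregate


def sample_forever(path, interval=10):
    """Append 'epoch,load1,user%,system%,iowait%' to a CSV every `interval` seconds."""
    prev = read_cpu()
    while True:
        time.sleep(interval)
        cur = read_cpu()
        pct = cpu_percentages(prev, cur)
        with open("/proc/loadavg") as f:
            load1 = f.read().split()[0]
        with open(path, "a") as out:
            out.write("%d,%s,%.1f,%.1f,%.1f\n" % (
                time.time(), load1, pct["user"], pct["system"], pct["iowait"]))
        prev = cur
```

Calling `sample_forever` with an output path of your choosing from a background process for a few days should be enough to see whether iowait or user CPU rises along with the load during the spikes.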