The Munin notifications I set up for our (Debian) LAMP cluster have been alerting me continuously that the load on our production machine is at dangerous levels. Load typically runs between 2 and 8 all year, but in the past month, and only the past month, it has been skyrocketing to 10, 18, and occasionally even 50-60. The spikes last only 5-10 minutes at a time and occur about every 2-3 hours. They do not affect performance only because I have a script that sends traffic off our server to a mirror CDN when the load goes above 10. I've looked for cron jobs that correlate with this timeframe, but I can't see anything that would cause this. Site traffic is also normal (we receive about 200K visits per day). The MySQL database this web server relies on seems to be performing normally: the load on that server is low and performance is good.
I'm also trying to think of anything I've changed around the time this problem began, and I really cannot think of anything.
This is probably not much to go on. Maybe there is a clue in the top print-out (below) that I'm not seeing.
How do I proceed to find the cause?
—
Typical top when the load is NOT spiking:
top - 11:13:09 up 472 days, 25 min, 1 user, load average: 6.08, 4.29, 3.80
Tasks: 105 total, 1 running, 104 sleeping, 0 stopped, 0 zombie
Cpu(s): 41.2%us, 5.8%sy, 0.0%ni, 49.5%id, 2.7%wa, 0.1%hi, 0.7%si, 0.0%st
Mem: 3369592k total, 2166980k used, 1202612k free, 559504k buffers
Swap: 2650684k total, 1892k used, 2648792k free, 1129116k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
32046 apache 15 0 36300 12m 9828 S 20 0.4 0:01.97 apache2
32679 apache 15 0 36568 13m 10m S 19 0.4 0:01.69 apache2
31441 apache 15 0 36616 13m 10m S 19 0.4 0:04.13 apache2
31477 apache 15 0 36596 13m 9.8m S 15 0.4 0:01.99 apache2
31993 apache 15 0 36876 16m 12m S 12 0.5 0:02.01 apache2
31782 apache 15 0 36836 14m 10m S 8 0.4 0:02.17 apache2
32198 apache 15 0 36536 13m 10m S 7 0.4 0:01.59 apache2
880 apache 15 0 36508 9708 6236 S 7 0.3 0:00.42 apache2
31945 apache 17 0 36876 16m 13m S 5 0.5 0:03.17 apache2
32197 apache 16 0 36636 10m 7504 S 5 0.3 0:02.70 apache2
32326 apache 15 0 37024 11m 7632 S 5 0.3 0:02.15 apache2
32565 apache 15 0 37280 13m 9.8m S 5 0.4 0:03.75 apache2
32676 apache 15 0 36896 16m 12m S 4 0.5 0:00.95 apache2
32678 apache 15 0 36536 12m 9692 S 4 0.4 0:02.27 apache2
974 apache 16 0 37064 9888 6016 D 4 0.3 0:00.13 apache2
32150 apache 16 0 36832 13m 10m S 3 0.4 0:01.74 apache2
31780 apache 16 0 36848 11m 7660 S 3 0.3 0:02.87 apache2
And here is a top when we are spiking:
top - 15:25:22 up 474 days, 4:37, 1 user, load average: 78.73, 50.20, 24.79
Tasks: 250 total, 4 running, 244 sleeping, 0 stopped, 2 zombie
Cpu(s): 36.5%us, 4.7%sy, 0.0%ni, 56.4%id, 2.0%wa, 0.1%hi, 0.3%si, 0.0%st
Mem: 3369592k total, 2099904k used, 1269688k free, 553840k buffers
Swap: 2650684k total, 5104k used, 2645580k free, 729252k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27716 apache 15 0 43612 20m 9.8m S 20 0.6 0:01.95 apache2
16782 apache 16 0 39460 19m 13m R 19 0.6 0:04.61 apache2
19701 apache 15 0 39232 16m 10m S 17 0.5 0:03.18 apache2
19677 apache 16 0 39208 15m 9956 R 12 0.5 0:05.03 apache2
16760 apache 15 0 36620 16m 13m S 8 0.5 0:06.35 apache2
19798 apache 15 0 36564 13m 9988 S 6 0.4 0:02.76 apache2
20325 apache 15 0 36616 13m 9704 S 6 0.4 0:02.11 apache2
19699 apache 15 0 36860 15m 12m S 5 0.5 0:03.10 apache2
15109 apache 15 0 36624 16m 13m S 4 0.5 0:05.97 apache2
15101 apache 15 0 36592 13m 10m S 3 0.4 0:08.96 apache2
15112 apache 15 0 36612 16m 13m S 3 0.5 0:07.57 apache2
20204 apache 15 0 44612 21m 9.9m S 3 0.6 0:03.55 apache2
19624 apache 15 0 36588 13m 10m S 3 0.4 0:02.00 apache2
20151 apache 15 0 36616 16m 13m S 3 0.5 0:02.14 apache2
26252 apache 15 0 37072 13m 9m S 3 0.4 0:01.09 apache2
19805 apache 15 0 36472 16m 12m S 2 0.5 0:03.68 apache2
20163 apache 15 0 36640 13m 10m S 2 0.4 0:02.50 apache2
27260 apache 18 0 44292 20m 9328 S 2 0.6 0:02.08 apache2
29149 apache 15 0 36172 11m 8744 S 2 0.4 0:00.69 apache2
20315 apache 15 0 36360 15m 12m S 2 0.5 0:02.06 apache2
29148 apache 16 0 36184 8872 5644 S 2 0.3 0:00.08 apache2
Best Answer
Loadavg doesn't really tell you much about whether your system is underperforming. It's a very general metric of how busy your system is, where "busy" is defined as an index of the number of processes that are currently either executing or waiting to execute on a CPU (on Linux it also counts processes in uninterruptible sleep, which are usually waiting on disk or network I/O). On an eight-core system whose workload consists of high-volume, short-lived processes (like, say, a web server), a loadavg over 50 might not even get my attention.
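To put a raw load figure in context, it can help to divide it by the number of cores. The following is only a minimal sketch: the function names are made up for illustration, the sample line is built from the spiking top output in the question (its last field, the most recent PID, is invented), and the core count of 8 is an assumption, since the question does not say how many cores the machine has.

```python
def parse_loadavg(text):
    """Parse a /proc/loadavg line into (1min, 5min, 15min, runnable, total)."""
    fields = text.split()
    one, five, fifteen = (float(x) for x in fields[:3])
    # The fourth field looks like "4/250": runnable entities / total entities.
    runnable, total = (int(x) for x in fields[3].split("/"))
    return one, five, fifteen, runnable, total


def load_per_core(load, cores):
    """Normalise a load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


# Sample values taken from the spiking top output; the trailing PID is invented.
sample = "78.73 50.20 24.79 4/250 29149"
one_min, five_min, fifteen_min, runnable, total = parse_loadavg(sample)

ASSUMED_CORES = 8  # assumption: the real core count is not given in the question
print("1-min load per core: %.2f" % load_per_core(one_min, ASSUMED_CORES))
print("runnable right now: %d of %d" % (runnable, total))
```

On a live machine the same line can be read from /proc/loadavg. A large gap between the load figure and the number of runnable processes suggests that many of the counted processes are waiting rather than using CPU.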
Can you correlate these spikes with your Apache logs to see whether response times suffer during the spike periods? Are you just serving more requests during the spikes? Do you keep stats on things like iowait and user vs. system CPU, and do they correlate? The other poster who mentioned swapping is correct: swapping can cause processes to pile up as memory access slows down to disk speeds, which can lead to a higher loadavg as processes hang around.
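As a rough sketch of the log correlation, the script below buckets access-log lines by minute and reports the request count and mean response time for each bucket. It assumes the log uses the combined format with %D (time taken, in microseconds) appended as the last field. That is not Apache's default, so the LogFormat would need to be changed first. The sample lines, dates, and function name are made up for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed line shape: "... [10/Oct/2011:15:25:01 +0000] ... <microseconds>"
LINE_RE = re.compile(r"\[(?P<ts>[^\]]+)\].*\s(?P<usec>\d+)$")


def per_minute_stats(lines):
    """Return {minute: (request_count, mean_response_ms)} for the given lines."""
    buckets = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        stamp = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        buckets[stamp.strftime("%Y-%m-%d %H:%M")].append(int(match.group("usec")))
    return {
        minute: (len(times), sum(times) / len(times) / 1000.0)
        for minute, times in sorted(buckets.items())
    }


# Invented sample lines in the assumed format.
sample_lines = [
    '1.2.3.4 - - [10/Oct/2011:15:25:01 +0000] "GET / HTTP/1.1" 200 512 "-" "ua" 120000',
    '1.2.3.4 - - [10/Oct/2011:15:25:40 +0000] "GET /a HTTP/1.1" 200 512 "-" "ua" 80000',
    '1.2.3.4 - - [10/Oct/2011:15:26:02 +0000] "GET /b HTTP/1.1" 200 512 "-" "ua" 50000',
]

for minute, (count, mean_ms) in per_minute_stats(sample_lines).items():
    print(minute, count, "requests,", round(mean_ms, 1), "ms mean")
```

Comparing the minutes around a spike with a quiet period shows whether the spike comes with more requests, slower requests, or neither.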
These are all things to investigate; more data, and data kept historically, can help you solve this problem. Hope this helps; good luck!
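One possible way to start keeping that history is to sample the aggregate "cpu" line of /proc/stat at intervals and record how the time between two samples was split among user, system, iowait, and so on. This is only a sketch under stated assumptions: the function names are hypothetical, it relies on the Linux /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal), and the two sample lines are invented.

```python
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line from /proc/stat into a dict of tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = (int(x) for x in parts[1:1 + len(CPU_FIELDS)])
    return dict(zip(CPU_FIELDS, values))


def cpu_percentages(before, after):
    """Percentage of elapsed CPU time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * ticks / total for name, ticks in deltas.items()}


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as handle:
        return handle.readline()


# Invented samples, as if taken a few seconds apart.
first = parse_cpu_line("cpu  100 0 50 800 50 0 0 0")
second = parse_cpu_line("cpu  200 0 60 850 90 0 0 0")
for name, pct in cpu_percentages(first, second).items():
    print("%-8s %5.1f%%" % (name, pct))
```

Run from cron or a small loop with a timestamp on each line, this gives a record that can be lined up against the load spikes and the Apache logs afterwards.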