Linux – How to determine what’s causing the server’s load average to jump to 90

Tags: linux, MySQL, Ubuntu

Alrighty, I'm at a complete loss here. I've had this Ubuntu server running for about three years now. In the last couple of months it started behaving oddly, and it's only getting worse. It's a pretty busy server running around 15 websites and a number of other tools. Its typical 15-minute load average is 0.3. However, it's been spiking to around 90 about every 12 hours or so.

I'm certain that it has something to do with MySQL and the server somehow getting locked, with Apache processes piling up waiting for things to open. Here is the output of top when things are going crazy:

Tasks: 143 total,  20 running, 123 sleeping,   0 stopped,   0 zombie
Cpu(s): 34.3%us, 62.9%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.2%hi,  2.6%si,  0.0%st
Mem:   2061444k total,   911460k used,  1149984k free,    11156k buffers
Swap:  1421712k total,        0k used,  1421712k free,   126728k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1080 mysql     20   0  397m  59m 5892 S   18  3.0   0:37.37 mysqld
 1602 www-data  20   0  198m  26m 4948 R    7  1.3   0:08.17 apache2
 1725 www-data  20   0  189m  24m  11m R    7  1.2   0:04.33 apache2
 1719 www-data  20   0  189m  25m  12m R    7  1.2   0:03.88 apache2
 1802 www-data  20   0  192m  20m 4808 S    7  1.0   0:03.15 apache2
 1521 www-data  20   0  199m  28m 6912 R    6  1.4   0:10.15 apache2
 1530 www-data  20   0  193m  22m 5104 S    5  1.1   0:06.53 apache2
 1536 www-data  20   0  196m  25m 4936 R    5  1.2   0:07.93 apache2
 1583 www-data  20   0  186m  21m  11m R    5  1.0   0:03.46 apache2
 1722 www-data  20   0  193m  21m 4956 R    5  1.1   0:04.91 apache2
 1906 www-data  20   0  182m  12m 6724 S    5  0.6   0:00.61 apache2
 1439 root      20   0 92040 3672 2280 S    5  0.2   0:08.04 ezproxy
 1539 www-data  20   0  194m  27m 9548 R    4  1.3   0:08.08 apache2
 1716 www-data  20   0  187m  22m  11m R    4  1.1   0:03.36 apache2
 1891 www-data  20   0  183m  18m  11m S    4  0.9   0:00.61 apache2
 1498 www-data  20   0  194m  23m 6264 S    4  1.2   0:11.47 apache2
 1517 www-data  20   0  193m  22m 5212 R    4  1.1   0:06.56 apache2
 1523 www-data  20   0  190m  26m  12m S    3  1.3   0:07.61 apache2
 1761 www-data  20   0  186m  20m  10m R    2  1.0   0:02.66 apache2
 1779 www-data  20   0  184m  19m  10m R    2  0.9   0:02.69 apache2
 1711 www-data  20   0  185m  20m  11m R    2  1.0   0:03.32 apache2
 1728 www-data  20   0  182m  11m 5028 R    2  0.6   0:01.14 apache2
 1819 www-data  20   0  181m 8120 3332 S    2  0.4   0:00.49 apache2
 1886 www-data  20   0  182m  11m 6364 S    2  0.6   0:01.18 apache2
 1899 www-data  20   0  184m  18m  10m S    2  0.9   0:01.38 apache2
 1497 www-data  20   0  191m  27m  12m S    1  1.4   0:07.84 apache2
 1766 www-data  20   0  181m  10m 5016 R    1  0.5   0:01.39 apache2
 1871 www-data  20   0  184m  19m  11m R    1  1.0   0:00.98 apache2
 1563 www-data  20   0  186m  23m  13m S    1  1.2   0:07.37 apache2
 1865 www-data  20   0  184m  18m  10m S    1  0.9   0:01.56 apache2
 1494 www-data  20   0  193m  25m 8352 S    1  1.3   0:12.07 apache2
 1512 www-data  20   0  186m  23m  13m R    1  1.1   0:06.10 apache2
 1526 www-data  20   0  186m  24m  13m R    1  1.2   0:06.30 apache2
 1816 www-data  20   0  184m  18m  10m S    1  0.9   0:01.60 apache2
 1516 www-data  20   0  184m  19m  11m S    1  1.0   0:04.12 apache2

Right now, things are running calmly. Here is the MySQL status:

Uptime: 241264  Threads: 1  Questions: 1870412  Slow queries: 1354  Opens: 13818  Flush tables: 1  Open tables: 256  Queries per second avg: 7.752

Here are all of my database sizes in MB:

name1   14.78335094
name2   11.08541870
name3   31.01449203
name4   6.24377346
name5   0.36655807
name6   10.95312500
information_schema  0.00781250
mysql   0.60296535
name7   2.19595051
name8   1.82343006
name9   20.51372623
name0   59.42693043

I checked the slow query log, but when the lockup happens every query gets dumped into it. I haven't been on the server when it happens, so I haven't been able to run a processlist. Is there anything else I can do besides that?
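Since the spikes tend to happen when nobody is logged in, one option is to let the server take the processlist snapshot itself. Below is a minimal Python sketch of that idea, not a tested tool. The threshold of 10, the log path, and the function names are all assumptions made for illustration, and it assumes `mysqladmin` can find credentials on its own (for example in `~/.my.cnf`).

```python
import os
import subprocess
import time

LOAD_THRESHOLD = 10.0                  # assumed trigger; normal load here is about 0.3
LOG_PATH = "/var/tmp/load_spike.log"   # hypothetical log location


def is_spiking(load_1min, threshold=LOAD_THRESHOLD):
    """Return True when the 1-minute load average exceeds the threshold."""
    return load_1min > threshold


def snapshot():
    """Collect the MySQL process list as text, or an error note if that fails."""
    try:
        result = subprocess.run(
            ["mysqladmin", "--verbose", "processlist"],
            capture_output=True, text=True, timeout=30,
        )
        return result.stdout or result.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "could not run mysqladmin: %s\n" % exc


def capture_if_busy(threshold=LOAD_THRESHOLD, log_path=LOG_PATH):
    """Append a timestamped process list to log_path if the load is high."""
    load_1min = os.getloadavg()[0]
    if not is_spiking(load_1min, threshold):
        return False
    with open(log_path, "a") as log:
        log.write("=== %s load=%.2f ===\n" % (time.ctime(), load_1min))
        log.write(snapshot())
    return True
```

To use it, call `capture_if_busy()` at the bottom of the script and run the script from cron every minute. A single check per run keeps it simple, and the log then shows which queries were active each time the load crossed the threshold.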

Update: Here is the output from the tuning-primer.sh script: https://gist.github.com/913565

Update: Here is an IOStat during a freakout:

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               5.25         6.05       106.35    3090763   54314928

And a vmstat 3: https://gist.github.com/913565#file_vmstat%203

Now with more SAR! https://gist.github.com/913565#file_sar

Thanks for the help.

Best Answer

Try installing sar and running it in the background. You may have a disk load which is spiking. sar will let you see which resources have the heaviest loads when things go wrong like this.
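If sar isn't installed yet, a rough stand-in for one of its numbers is to sample the aggregate `cpu` line of `/proc/stat` twice and compare. The Python sketch below does only that; it is not a replacement for sar, and the function names and the 5-second interval are assumptions for illustration.

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(p) for p in parts[1:1 + len(FIELDS)])))


def percentages(before, after):
    """Percentage of elapsed CPU time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values()) or 1  # avoid dividing by zero
    return {name: 100.0 * d / total for name, d in deltas.items()}


def read_cpu():
    """Read the current aggregate CPU counters from /proc/stat."""
    with open("/proc/stat") as stat:
        return parse_cpu_line(stat.readline())


def sample(interval=5.0):
    """Return the CPU state percentages observed over `interval` seconds."""
    before = read_cpu()
    time.sleep(interval)
    return percentages(before, read_cpu())
```

Calling `sample()` during a spike and looking at the `system` and `iowait` entries gives a quick indication of whether the time is going to the kernel or to waiting on disk.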

Your high sys load may indicate that you have a lot of I/O happening. This may be a result of natural growth of the database. Do you have an archiving process in place to remove old data from the databases? If not, you will reach a point where the data required for table scans no longer fits in memory. When this happens, performance will tank suddenly and significantly. The slow query log may also include some queries which can be improved by the addition of an index.
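One rough way to check the fits-in-memory concern is to add up the database sizes posted in the question and compare the total with the machine's RAM. The Python sketch below does just that arithmetic, using the figures from the question. On-disk size is only a crude proxy for the working set, and the comparison says nothing about how MySQL's buffers are configured.

```python
# Database sizes in MB, copied from the listing in the question.
db_sizes_mb = [
    14.78335094, 11.08541870, 31.01449203, 6.24377346,
    0.36655807, 10.95312500, 0.00781250, 0.60296535,
    2.19595051, 1.82343006, 20.51372623, 59.42693043,
]

# Total memory in KiB, from the "Mem:" line of the top output.
mem_total_kb = 2061444

total_db_mb = sum(db_sizes_mb)
mem_total_mb = mem_total_kb / 1024
ratio = total_db_mb / mem_total_mb

print("databases: %.1f MB" % total_db_mb)
print("memory:    %.1f MB" % mem_total_mb)
print("databases are %.1f%% of RAM" % (100 * ratio))
```

If the total is a small fraction of RAM, plain data growth is less likely to be the whole story, and the other suggestions are worth checking first.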

If you have another system that you can run munin on, you may want to install munin-node on the server. This will give you graphical output of some of the data available from sar. Check the graphs every so often to see if things are changing.

EDIT: It looks like you may have a memory leak in some code running under apache. Try setting MaxRequestsPerChild to around 100 and restarting apache. If that fixes your problem, try to find your memory leak.
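To see whether the Apache workers really are growing, it helps to record their resident memory over time before and after changing MaxRequestsPerChild. The Python sketch below sums the RSS values reported by `ps`. The function names are made up for illustration, and it assumes the procps `ps` found on Ubuntu and a worker process name of `apache2`, as in the top output.

```python
import subprocess


def parse_rss(ps_output):
    """Parse `ps -o rss=` output (one KiB value per line) into a list of ints."""
    return [int(value) for value in ps_output.split()]


def summarize(rss_kb):
    """Return (process count, total MiB, largest MiB) for a list of RSS values."""
    if not rss_kb:
        return 0, 0.0, 0.0
    return len(rss_kb), sum(rss_kb) / 1024, max(rss_kb) / 1024


def apache_rss(process_name="apache2"):
    """Ask ps for the resident set size of every process with this name."""
    result = subprocess.run(
        ["ps", "-C", process_name, "-o", "rss="],
        capture_output=True, text=True,
    )
    return parse_rss(result.stdout)
```

Logging `summarize(apache_rss())` every few minutes shows whether the largest worker keeps climbing between restarts, which is the pattern a leak would produce. RSS counts shared pages once per process, so the total overstates real memory use.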