Linux – Low load average, but high %user and %system cpu usage

linuxMySQL

Summed up question:

  • Why do we see one set of servers perform significantly worse with the same database and workload as others? Symptoms other than longer execution time is low (near zero) load average, high cpu usage and specifically high %system usage.

Long description:
I have several servers located at a hosting partner, running MySQL 5.1.67 and 5.1.73, where we observed performance issues during peak hours.

What we see is Load average falling from it's usual level to nearly 0 (0.10-0.20), it can perhaps best be described with this image from New Relic

enter image description here

I can reproduce the problem with a captured workload (and a dump of the database) if I run it parallel enough on our Testing and Production servers, but not on any others.

I've set up an Amazon instance with same my.cnf as Testing (details at end of post) and also tried on another Linux server (LXC container) I have available and even my desktop PC. Execution time on Testing and Production is 4 minutes, where all the others take around 1 min 30 seconds and do not show this behavior where load average is low but %user and %system is high.

Vmstat shows a high run queue and a high number of context switches while the workload runs, but only on the problematic machines, sar shows no iowait:

Testing:

$ ./workload.sh & vmstat 1 10 -w
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
 r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
1  0     168896    3218240     447004   12226164    0    0     9    75   19   12   3  1  97  0  0
32  0     168896    3129304     447004   12226204    0    0    32     0 22669 357979  49 23  27      0  0
29  0     168896    3129112     447004   12226212    0    0     0    40 23365 422537  49 26  25  0  0
14  0     168896    3126188     447004   12226232    0    0     0    52 22386 456626  43 27  30  0  0
29  0     168896    3130980     447012   12226204    0    0     0    68 23028 459332  45 27  29  0  0
24  0     168896    3125212     447020   12239788    0    0     0    96 22968 367447  49 24  27  0  0
27  0     168896    3104804     447020   12259820    0    0     0    68 22830 406129  50 28  22  0  0
30  0     168896    3081740     447020   12280300    0    0     0     0 22493 423641  49 29  22  0  0

top on Testing:
$ top
top - 19:49:22 up 1 day,  1:15,  5 users,  load average: 0.08, 0.10, 0.09
Tasks: 607 total,   1 running, 606 sleeping,   0 stopped,   0 zombie
Cpu(s): 43.7%us, 18.0%sy,  0.0%ni, 38.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

sar on Testing:
08:11:04 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
08:11:05 PM     all     51.08      0.00     24.37      0.00      0.00     24.54
08:11:06 PM     all     47.14      0.00     26.15      0.00      0.00     26.71

Amazon:

$ ./workload.sh & vmstat 1 10 -w
[1] 10472
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
 r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
 6  0          0   14133876      30316      90372    0    0     1     1   58   79   2  0  98  0  0
14  0          0   14090268      30316      95972    0    0     0     0 16866 27910  88 10   3  0  0
34  0          0   13910708      30324      90372    0    0     0   192 13934 25824  86  9   5  0  0
 1  0          0   14079724      30332      90372    0    0     0   228 10041 8075  31  2  67  0  0
 2  0          0   14102296      30332      90372    0    0     0     0 10129 7601  14  2  84  0  0
28  0          0   14095320      30332      92020    0    0     0     0 19820 27951  76  8  16  0  0
32  0          0   13940612      30340      91256    0    0     0   144 20896 26666  83 11   6  0  0
 1  0          0   14068780      30348      90372    0    0     0   204 13971 13457  53  4  42  0  0
26  0          0   14068696      30356      92816    0    0     0    56 18661 24165  65  8  26  0  0
16  0          0   13997072      30372     101740    0    0     0   288 14984 23034  63  9  26  2  0

top on Amazon:

]$ top
top - 13:51:09 up  6:12,  2 users,  load average: 6.72, 3.73, 1.69
Tasks: 256 total,   6 running, 250 sleeping,   0 stopped,   0 zombie
Cpu(s): 68.8%us,  7.5%sy,  0.0%ni, 23.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Servers:

  • Production: MySQL slave (read-only) running 5.1.67, RedHat 6.4. 2 x 6 core Xeon(R) CPU E5-2630L 0 @ 2.00GHz with HyperThreading, 192GB ram (128GB innodb_buffer)

  • Testing: MySQL 5.1.73, RedHat 6.5 (recently updated to see if it would resolve the problem). 2 x 6 core Xeon(R) CPU E5-2630L 0 @ 2.00GHz with HyperThreading, 32GB ram (4192M innodb_buffer)

In addition we have the following where I do not see the problem and which executes the workload in 1m30sec compared to 4min on the two above:

  • Amazon: MySQL 5.1.73, c4x2large RedHat 6.5 – configured with sysctl.conf and my.cnf from Testing server.

  • LXC: MySQL 5.1.73, CentOS6, my.cnf from testing

  • My desktop: MariaDB 5.5, Ubuntu, i7 4 core.

Best Answer

I think I know what you're getting at. Here's a scenario that could arrive at higher CPU utilization while simultaneously having lower load average. Although in all honestly having CPU at 50% should at least imply a load of 0.5. So something is amiss at a level outside your control.

That said, consider the following:

1) The virtual server has a burst/limit CPU allocation scheme similar to Amazon EC2 micro instances.

2) Your application uses enough CPU to use up the burst and is then throttled.

3) This throttle both increases the perceived CPU percentage used, while simultaneously reducing the actual application throughput.

4) The reduced application throughput means less related activities spawns (subprocesses, disk writes, ...) that mean less load created overall.