Summed-up question:
- Why does one set of servers perform significantly worse than others on the same database and workload? Besides longer execution times, the symptoms are a near-zero load average combined with high CPU usage, specifically high %system usage.
Long description:
I have several servers located at a hosting partner, running MySQL 5.1.67 and 5.1.73, where we observe performance issues during peak hours.
What we see is the load average falling from its usual level to nearly 0 (0.10-0.20); it is perhaps best illustrated by this image from New Relic.
I can reproduce the problem with a captured workload (and a dump of the database) if I run it with enough parallelism on our Testing and Production servers, but not on any other machine.
I've set up an Amazon instance with the same my.cnf as Testing (details at the end of the post) and also tried another Linux server (an LXC container) I have available, and even my desktop PC. Execution time on Testing and Production is 4 minutes, whereas all the others take around 1 minute 30 seconds and do not show this behavior where the load average is low but %user and %system are high.
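The actual workload.sh is not shown here, but a parallel replay of a captured statement log typically has the shape below. This is a hypothetical sketch: the file names and client count are made up, and the real script would pipe each chunk into a mysql client (indicated in the comment) rather than discarding it.

```shell
# Hypothetical skeleton of a parallel replay like workload.sh (real script not
# shown in the question): split a captured statement log into chunks and replay
# each chunk from its own client connection, then time the whole run.
seq 1 1000 | sed 's/^/SELECT /;s/$/;/' > captured.sql   # stand-in for the real capture
CLIENTS=8
split -n l/$CLIENTS captured.sql chunk_                 # one chunk per client
for f in chunk_*; do
    # in the real test this line would be something like:
    #   mysql -h testing-db dbname < "$f" &
    cat "$f" > /dev/null &
done
wait
echo "replayed $(cat chunk_* | wc -l) statements across $CLIENTS clients"
```

The 4-minute vs 1m30s figures quoted above would be the wall-clock time of such a run on each machine.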
vmstat shows a high run queue and a high number of context switches while the workload runs, but only on the problematic machines; sar shows no iowait:
Testing:
$ ./workload.sh & vmstat 1 10 -w
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
  r  b   swpd    free    buff    cache    si so bi bo    in     cs us sy id wa st
  1  0 168896 3218240 447004 12226164     0  0  9 75    19     12  3  1 97  0  0
 32  0 168896 3129304 447004 12226204     0  0 32  0 22669 357979 49 23 27  0  0
 29  0 168896 3129112 447004 12226212     0  0  0 40 23365 422537 49 26 25  0  0
 14  0 168896 3126188 447004 12226232     0  0  0 52 22386 456626 43 27 30  0  0
 29  0 168896 3130980 447012 12226204     0  0  0 68 23028 459332 45 27 29  0  0
 24  0 168896 3125212 447020 12239788     0  0  0 96 22968 367447 49 24 27  0  0
 27  0 168896 3104804 447020 12259820     0  0  0 68 22830 406129 50 28 22  0  0
 30  0 168896 3081740 447020 12280300     0  0  0  0 22493 423641 49 29 22  0  0

top on Testing:

$ top
top - 19:49:22 up 1 day, 1:15, 5 users, load average: 0.08, 0.10, 0.09
Tasks: 607 total, 1 running, 606 sleeping, 0 stopped, 0 zombie
Cpu(s): 43.7%us, 18.0%sy, 0.0%ni, 38.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

sar on Testing:

08:11:04 PM  CPU  %user  %nice  %system  %iowait  %steal  %idle
08:11:05 PM  all  51.08   0.00    24.37     0.00    0.00   24.54
08:11:06 PM  all  47.14   0.00    26.15     0.00    0.00   26.71
Amazon:
$ ./workload.sh & vmstat 1 10 -w
[1] 10472
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
  r  b swpd     free  buff  cache  si so bi  bo    in    cs us sy id wa st
  6  0    0 14133876 30316  90372   0  0  1   1    58    79  2  0 98  0  0
 14  0    0 14090268 30316  95972   0  0  0   0 16866 27910 88 10  3  0  0
 34  0    0 13910708 30324  90372   0  0  0 192 13934 25824 86  9  5  0  0
  1  0    0 14079724 30332  90372   0  0  0 228 10041  8075 31  2 67  0  0
  2  0    0 14102296 30332  90372   0  0  0   0 10129  7601 14  2 84  0  0
 28  0    0 14095320 30332  92020   0  0  0   0 19820 27951 76  8 16  0  0
 32  0    0 13940612 30340  91256   0  0  0 144 20896 26666 83 11  6  0  0
  1  0    0 14068780 30348  90372   0  0  0 204 13971 13457 53  4 42  0  0
 26  0    0 14068696 30356  92816   0  0  0  56 18661 24165 65  8 26  0  0
 16  0    0 13997072 30372 101740   0  0  0 288 14984 23034 63  9 26  2  0

top on Amazon:

$ top
top - 13:51:09 up 6:12, 2 users, load average: 6.72, 3.73, 1.69
Tasks: 256 total, 6 running, 250 sleeping, 0 stopped, 0 zombie
Cpu(s): 68.8%us, 7.5%sy, 0.0%ni, 23.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
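For reference, the cs column vmstat reports comes from the system-wide "ctxt" counter in /proc/stat; sampling it twice, one second apart, reproduces the per-second figure. A minimal sketch (Linux only, assuming /proc is mounted):

```shell
# vmstat's "cs" is derived from the monotonically increasing "ctxt" line in
# /proc/stat; the difference between two one-second samples is switches/sec.
c1=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 1
c2=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "context switches/sec: $((c2 - c1))"
```

On the problematic Testing server this hovers around 350,000-460,000/sec during the run, an order of magnitude above the Amazon instance.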
Servers:
- Production: MySQL slave (read-only) running 5.1.67, RedHat 6.4. 2 x 6-core Xeon(R) CPU E5-2630L 0 @ 2.00GHz with HyperThreading, 192GB RAM (128GB innodb_buffer)
- Testing: MySQL 5.1.73, RedHat 6.5 (recently updated to see if it would resolve the problem). 2 x 6-core Xeon(R) CPU E5-2630L 0 @ 2.00GHz with HyperThreading, 32GB RAM (4192M innodb_buffer)
In addition we have the following, where I do not see the problem and which execute the workload in 1m30s compared to 4min on the two above:
- Amazon: MySQL 5.1.73, c4.2xlarge, RedHat 6.5, configured with sysctl.conf and my.cnf from the Testing server.
- LXC: MySQL 5.1.73, CentOS 6, my.cnf from Testing.
- My desktop: MariaDB 5.5, Ubuntu, 4-core i7.
Best Answer
I think I know what you're getting at. Here's a scenario that could produce higher CPU utilization while simultaneously showing a lower load average. Although, in all honesty, having the CPU at 50% should imply a load of at least 0.5, so something is amiss at a level outside your control.
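As background for why the numbers look contradictory: the load average is the kernel's exponentially damped average of tasks that are runnable or in uninterruptible (D) sleep, and it is exposed directly in /proc/loadavg. On a box with 24 hardware threads running at roughly 50% CPU, one would normally expect a 1-minute load on the order of 10-12, not 0.1:

```shell
# /proc/loadavg holds the 1/5/15-minute load averages, plus a
# running/total task count and the most recently created PID.
read one five fifteen running_total last_pid < /proc/loadavg
echo "load: 1min=$one 5min=$five 15min=$fifteen (running/total=$running_total)"
```

That mismatch between busy CPUs and a near-zero load is exactly what makes external throttling a plausible suspect.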
That said, consider the following:
1) The virtual server has a burst/limit CPU allocation scheme similar to Amazon EC2 micro instances.
2) Your application uses enough CPU to use up the burst and is then throttled.
3) The throttling both inflates the perceived CPU percentage used and reduces actual application throughput.
4) The reduced application throughput means fewer related activities are spawned (subprocesses, disk writes, ...), which means less load created overall.
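One way to test this burst/limit theory from inside the guest is to look for CFS bandwidth throttling counters in the cgroup's cpu.stat, if the kernel exposes them (RHEL 6 kernels gained this via backport; the paths below assume cgroup v1 first and fall back to the v2 layout). Note that if the limit is enforced at the hypervisor rather than via cgroups, nothing will show up here:

```shell
# Look for CPU throttling counters (nr_throttled / throttled_time) in the
# cgroup hierarchy; their absence does not rule out hypervisor-level limits.
found=""
for f in /sys/fs/cgroup/cpu/cpu.stat /sys/fs/cgroup/cpu.stat; do
    [ -r "$f" ] && { found=$f; break; }
done
if [ -n "$found" ]; then
    echo "== $found =="
    grep -i throttled "$found" || echo "no throttling counters exposed in $found"
else
    echo "no cpu.stat visible; limits may be enforced outside the guest"
fi
```

A steadily climbing nr_throttled during the workload would confirm point 2; %steal in vmstat/sar is the analogous signal for hypervisor-level contention.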