Linux – No apparent reason for high load average

amazon ec2, high-load, iostat, linux, top

We have several web servers running on Amazon EC2 c1.xlarge instances, using the Amazon Linux AMI.

The servers are duplicates of each other, running the exact same hardware and software.
The spec of each server is:

  • 7 GB of memory
  • 20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each)
  • 1690 GB of instance storage
  • 64-bit platform
  • I/O Performance: High
  • API name: c1.xlarge

A couple of weeks ago we ran a yum upgrade on one of the servers. Since that upgrade, the upgraded server has been showing a high load average.
Needless to say, we did not update the other servers, and we cannot do so until we understand the reason for this behavior.

The strange thing is that when we compare the servers using top or iostat, we cannot find the reason for the high load.
Note that we have moved traffic from the "problematic" server to the others, which has left the "problematic" server with fewer requests to handle, and still its load is higher.

Do you have any idea what it could be, or where else we can check?

#
# proper server
# w command
#
 00:42:26 up 2 days, 19:54,  2 users,  load average: 0.41, 0.48, 0.49
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
      pts/1    82.80.137.29     00:28   14:05   0.01s  0.01s -bash
      pts/2    82.80.137.29     00:38    0.00s  0.02s  0.00s w


#
# proper server
# iostat command
#
Linux 3.2.12-3.2.4.amzn1.x86_64   _x86_64_        (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.03    0.02    4.26    0.17    0.13   86.39

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
xvdap1            1.63         1.50        55.00     367236   13444008
xvdfp1            4.41        45.93        70.48   11227226   17228552
xvdfp2            2.61         2.01        59.81     491890   14620104
xvdfp3            8.16        14.47        94.23    3536522   23034376
xvdfp4            0.98         0.79        45.86     192818   11209784


#
# problematic server
# w command
#
 00:43:26 up 2 days, 21:52,  2 users,  load average: 1.35, 1.10, 1.17
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
      pts/0    82.80.137.29     00:28   15:04   0.02s  0.02s -bash
      pts/1    82.80.137.29     00:38    0.00s  0.05s  0.00s w


#
# problematic server
# iostat command
#
Linux 3.2.20-1.29.6.amzn1.x86_64          _x86_64_        (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.97    0.04    3.43    0.19    0.07   88.30

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
xvdap1            2.10         1.49        76.54     374660   19253592
xvdfp1            5.64        40.98        85.92   10308946   21612112
xvdfp2            3.97         4.32        93.18    1087090   23439488
xvdfp3           10.87        30.30       115.14    7622474   28961720
xvdfp4            1.12         0.28        65.54      71034   16487112

#
# sar -q proper server
#
Linux 3.2.12-3.2.4.amzn1.x86_64 (***.com)        07/01/2012      _x86_64_        (8 CPU)

12:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
12:10:01 AM        13       194      0.41      0.47      0.51
12:20:01 AM         7       188      0.26      0.39      0.49
12:30:01 AM         9       198      0.64      0.49      0.49
12:40:01 AM         9       194      0.50      0.48      0.48
12:50:01 AM         7       191      0.44      0.36      0.41
01:00:01 AM        10       195      0.76      0.64      0.51
01:10:01 AM         7       175      0.41      0.58      0.56
01:20:01 AM         8       183      0.38      0.42      0.49
01:30:01 AM         8       186      0.43      0.38      0.44
01:40:01 AM         8       178      0.58      0.46      0.43
01:50:01 AM         9       185      0.47      0.45      0.45
02:00:01 AM         9       184      0.38      0.47      0.48
02:10:01 AM        10       184      0.50      0.51      0.50
02:20:01 AM        13       200      0.37      0.45      0.48
Average:            9       188      0.47      0.47      0.48

02:28:42 AM       LINUX RESTART

02:30:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
02:40:01 AM         9       151      0.55      0.55      0.37
02:50:01 AM         7       163      0.54      0.48      0.42
03:00:01 AM         9       164      0.35      0.43      0.42
03:10:01 AM        10       168      0.31      0.36      0.40
03:20:01 AM         8       170      0.27      0.34      0.39
03:30:01 AM         8       167      0.50      0.55      0.48
03:40:01 AM         8       153      0.22      0.36      0.43
03:50:01 AM         7       165      0.38      0.38      0.41
04:00:01 AM         8       169      0.70      0.45      0.42
04:10:01 AM         8       160      0.58      0.46      0.43
04:20:01 AM         8       166      0.31      0.35      0.40
04:30:01 AM         9       166      0.17      0.33      0.38
04:40:01 AM         9       159      0.13      0.29      0.37
04:50:01 AM        12       170      0.36      0.28      0.32
05:00:01 AM         7       162      0.16      0.22      0.28
05:10:01 AM         6       163      0.51      0.43      0.36
05:20:01 AM         8       162      0.50      0.45      0.41
05:30:01 AM        10       170      0.30      0.32      0.36
05:40:01 AM         7       167      0.37      0.32      0.33
05:50:01 AM         8       166      0.48      0.44      0.38
06:00:01 AM        12       177      0.41      0.41      0.40
06:10:01 AM         8       166      0.47      0.44      0.42
06:20:01 AM         9       177      0.32      0.38      0.40
06:30:01 AM         5       166      0.29      0.37      0.40
06:40:01 AM         8       165      0.57      0.41      0.40
Average:            8       165      0.39      0.39      0.39


#
# sar -q problematic server
#
Linux 3.2.20-1.29.6.amzn1.x86_64 (***.com)       07/01/2012      _x86_64_        (8 CPU)

12:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
12:10:01 AM        12       194      1.20      1.19      1.28
12:20:01 AM         7       200      0.95      1.26      1.34
12:30:01 AM        11       199      1.16      1.23      1.30
12:40:01 AM         7       200      0.96      1.03      1.18
12:50:01 AM         8       208      1.42      1.17      1.16
01:00:02 AM         8       201      0.91      1.09      1.16
01:10:01 AM         7       200      1.08      1.15      1.19
01:20:01 AM         9       200      1.45      1.25      1.23
01:30:01 AM        11       195      0.97      1.10      1.19
01:40:01 AM         7       188      0.78      1.05      1.16
01:50:01 AM         9       196      1.32      1.22      1.24
02:00:01 AM        12       206      0.96      1.17      1.22
02:10:01 AM         9       187      0.96      1.09      1.17
Average:            9       198      1.09      1.15      1.22

02:23:22 AM       LINUX RESTART

02:30:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
02:40:01 AM         9       160      1.12      1.16      0.87
02:50:01 AM         9       163      0.77      0.94      0.91
03:00:01 AM         7       162      1.03      1.10      1.03
03:10:01 AM         9       164      0.99      1.07      1.05
03:20:01 AM         8       171      1.08      1.11      1.07
03:30:01 AM         8       167      1.02      0.99      1.02
03:40:01 AM         5       158      1.20      1.06      1.05
03:50:01 AM         8       171      1.11      1.10      1.07
04:00:01 AM         7       162      1.12      1.10      1.10
04:10:01 AM         9       164      0.90      0.94      1.02
04:20:01 AM         7       169      0.90      1.08      1.10
04:30:01 AM        13       169      1.07      1.07      1.10
04:40:01 AM        11       166      0.95      1.12      1.13
04:50:01 AM         7       173      1.04      1.12      1.13
05:00:01 AM         7       166      1.26      1.20      1.19
05:10:01 AM        10       169      1.14      1.25      1.22
05:20:01 AM        10       170      0.98      1.12      1.19
05:30:01 AM        10       166      0.82      0.98      1.09
05:40:01 AM        11       171      1.18      1.16      1.11
05:50:01 AM        12       187      1.07      1.19      1.16
06:00:01 AM         9       171      1.27      1.17      1.16
06:10:01 AM         7       169      1.40      1.26      1.22
06:20:01 AM         8       171      0.91      1.12      1.19
06:30:01 AM         8       172      1.00      1.11      1.17
06:40:01 AM         9       177      1.02      1.10      1.15
Average:            9       168      1.05      1.10      1.10

Best Answer

AWS oversubscribes its VM hosts: it assumes that not everyone will consume all the resources allocated to them, so Amazon can make more money per unit of hardware deployed. Thus you can have two otherwise-identical systems running with wildly different performance patterns. The correlation with the upgrade is likely to be a coincidence.
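If contention from other guests on the same host is suspected, one place it may show up is the steal counter in /proc/stat, which is the figure iostat reports as %steal. The following is a minimal Python sketch of sampling it over an interval. It is not part of the original answer: the function names are hypothetical, and it assumes a Linux kernel that reports a steal field on the aggregate cpu line.

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
# The guest counters that follow are already included in user/nice, so they
# are left out of the total on purpose.
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Turn the aggregate 'cpu' line of /proc/stat into a dict of counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def steal_percent(before, after):
    """Share of elapsed CPU time reported as steal between two samples."""
    total = sum(after[k] - before.get(k, 0) for k in after)
    if total <= 0:
        return 0.0
    stolen = after.get("steal", 0) - before.get("steal", 0)
    return 100.0 * stolen / total


def read_cpu_sample(path="/proc/stat"):
    """Read one sample; the aggregate line is the first line of the file."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


def measure_steal(interval=5.0):
    """Take two samples `interval` seconds apart and return the steal percentage."""
    first = read_cpu_sample()
    time.sleep(interval)
    return steal_percent(first, read_cpu_sample())
```

Calling measure_steal() on both servers at the same time of day gives a like-for-like comparison. A since-boot figure such as the one in a single iostat run can hide short bursts, which is why the sketch works on the difference between two samples.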

A note on your diagnostic data: you really want the output of sar -q to help you diagnose this sort of problem. iostat examines only a very small portion of the possible sources of the issue.
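To compare sar -q output from two servers side by side, the rows can be parsed and averaged. The sketch below is an illustration only, with hypothetical function names. It assumes the column layout shown in the question (runq-sz, plist-sz, ldavg-1, ldavg-5, ldavg-15) and a timestamp with an optional AM/PM suffix; other sysstat versions may print additional columns, which this pattern would not match.

```python
import re

# One data row: timestamp, optional AM/PM, two integer columns, three load averages.
# Header rows, "Average:" rows and "LINUX RESTART" markers do not match and are skipped.
ROW = re.compile(
    r"^\d{2}:\d{2}:\d{2}(?: [AP]M)?\s+(\d+)\s+(\d+)"
    r"\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)$"
)


def parse_sar_q(text):
    """Return a list of dicts, one per data row of `sar -q` output."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append({
                "runq_sz": int(match.group(1)),
                "plist_sz": int(match.group(2)),
                "ldavg_1": float(match.group(3)),
                "ldavg_5": float(match.group(4)),
                "ldavg_15": float(match.group(5)),
            })
    return rows


def mean(rows, key):
    """Arithmetic mean of one column, or 0.0 when there are no rows."""
    if not rows:
        return 0.0
    return sum(row[key] for row in rows) / len(rows)


def compare(text_a, text_b, key="ldavg_5"):
    """Mean of `key` for each of two sar -q dumps, as a (mean_a, mean_b) pair."""
    return mean(parse_sar_q(text_a), key), mean(parse_sar_q(text_b), key)
```

Passing the two saved outputs to compare() returns one mean per server for the chosen column. Comparing runq_sz alongside the load averages can help show whether the run queue itself differs between the two machines.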