Linux – What’s going on with the server? High load, lots of idle CPU time, low disk utilization

cpu-usageemaillinuxPHP

I run a web site and send a legitimate opt-in, daily email newsletter to subscribers. Both the web hosting and email sending are done by the same machine.

I have about 100,000 subscribers who have opted in to my daily email newsletter. My PHP script did a pretty good job sending mail to all of them until fairly recently, but as the list has grown I can't keep up.

When I run top, I have very high load–usually at least 6 or 7, sometimes as high as 15–even though I only have two CPUs. However, when I run sar, my CPU is idle an average of about 30% of the time. So, it seems I'm not CPU bound. When I run iostat, it seems as though I'm not disk bound because my %util for each device is very low (no more than 5%).

Given that I don't seem to be CPU bound or disk bound, why is top reporting such high load?

Additionally, since I don't seem to be CPU bound or disk bound, why is my email sending script not able to keep up?


Here's what I see when running top:

top - 11:33:28 up 74 days, 18:49,  2 users,  load average: 7.65, 8.79, 8.28
Tasks: 168 total,   5 running, 162 sleeping,   0 stopped,   1 zombie
Cpu(s): 38.9%us, 58.6%sy,  0.8%ni,  0.0%id,  0.7%wa,  0.2%hi,  0.8%si,  0.0%st
Mem:   3083012k total,  2144436k used,   938576k free,   281136k buffers
Swap:  2048248k total,    39164k used,  2009084k free,  1470412k cached

Here's what I see when running iostat -mx:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          34.80    1.20   55.24    0.37    0.00    8.38

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.19    71.70  1.59 29.45     0.02     0.07     5.90     0.55   17.82   1.16   3.59
sda1              0.00     0.00  0.00  0.00     0.00     0.00     7.10     0.00   13.80  13.72   0.00
sda2              0.05    50.45  1.13 24.57     0.01     0.29    24.25     0.35   13.43   1.15   2.97
sda3              0.05    10.17  0.20  2.33     0.01     0.05    43.75     0.05   20.96   2.45   0.62
sda4              0.00     0.00  0.00  0.00     0.00     0.00     2.00     0.00   70.50  70.50   0.00
sda5              0.07     0.22  0.03  0.07     0.00     0.00    32.84     0.08  856.19   8.03   0.08
sda6              0.02     5.45  0.03  0.72     0.00     0.02    67.55     0.02   26.72   5.26   0.39
sda7              0.00     1.56  0.00  0.42     0.00     0.01    38.04     0.00    8.88   5.84   0.24
sda8              0.01     3.84  0.20  1.35     0.00     0.02    28.55     0.05   31.90   4.08   0.63

Here's what I see when running sar:

09:40:02 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:50:01 AM       all     30.59      1.01     49.80      0.23      0.00     18.37
10:00:08 AM       all     31.73      0.92     51.66      0.13      0.00     15.55
10:10:06 AM       all     30.43      0.99     48.94      0.26      0.00     19.38
10:20:01 AM       all     29.58      1.00     47.76      0.25      0.00     21.42
10:30:01 AM       all     29.37      1.02     47.30      0.18      0.00     22.13
10:40:06 AM       all     32.50      1.01     52.94      0.16      0.00     13.39
10:50:01 AM       all     30.49      1.00     49.59      0.15      0.00     18.77
11:00:01 AM       all     29.43      0.99     47.71      0.17      0.00     21.71
11:10:07 AM       all     30.26      0.93     49.48      0.83      0.00     18.50
11:20:02 AM       all     29.83      0.81     48.51      1.32      0.00     19.52
11:30:06 AM       all     31.18      0.88     51.33      1.15      0.00     15.47
Average:          all     26.21      1.15     42.62      0.48      0.00     29.54

Here are the top handful of processes listed at the particular time I happened to run top -c:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                      
 8180 mysql     16   0 57448  19m 2948 S 26.6  0.7   4702:26 /usr/sbin/mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --pid-file=/var/lib/mysql/bristno.pid --skip-external-locking                          
26956 brristno  17   0     0    0    0 Z  8.0  0.0   0:00.24 [php] <defunct>                                                                                                                                                               
26958 brristno  17   0 94408  43m  37m R  5.0  1.4   0:00.15 /usr/bin/php /home/brristno/public_html/dbv.php                                                                                                                               
22852 nobody    16   0  9628 2900 1524 S  0.7  0.1   0:00.17 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                    
 8591 brristno  34  19 96896  13m 6652 S  0.3  0.4   0:29.82 /usr/local/bin/php /home/brristno/bin/mailer.php 1qwqyb6 i0gbor                                                                                                               
24469 nobody    16   0  9628 2880 1508 S  0.3  0.1   0:00.08 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                    
25495 nobody    15   0  9628 2876 1500 S  0.3  0.1   0:00.06 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                    
26149 nobody    15   0  9628 2864 1504 S  0.3  0.1   0:00.04 /usr/local/apache/bin/httpd -k start -DSSL      

Thank you, Dmitri!

1) I already have a script that unsubscribes email addresses that have bounced at least five times in the past month, so hopefully that is keeping my list relatively limited to active email addresses.

2) I am using exim 4.69. My config file is at

/etc/exim.conf

and my log files are at:

/var/log/exim_mainlog
/var/log/exim_paniclog
/var/log/exim_rejectlog

Additionally, when I look in /etc/syslog.conf, I see the following:

# Log all the mail messages in one place.
mail.*                                                  -/var/log/maillog

I don't know what the "-" means at the beginning of -/var/log/maillog but when I look in that file it's clear that a lot is being logged there.

Additionally, a lot is being logged in this file:

/var/log/exim_mainlog

I since added to /etc/exim.conf this line:

no_message_logs

I thought that that would disable mail logging (I did restart exim), but when I look at /var/log/maillog and at /var/log/exim_mainlog both files are still receiving new log entries.

Question: How can I disable most/all exim logging?

3) When I look in /var/log/exim_paniclog, I see a ton of entries like this one:

2010-12-19 04:03:32 1PUFB1-0006xZ-GF User 0 set for local_delivery transport is on the never_users list

After looking around for a while, it seems as though that means exim is trying to deliver to the root email address. What's the best way to handle these mail deliveries to root while using as few CPU resources as possible?

Best Answer

As noted load average is related to the number of waiting processes in the run queue. If each of those processes has very little work to do and frees the processor quickly you can handle much larger load averages than the common 1 per CPU rule of thumb.

Mail is pretty much the perfect example of this, each process needs CPU to send a message but very, very little. I've seen mail systems running sendmail at a load average in the 25 to 35 range, and the system is still interactive and working fine.

Mark