I'm trying to find out whether my server is network-I/O bound or CPU bound. I have looked at the output of some of the typical tools for checking the current state of the system (output of iostat, sar and top below), but I'm not quite sure whether my interpretation of the output is correct.
This is my setup:
- Application server (actually 2 of those): Debian, JBoss, quad core, 16 GB RAM
- Database server: SUSE, MySQL, quad core, 64 GB RAM
- Communication from the app servers to the database server goes through a firewall (capable of 100 Mbit/s)
What the servers are doing:
- The application deployed in JBoss reads in large amounts of text files, performs some plausibility checks (which involve talking to the database) and in the end stores the data from the text files in our database
To improve throughput we recently installed 4 additional JBoss instances, so there are now 5 instances of our application doing the import (no clustering). Unfortunately, performance didn't improve as much as I had hoped.
These are the stats I gathered so far on the database server:
top:
top - 14:09:28 up 27 days, 9:45, 1 user, load average: 0.65, 0.69, 0.83
Tasks: 92 total, 1 running, 91 sleeping, 0 stopped, 0 zombie
Cpu(s): 10.8% us, 1.1% sy, 0.0% ni, 85.5% id, 1.7% wa, 0.1% hi, 0.8% si
Mem: 65884336k total, 61751244k used, 4133092k free, 524752k buffers
Swap: 8388600k total, 1097864k used, 7290736k free, 32520508k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9187 mysql 16 0 26.8g 26g 4168 S 49.3 41.7 12455:36 mysqld
1 root 16 0 656 88 56 S 0.0 0.0 0:06.30 init
2 root RT 0 0 0 0 S 0.0 0.0 0:00.62 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.03 ksoftirqd/0
So the database server is pretty much idle and I didn't investigate it any further.
These are the stats I gathered so far on the application server:
top:
top - 14:31:11 up 43 days, 23:25, 1 user, load average: 7.31, 7.13, 6.90
Tasks: 87 total, 2 running, 85 sleeping, 0 stopped, 0 zombie
Cpu(s): 58.0%us, 3.9%sy, 0.0%ni, 35.7%id, 0.0%wa, 0.1%hi, 2.2%si, 0.0%st
Mem: 16440520k total, 15894640k used, 545880k free, 79580k buffers
Swap: 8192616k total, 56k used, 8192560k free, 2968948k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8318 lvs 18 0 2773m 2.2g 13m S 101 14.0 1233:34 java
8367 lvs 18 0 2752m 2.2g 13m S 65 13.9 1217:41 java
8118 lvs 18 0 4731m 2.2g 13m S 58 14.2 1201:01 java
8278 lvs 18 0 2755m 1.9g 13m S 21 12.3 1212:48 java
8411 lvs 20 0 2743m 2.1g 13m S 8 13.4 1206:58 java
1 root 18 0 6124 676 560 S 0 0.0 0:04.10 init
2 root RT 0 0 0 0 S 0 0.0 0:01.18 migration/0
3 root 34 19 0 0 0 S 0 0.0 0:00.12 ksoftirqd/0
Here the load average is pretty high, as is the CPU utilization. What puzzles me is the fact that the CPU utilization is not consistently high. This is what I got when I watched CPU usage for 20 seconds using iostat:
lvs@ftpslavedev:~$ iostat -c | head -3 ; iostat -c 1 20 | grep "^ *[0-9]"
Linux 2.6.18-6-amd64 (ftpslavedev) 10/07/09
avg-cpu: %user %nice %system %iowait %steal %idle
2.40 0.00 0.34 0.16 0.00 97.10
58.00 0.00 4.00 0.00 0.00 38.00
50.62 0.00 3.23 0.00 0.00 46.15
35.16 0.00 3.49 0.50 0.00 60.85
75.43 0.00 5.46 0.00 0.00 19.11
72.07 0.00 2.49 0.25 0.00 25.19
50.12 0.00 4.24 0.00 0.00 45.64
50.25 0.00 1.00 0.00 0.00 48.75
32.18 0.00 3.71 0.00 0.00 64.11
51.74 0.00 1.99 0.00 0.00 46.27
53.12 0.00 2.99 1.00 0.00 42.89
47.64 0.00 2.73 0.00 0.00 49.63
13.18 0.00 2.74 0.00 0.00 84.08
0.00 0.00 2.24 0.00 0.00 97.76
0.25 0.00 3.23 0.00 0.00 96.52
0.00 0.00 2.74 0.50 0.00 96.77
0.00 0.00 2.99 0.00 0.00 97.01
17.41 0.00 2.74 0.00 0.00 79.85
23.19 0.00 2.99 0.75 0.00 73.07
23.33 0.00 3.23 0.00 0.00 73.45
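One thing worth checking given those swings: iostat's avg-cpu line is averaged over all cores, so on a quad core a single busy thread shows up as only ~25% aggregate, and a bursty pattern like the one above can hide a pegged core. A small sketch (mpstat is an assumption here, but it ships in the same sysstat package as the iostat and sar used above):

```shell
# Per-core view of the same interval (uncomment on the app server):
#   mpstat -P ALL 1 5
# Why the aggregate can look moderate while one core is saturated:
awk 'BEGIN { printf "one hot thread on 4 cores = %.0f%% aggregate\n", 100 / 4 }'
```

If one core sits near 100% while the others idle, the importer is likely serialized on a single thread rather than genuinely short of CPU.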
Because the CPU usage wasn't consistently high, I wondered whether the communication between the app server and the database could be the bottleneck, and used sar:
lvs@ftpslavedev:~$ sar -n DEV | head -3 && sar -n DEV 1 20 | grep "eth0.*[1-9]"
Linux 2.6.18-6-amd64 (ftpslavedev) 10/07/09
00:00:01 IFACE rxpck/s txpck/s rxbyt/s txbyt/s rxcmp/s txcmp/s rxmcst/s
16:55:30 eth0 22562.00 21297.00 9632456.00 5156549.00 0.00 0.00 0.00
16:55:32 eth0 19690.10 18800.00 8496563.37 4558991.09 0.00 0.00 0.00
16:55:34 eth0 22716.00 21214.00 10610874.00 5378377.00 0.00 0.00 0.00
16:55:36 eth0 17737.62 16509.90 9027099.01 4784231.68 0.00 0.00 0.00
16:55:38 eth0 10749.69 9625.79 6233610.69 2357184.91 0.00 0.00 0.00
16:55:39 eth0 20929.70 21002.97 5359857.43 4705525.74 0.00 0.00 0.00
16:55:41 eth0 17462.38 17476.24 6281078.22 5062188.12 0.00 0.00 0.00
16:55:44 eth0 19410.19 19368.52 4770402.78 3916590.74 0.00 0.00 0.00
16:55:46 eth0 13388.12 13277.23 3501303.96 2883294.06 0.00 0.00 0.00
16:55:48 eth0 25988.12 24862.38 10358798.02 5966493.07 0.00 0.00 0.00
The values in rxbyt/s and txbyt/s are bytes per second. If I convert those to Mbits per second I get something like
rxbyt/s    rx Mbit/s    txbyt/s    tx Mbit/s
9632456    73.49        5156549    39.34
8496563    64.82        4558991    34.78
10610874   80.95        5378377    41.03
9027099    68.87        4784231    36.50
6233610    47.56        2357184    17.98
5359857    40.89        4705525    35.90
6281078    47.92        5062188    38.62
4770402    36.40        3916590    29.88
3501303    26.71        2883294    22.00
10358798   79.03        5966493    45.52
3694266    28.19        2192922    16.73
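For reference, the conversion in the table above (bytes/s × 8 bits, divided by 2^20 = 1,048,576) can be scripted, e.g. with awk; the sample values below are the first rx/tx pair from the sar output:

```shell
# Convert sar's rxbyt/s / txbyt/s figures to Mbit/s (x8 bits, /2^20),
# matching the conversion used in the table above:
for bytes in 9632456 5156549; do
    awk -v b="$bytes" 'BEGIN { printf "%d bytes/s = %.2f Mbit/s\n", b, b * 8 / 1048576 }'
done
```

Note that network gear is usually rated in decimal Mbit (10^6 bits), so against the firewall's 100 Mbit/s rating the figures would come out roughly 5% higher still.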
So rx alone is often above 60 Mbit/s. Given that our second app server does exactly the same thing, I think it might well be the case that our firewall (as stated earlier, only capable of handling 100 Mbit/s) is the real reason for the high load average on the server, and that adding even more servers wouldn't help us much.
I don't know if my interpretation of the data is actually reasonable and so would appreciate your comments on this.
Best regards,
Stefan
Best Answer
Hm, complex problem :-/. One point:
If the firewall throttled data transfers, you would not see high CPU load (the %user above); rather you would see more idle time, because threads blocked on socket reads simply sleep (note that %iowait only covers outstanding disk I/O, not network waits). So the firewall seems unlikely as the cause of the high load (unless the app does some kind of busy polling).
I think your best course of action is to examine the high %user on the app servers more closely: do some profiling to find out what exactly the app is doing while it loads the CPU. That should give you a clue.
The firewall might also be worth looking into, as you are getting close to its capacity.
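To quantify that, you could measure the raw TCP throughput through the firewall with iperf, and do a back-of-envelope check against the sar numbers above (the hostname is a placeholder; run `iperf -s` on the database server first):

```shell
# Raw throughput through the firewall (uncomment to run for real):
#   iperf -c dbserver -t 30
# Back-of-envelope: two app servers, each peaking near 80 Mbit/s rx,
# against a 100 Mbit/s link:
awk 'BEGIN { printf "combined peak: %d Mbit/s against 100 available\n", 2 * 80 }'
```

If both app servers actually peak at the same time, the link is clearly oversubscribed; if their peaks rarely overlap, the firewall may still have headroom.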