I'm trying to find out whether my server is network-I/O bound or CPU bound. I have looked at the output of some of the typical tools for checking the current state of the system (output of iostat, sar and top below), but I'm not quite sure whether my interpretation of the output is correct.
This is my setup:
- Application server (actually 2 of those): Debian, JBoss, quad core, 16 GB RAM
- Database server: SUSE, MySQL, quad core, 64 GB RAM
- Communication from the app servers to the database server goes through a firewall (capable of 100 Mbit/s)
What the servers are doing:
- The application deployed in JBoss reads in large amounts of text files, performs some plausibility checks (which involve talking to the database) and in the end stores the data from the text files in our database
To improve throughput we recently installed 4 additional JBoss instances, so there are now 5 instances of our application doing the import (no clustering). Unfortunately, performance didn't improve as much as I had hoped.
These are the stats I gathered so far on the database server:
top:
top - 14:09:28 up 27 days, 9:45, 1 user, load average: 0.65, 0.69, 0.83
Tasks: 92 total, 1 running, 91 sleeping, 0 stopped, 0 zombie
Cpu(s): 10.8% us, 1.1% sy, 0.0% ni, 85.5% id, 1.7% wa, 0.1% hi, 0.8% si
Mem: 65884336k total, 61751244k used, 4133092k free, 524752k buffers
Swap: 8388600k total, 1097864k used, 7290736k free, 32520508k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9187 mysql 16 0 26.8g 26g 4168 S 49.3 41.7 12455:36 mysqld
1 root 16 0 656 88 56 S 0.0 0.0 0:06.30 init
2 root RT 0 0 0 0 S 0.0 0.0 0:00.62 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.03 ksoftirqd/0
So the database server is pretty much idle and I didn't investigate it any further.
These are the stats I gathered so far on the application server:
top:
top - 14:31:11 up 43 days, 23:25, 1 user, load average: 7.31, 7.13, 6.90
Tasks: 87 total, 2 running, 85 sleeping, 0 stopped, 0 zombie
Cpu(s): 58.0%us, 3.9%sy, 0.0%ni, 35.7%id, 0.0%wa, 0.1%hi, 2.2%si, 0.0%st
Mem: 16440520k total, 15894640k used, 545880k free, 79580k buffers
Swap: 8192616k total, 56k used, 8192560k free, 2968948k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8318 lvs 18 0 2773m 2.2g 13m S 101 14.0 1233:34 java
8367 lvs 18 0 2752m 2.2g 13m S 65 13.9 1217:41 java
8118 lvs 18 0 4731m 2.2g 13m S 58 14.2 1201:01 java
8278 lvs 18 0 2755m 1.9g 13m S 21 12.3 1212:48 java
8411 lvs 20 0 2743m 2.1g 13m S 8 13.4 1206:58 java
1 root 18 0 6124 676 560 S 0 0.0 0:04.10 init
2 root RT 0 0 0 0 S 0 0.0 0:01.18 migration/0
3 root 34 19 0 0 0 S 0 0.0 0:00.12 ksoftirqd/0
Here the load average is pretty high, as is the CPU utilization. What puzzles me is the fact that the CPU utilization is not consistently high. This is what I got when I watched CPU usage for 20 seconds using iostat:
lvs@ftpslavedev:~$ iostat -c | head -3 ; iostat -c 1 20 | grep "^ *[0-9]"
Linux 2.6.18-6-amd64 (ftpslavedev) 10/07/09
avg-cpu: %user %nice %system %iowait %steal %idle
2.40 0.00 0.34 0.16 0.00 97.10
58.00 0.00 4.00 0.00 0.00 38.00
50.62 0.00 3.23 0.00 0.00 46.15
35.16 0.00 3.49 0.50 0.00 60.85
75.43 0.00 5.46 0.00 0.00 19.11
72.07 0.00 2.49 0.25 0.00 25.19
50.12 0.00 4.24 0.00 0.00 45.64
50.25 0.00 1.00 0.00 0.00 48.75
32.18 0.00 3.71 0.00 0.00 64.11
51.74 0.00 1.99 0.00 0.00 46.27
53.12 0.00 2.99 1.00 0.00 42.89
47.64 0.00 2.73 0.00 0.00 49.63
13.18 0.00 2.74 0.00 0.00 84.08
0.00 0.00 2.24 0.00 0.00 97.76
0.25 0.00 3.23 0.00 0.00 96.52
0.00 0.00 2.74 0.50 0.00 96.77
0.00 0.00 2.99 0.00 0.00 97.01
17.41 0.00 2.74 0.00 0.00 79.85
23.19 0.00 2.99 0.75 0.00 73.07
23.33 0.00 3.23 0.00 0.00 73.45
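One thing worth checking given those swings: iostat's avg-cpu line is averaged over all cores, so on a quad core a single busy thread shows up as only ~25% aggregate, and a bursty pattern like the one above can hide a pegged core. A small sketch (mpstat is an assumption here, but it ships in the same sysstat package as the iostat and sar used above):

```shell
# Per-core view of the same interval (uncomment on the app server):
#   mpstat -P ALL 1 5
# Why the aggregate can look moderate while one core is saturated:
awk 'BEGIN { printf "one hot thread on 4 cores = %.0f%% aggregate\n", 100 / 4 }'
```

If one core sits near 100% while the others idle, the importer is likely serialized on a single thread rather than genuinely short of CPU.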
Because the CPU usage wasn't consistently high, I wondered whether the communication between the app server and the database could be the bottleneck, and used sar:
lvs@ftpslavedev:~$ sar -n DEV | head -3 && sar -n DEV 1 20 | grep "eth0.*[1-9]"
Linux 2.6.18-6-amd64 (ftpslavedev) 10/07/09
00:00:01 IFACE rxpck/s txpck/s rxbyt/s txbyt/s rxcmp/s txcmp/s rxmcst/s
16:55:30 eth0 22562.00 21297.00 9632456.00 5156549.00 0.00 0.00 0.00
16:55:32 eth0 19690.10 18800.00 8496563.37 4558991.09 0.00 0.00 0.00
16:55:34 eth0 22716.00 21214.00 10610874.00 5378377.00 0.00 0.00 0.00
16:55:36 eth0 17737.62 16509.90 9027099.01 4784231.68 0.00 0.00 0.00
16:55:38 eth0 10749.69 9625.79 6233610.69 2357184.91 0.00 0.00 0.00
16:55:39 eth0 20929.70 21002.97 5359857.43 4705525.74 0.00 0.00 0.00
16:55:41 eth0 17462.38 17476.24 6281078.22 5062188.12 0.00 0.00 0.00
16:55:44 eth0 19410.19 19368.52 4770402.78 3916590.74 0.00 0.00 0.00
16:55:46 eth0 13388.12 13277.23 3501303.96 2883294.06 0.00 0.00 0.00
16:55:48 eth0 25988.12 24862.38 10358798.02 5966493.07 0.00 0.00 0.00
The values in rxbyt/s and txbyt/s are bytes per second. If I convert those to Mbits per second I get something like
rxbyt/s    rx Mbit/s    txbyt/s    tx Mbit/s
9632456    73.49        5156549    39.34
8496563    64.82        4558991    34.78
10610874   80.95        5378377    41.03
9027099    68.87        4784231    36.50
6233610    47.56        2357184    17.98
5359857    40.89        4705525    35.90
6281078    47.92        5062188    38.62
4770402    36.40        3916590    29.88
3501303    26.71        2883294    22.00
10358798   79.03        5966493    45.52
3694266    28.19        2192922    16.73
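For reference, the conversion in the table above (bytes/s × 8 bits, divided by 2^20 = 1,048,576) can be scripted, e.g. with awk; the sample values below are the first rx/tx pair from the sar output:

```shell
# Convert sar's rxbyt/s / txbyt/s figures to Mbit/s (x8 bits, /2^20),
# matching the conversion used in the table above:
for bytes in 9632456 5156549; do
    awk -v b="$bytes" 'BEGIN { printf "%d bytes/s = %.2f Mbit/s\n", b, b * 8 / 1048576 }'
done
```

Note that network gear is usually rated in decimal Mbit (10^6 bits), so against the firewall's 100 Mbit/s rating the figures would come out roughly 5% higher still.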
So rx alone is often above 60 Mbit/s. Given that our second app server does exactly the same thing, I think it might well be the case that our firewall (as stated earlier, only capable of handling 100 Mbit/s) is the real reason for the high load average on the server, and that adding even more servers wouldn't help us much.
I don't know if my interpretation of the data is actually reasonable and so would appreciate your comments on this.
Best regards,
Stefan
Best Answer
Hm, complex problem :-/. One point:
If the firewall throttled data transfers, you would not see high CPU load (the %user above); rather you would see more idle time, because threads blocked on socket reads simply sleep (note that %iowait only covers outstanding disk I/O, not network waits). So the firewall seems unlikely as the cause of the high load (unless the app does some kind of busy polling).
I think your best course of action is to examine the high %user on the app servers more closely: do some profiling to find out what exactly the app is doing while it loads the CPU. That should give you a clue.
The firewall might also be worth looking into, as you are getting close to its capacity.
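To quantify that, you could measure the raw TCP throughput through the firewall with iperf, and do a back-of-envelope check against the sar numbers above (the hostname is a placeholder; run `iperf -s` on the database server first):

```shell
# Raw throughput through the firewall (uncomment to run for real):
#   iperf -c dbserver -t 30
# Back-of-envelope: two app servers, each peaking near 80 Mbit/s rx,
# against a 100 Mbit/s link:
awk 'BEGIN { printf "combined peak: %d Mbit/s against 100 available\n", 2 * 80 }'
```

If both app servers actually peak at the same time, the link is clearly oversubscribed; if their peaks rarely overlap, the firewall may still have headroom.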