Linux I/O bottleneck with data-movers

bottleneck io linux ubuntu-10.04

I have a 24-core machine with 94.6 GiB of RAM running Ubuntu Server 10.04. The box experiences high %iowait, unlike another server we have (4 cores) running the same type and number of processes. Both machines are connected to a VNX RAID fileserver, the 24-core machine via 4 FC cards and the other via 2 Gigabit Ethernet cards. The 4-core machine currently outperforms the 24-core machine: it has higher CPU usage and lower %iowait.

Over 9 days of uptime, %iowait averages 16% and is routinely above 30%. Most of the time CPU usage is very low, around 5% (because of the high iowait). There is ample free memory.

One thing I don't understand is why all the data appears to be going through device sdc rather than going through the data movers directly:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.11    0.39    0.75   16.01    0.00   76.74

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0.00         0.00         0.00       1232          0
sdb               0.00         0.00         0.00       2960          0
sdc               1.53        43.71        44.54   36726612   37425026
dm-0              0.43        27.69         0.32   23269498     268696
dm-1              1.00         1.86         7.74    1566234    6500432
dm-2              0.96         1.72         5.97    1442482    5014376
dm-3              0.49         9.57         0.18    8040490     153272
dm-4              0.00         0.00         0.00       1794         24
dm-5              0.00         0.00         0.00        296          0

Another piece of the puzzle is that tasks frequently go into uninterruptible sleep (D state in top), probably also due to the I/O holdup.
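(For reference, tasks stuck in that state can be listed as below; this is just a sketch, with <PID> as a placeholder.)

# List processes currently in uninterruptible sleep (state D)
ps -eo state,pid,cmd | awk '$1 == "D"'

# For a given PID, show the kernel wait channel it is blocked in
ps -o pid,wchan:32,cmd -p <PID>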

What can I look at to help diagnose the problem? Why is all the data going through /dev/sdc? Is that normal?

UPDATE:

The network connection and the VNX's read/write capacity have been ruled out as bottlenecks. We can reach speeds of 800 MB/s with the 4 bonded NICs (round-robin). The Fibre Channel cards are not yet being used. The VNX can easily handle the I/O: RAID 6, two pools of 30 x 2 TB 7.2k RPM disks (60 disks total), about 60% reads.

Ignore the above about dm and sdc; they are all internal disks and not part of the problem.

We think the issue might be with the NFS mounts or TCP (we have 5 mounts to 5 partitions on the VNX), but we don't know exactly what. Any advice?
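(For the record, this is roughly how the NFS side can be checked; nfsiostat is part of newer nfs-utils and may not be packaged on 10.04.)

# Show the options each NFS mount actually negotiated (rsize, wsize, proto, vers)
nfsstat -m

# Client-side RPC statistics; a high retrans count points at network/TCP trouble
nfsstat -c

# Per-mount NFS throughput and latency, sampled every 5 seconds
nfsiostat 5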

Best Answer

First of all, if your CPUs (and 24 is a lot!) consume data faster than the storage can deliver it, you get iowait. That is when the kernel pauses a process during blocking I/O (a read that arrives too slowly, or a synchronous write).
So check that the storage can provide enough throughput for 24 cores.
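A rough way to do that, as a sketch only (/mnt/vnx below is a placeholder for one of your NFS mounts, and the test file needs free space there), is to measure what a single stream actually gets:

# Sequential write of 1 GiB, bypassing the page cache so storage/network is measured
dd if=/dev/zero of=/mnt/vnx/testfile bs=1M count=1024 oflag=direct

# Sequential read of the same file, again with the cache bypassed
dd if=/mnt/vnx/testfile of=/dev/null bs=1M iflag=direct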

For example, assume your storage can provide 500 MB/s of throughput and that you are connected via two bonded Gigabit Ethernet links; the network then already caps the maximum throughput at roughly 100-180 MB/s. If each process consumes data at 50 MB/s and you run 4 threads on your 4-core machine, that is 4 x 50 MB/s = 200 MB/s demanded. If the network can sustain 180 MB/s, you will not see much latency and your CPUs will stay loaded; the network is only a mild bottleneck.
Now scale this up to 24 cores and 24 threads: you would need 1200 MB/s. Even if you change the wiring to allow that throughput, the storage system cannot deliver more than 500 MB/s, so it becomes the bottleneck.
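To see whether the aggregate holds up under many concurrent consumers rather than a single stream, a crude test (same placeholder path as above) is to run one direct-I/O reader per core in parallel and add up the rates they report:

# 24 concurrent sequential readers; the sum of the reported rates
# is the effective aggregate throughput the storage path can deliver
for i in $(seq 1 24); do
    dd if=/mnt/vnx/testfile of=/dev/null bs=1M iflag=direct &
done
wait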

When it comes to iowait, bottlenecks can be anywhere: not only in the physical layers, but also in software and kernel-space buffers. It really depends on the usage patterns. But since software bottlenecks are much harder to identify, it is usually preferable to check the theoretical hardware throughput before investigating the software stacks.

As said, iowait occurs when a process makes a read and the data takes time to arrive, or when it makes a synchronous write and the acknowledgment of the data modification takes its time. During a synchronous write the process enters uninterruptible sleep so the data does not get corrupted. One handy tool for seeing which call makes a process hang is latencytop. It is not the only one of its kind, but you can give it a try.
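A minimal session looks like this (latencytop needs kernel support for latency tracking and must run as root; the package name is what Ubuntu uses, adjust if needed):

# Install and run the tool; it shows, per process, which kernel calls it spends time blocked in
apt-get install latencytop
latencytop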

Note: for your information, dm stands for device mapper, not data mover.
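If you want to confirm which physical disk each dm-N volume sits on, the device-mapper tools will show the mapping:

# Show each device-mapper volume and the underlying block devices it maps to
dmsetup ls --tree

# Or list the device nodes LVM/device-mapper creates
ls -l /dev/mapper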