iSCSI performance between SAN and hypervisor horribly slow

iscsi storage-area-network virtualization

We have a poor man's SAN set up on a 1U Ubuntu server running iSCSI-Target with two 300GB drives in RAID-0. We use it as block-level storage for virtual machines. The hypervisor is connected to the SAN over gigabit Ethernet, using dedicated interfaces on a dedicated VLAN.

We only have a single virtual machine set up and are running some benchmarks. If we run hdparm -t /dev/sda1 from the virtual machine, we get "ok" performance of 75MB/s from the virtual machine to the SAN. Then we compile a package with ./configure and make. Things start fine, but then all of a sudden the load average on the SAN climbs to 7+ and everything slows to a crawl. When we SSH into the SAN and run top, sure enough the load is 7+, but CPU usage is basically nothing, and the server has 1.5GB of memory available. When we kill the compile on the virtual machine, the load on the SAN slowly drops back below 1.

What in the world is causing this? How can we diagnose this further?
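One thing worth checking first: on Linux the load average counts tasks that are runnable plus tasks in uninterruptible sleep (state `D`), which usually means they are waiting on disk I/O. A load of 7+ with an idle CPU therefore points at I/O-blocked tasks rather than CPU work. Below is a minimal Python sketch, assuming a Linux host with the standard /proc layout, that reads the load averages and lists the threads currently in `D` state. The function names are illustrative, not from any existing tool.

```python
import os
from typing import List, Tuple


def parse_stat(line: str) -> Tuple[int, str, str]:
    """Return (id, comm, state) from one /proc/<pid>/task/<tid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the *last* ')'.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    task_id = int(line[:open_paren].strip())
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 2:].split()[0]  # first field after ") "
    return task_id, comm, state


def blocked_tasks(proc_root: str = "/proc") -> List[Tuple[int, str]]:
    """List (tid, comm) for every thread in uninterruptible sleep ('D')."""
    blocked = []
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue  # skip non-process entries such as /proc/loadavg
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    tid_num, comm, state = parse_stat(f.read())
            except OSError:
                continue  # thread exited or is not readable
            if state == "D":
                blocked.append((tid_num, comm))
    return sorted(blocked)


def load_averages(path: str = "/proc/loadavg") -> Tuple[float, float, float]:
    """Return the 1-, 5- and 15-minute load averages."""
    with open(path) as f:
        one, five, fifteen = f.read().split()[:3]
    return float(one), float(five), float(fifteen)
```

Running `blocked_tasks()` a few times on the SAN while the compile is slowing down should show whether the iSCSI target's threads (or kernel flush threads) are the ones stuck in `D` state.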

Here are two screenshots from the SAN during high load.

1> Output of iotop on the SAN:

http://imgur.com/2doVP

2> Output of top on the SAN:

http://i.stack.imgur.com/UK0f8.png

Best Answer

You should see a significant increase in performance after enabling write caching on the target (the details depend on the implementation; what are you using, tgt?) and on your disks:

hdparm -W 1 /dev/sda
hdparm -W 1 /dev/sdb
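To confirm the setting before and after the change: running hdparm -W with a device but no value only reports the current write-caching state. A small Python wrapper, assuming hdparm is installed, the script runs as root, and the output contains a line of the form `write-caching =  1 (on)` (the regex below is an assumption about that format, and the function names are hypothetical):

```python
import re
import subprocess
from typing import Optional

# Assumed hdparm output shape, e.g. " write-caching =  1 (on)".
_WRITE_CACHE_RE = re.compile(r"write-caching\s*=\s*(\d)")


def parse_write_cache(hdparm_output: str) -> Optional[bool]:
    """Return True/False for on/off, or None if no setting was reported."""
    match = _WRITE_CACHE_RE.search(hdparm_output)
    if match is None:
        return None  # e.g. "not supported" or an unexpected format
    return match.group(1) == "1"


def write_cache_enabled(device: str) -> Optional[bool]:
    """Query (not change) the drive's write cache via 'hdparm -W <device>'."""
    result = subprocess.run(
        ["hdparm", "-W", device],
        capture_output=True,
        text=True,
        check=True,  # raise if hdparm fails, e.g. when not run as root
    )
    return parse_write_cache(result.stdout)
```

For example, `write_cache_enabled("/dev/sda")` and `write_cache_enabled("/dev/sdb")` should both return True once the commands above have taken effect.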

There is a price, however: enabling write caching endangers data integrity (especially if you run databases) in the event of a power outage or a hang of the SAN, because data that the system believes has been permanently written to disk may only reside in volatile DRAM. To mitigate this risk, use a controller with BBWC (battery-backed write cache), where cached data survives a power outage for a while (typically 1-2 days).

The main "problem" with ESXi is that it constantly sync()s the disks. The need to write metadata to VMFS (if you have it) makes it even worse. The VMware community forums are full of "my disks are slow" posts from people using controllers without write caches.
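The cost of constant syncing can be observed directly. Below is a rough Python micro-benchmark (a sketch, not a rigorous tool; the function name, block count and 4 KiB block size are arbitrary choices). It times the same number of writes twice: once with a single os.fsync() at the end, and once with an fsync after every block. Both runs end with the data flushed, so the difference is the price of waiting on the disks for each write, which is what a sync-heavy client like ESXi pays without a write cache.

```python
import os
import tempfile
import time


def timed_writes(directory: str, blocks: int = 200, block_size: int = 4096,
                 sync_each: bool = False) -> float:
    """Write `blocks` blocks to a scratch file in `directory`; return seconds."""
    payload = b"\0" * block_size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(blocks):
            os.write(fd, payload)
            if sync_each:
                # Ask the kernel to flush this file to the storage device
                # before issuing the next write.
                os.fsync(fd)
        os.fsync(fd)  # one final flush so both variants end with data on disk
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)  # remove the scratch file
```

For example, compare `timed_writes("/mnt/array")` with `timed_writes("/mnt/array", sync_each=True)`, where `/mnt/array` is a hypothetical mount point on the RAID-0 array, with the disk write cache off and then on. Avoid running it on tmpfs, where fsync does not reach a disk.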