Linux – Poor performance with Linux software raid-10

linux, lsi, sas, software-raid

I have a machine with an 8-channel LSI SAS3008 controller chip, and individual drive testing shows I can write to any disk, or to all disks at once, at a sustained write speed of between 174 MB/sec and 193 MB/sec:

This is the output from the command dd if=/dev/zero of=/dev/mapper/mpath?p1 bs=1G count=100 oflag=direct run in parallel against all 12 disks:

107374182400 bytes (107 GB) copied, 556.306 s, 193 MB/s
107374182400 bytes (107 GB) copied, 566.816 s, 189 MB/s
107374182400 bytes (107 GB) copied, 568.681 s, 189 MB/s
107374182400 bytes (107 GB) copied, 578.327 s, 186 MB/s
107374182400 bytes (107 GB) copied, 586.444 s, 183 MB/s
107374182400 bytes (107 GB) copied, 590.193 s, 182 MB/s
107374182400 bytes (107 GB) copied, 592.721 s, 181 MB/s
107374182400 bytes (107 GB) copied, 598.646 s, 179 MB/s
107374182400 bytes (107 GB) copied, 602.277 s, 178 MB/s
107374182400 bytes (107 GB) copied, 604.951 s, 177 MB/s
107374182400 bytes (107 GB) copied, 605.44 s, 177 MB/s
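For reference, a run like the one above can be launched with a simple shell loop; the device names mpatha through mpathl are placeholders for whatever multipath devices the system actually presents:

    for d in /dev/mapper/mpath{a..l}; do
        dd if=/dev/zero of=${d}p1 bs=1G count=100 oflag=direct &
    done
    wait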

However, when I put these disks together as a software raid 10 device, I get around 500 MB/sec write speed. I expected to get about double that, since there is no penalty for accessing these disks at the same time.
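For context, a 12-disk md RAID 10 of this kind would be built with mdadm roughly as follows; the device names and the 512K chunk size here are assumptions, not details taken from the original setup:

    mdadm --create /dev/md10 --level=10 --raid-devices=12 --chunk=512 \
          /dev/mapper/mpath{a..l}p1
    cat /proc/mdstat        # watch the array assemble/resync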

I did notice that the md10_raid10 process, which I assume does the software RAID work itself, is nearing 80% CPU, and one core is always at 100% wait time and 0% idle. Which core that is changes, however.
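One way to confirm where that CPU time goes is to watch the md thread and the cores directly; the thread name md10_raid10 is taken from the observation above, and pidstat/mpstat come from the sysstat package:

    top -H                  # per-thread view; look for md10_raid10 pinned near 100% of one core
    pidstat -t 5            # per-thread CPU statistics every 5 seconds
    mpstat -P ALL 5         # per-core breakdown of user/system/iowait/idle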

Additionally, the performance drops even further when writing through the buffer cache to the mounted EXT4 filesystem rather than using oflag=direct to bypass the cache. The disks report 25% busy (according to munin monitoring), so the disks themselves are clearly not being driven hard, but I worry the md10 device itself may be.
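A quick way to cross-check the munin numbers is iostat, which shows utilization for the member disks, the multipath dm devices and the md device side by side:

    iostat -x 5             # compare %util and await for the sd* members, dm-* paths and md10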

Any suggestions on where to go next on this? I am attempting a hardware RAID 10 config for comparison, although it seems I can only build a 10-disk unit; even so, I hope to get 900 MB/sec sustained writes. I'll update this question as I discover more.

Edit 1:

If I put a dd command in a tight loop writing to an ext4 partition mounted on that device, and I do not use the buffer cache (oflag=direct), I can get upwards of 950 MB/sec at peak and 855 MB/sec sustained with some alterations to the mount flags.
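A loop of roughly that shape is sketched below; the mount point /mnt/md10 and the file size are placeholders, not the original values:

    mount | grep md10                       # check which ext4 mount options are in effect
    while true; do
        dd if=/dev/zero of=/mnt/md10/ddtest bs=1M count=100000 oflag=direct
    done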

If I also read with iflag=direct at the same time, I can get 480 MB/sec writes and 750 MB/sec reads.

If I write without oflag=direct, thus using the buffer cache, I get 230 MB/sec writes and 1.2 MB/sec reads, but the machine seems to be very sluggish.

So, the question is: why would using the buffer cache so seriously affect performance? I have tried various disk queueing strategies, including 'noop' at the drive level with 'deadline' or 'cfq' on the appropriate multipath dm device, 'deadline' on all devices, and 'none' on the dm device with 'deadline' on the backing drive. It seems like the backing drive should have 'none' and the multipath device should be the one I tune, but none of this affects performance at all, at least in the buffer-cache case.
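Scheduler changes of that kind are typically made through sysfs; a sketch, with sdb and dm-2 standing in for one backing drive and its multipath device:

    cat /sys/block/sdb/queue/scheduler                   # lists available schedulers, current one in []
    echo noop     > /sys/block/sdb/queue/scheduler       # backing drive
    echo deadline > /sys/block/dm-2/queue/scheduler      # multipath dm device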

Best Answer

Edit:

Your dd oflag=direct observations might be due to power management issues. Use PowerTOP to see if your CPU's C-states are switched above C1 too often under write load. If they are, try tweaking power management to ensure the CPU does not go to sleep, and re-run the benchmarks. Refer to your distro's documentation on how to do that - in most cases this will be the intel_idle.max_cstate=0 kernel boot-line parameter, but YMMV.
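A sketch of the check and the boot-line change described above; the grub update command differs between distros, so treat that step as an example rather than the exact procedure:

    powertop                                    # "Idle stats" tab: watch C-state residency under write load
    # then add the parameter to the kernel command line, e.g. in /etc/default/grub:
    #   GRUB_CMDLINE_LINUX="... intel_idle.max_cstate=0"
    grub2-mkconfig -o /boot/grub2/grub.cfg      # or update-grub, depending on the distro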

The vast difference in performance between an O_DIRECT write and a buffered write might be due to:

  • when using O_DIRECT the CPU is not sent into C3+ sleep or
  • the CPU is sent into C3+, but it does not matter as much due to the significantly simplified processing when using O_DIRECT - just pointing to a zeroed memory area and issuing a DMA write request needs fewer cycles than buffered processing and will be less latency-sensitive

Obsolete answer:

This looks very much like a bottleneck caused by the single thread in md.

Reasoning

  • the controller's data sheet is promising 6,000 MB/s of throughput
  • your parallel dd run is showing 170MB/s+ per drive, so the path is not restricted by the connecting PCIe bandwidth
  • you are seeing high, near-100% utilization rates for md10_raid10

While patches for multithreaded RAID5 checksum calculation were committed to mdraid in 2013, I cannot find anything about similar RAID1 / RAID10 enhancements, so they might simply not be there.
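A quick, hedged way to check whether the running kernel exposes any md multithreading knob for this array (group_thread_cnt is the attribute the RAID5/6 patches added; it may simply be absent for RAID10):

    ls /sys/block/md10/md/ | grep -i thread
    cat /sys/block/md10/md/group_thread_cnt 2>/dev/null \
        || echo "no multithreading knob exposed for this RAID level"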

Things to try

  • more than a single writing thread with dd, just to see if it changes anything (see the sketch after this list)
  • a different RAID10 implementation - LVM RAID 10 comes to mind, but you might also look at ZFS [1], which has been designed with exactly this use case (many disks, no hardware RAID controllers) in mind
  • possibly a more recent Kernel version
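As referenced in the first point above, a multi-writer test against the mounted filesystem could look like the sketch below (mount point and sizes are placeholders); if aggregate throughput rises well above the single-writer figure, the single md thread becomes a likelier suspect:

    for i in 1 2 3 4; do
        dd if=/dev/zero of=/mnt/md10/writer$i bs=1M count=20000 oflag=direct &
    done
    wait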

FWIW, you will rarely (if ever) see write performance peak out at the available bandwidth with mechanical storage media (especially with a non-CoW filesystem). Most of the time, you will be restricted by seek times, so peak bandwidth should not be of great concern, as long as it meets your minimum requirements.


[1] If you do ZFS, you should refine your testing method, as writing all-zero blocks to a ZFS dataset might be arbitrarily fast. Zeros are not written to disk but just linked to the all-zero block if compression is enabled for the dataset.
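One way around that pitfall is to write pre-generated random data instead of /dev/zero; a sketch, with the pool mount point /tank and the file sizes as assumptions:

    dd if=/dev/urandom of=/dev/shm/random.bin bs=1M count=4096     # ~4 GiB of incompressible data
    for i in $(seq 1 25); do
        dd if=/dev/shm/random.bin of=/tank/testfile$i bs=1M conv=fsync
    done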
