Linux – How to Increase Speed of RAID 5 with mdadm, LUKS, and LVM

btrfs, linux, lvm, mdadm

I think I am kinda lost with my current server setup.
It is an HP ProLiant DL160 Gen6, and I put 4 spinning disks in it with a stack of mdadm + LUKS + LVM and, on top of it, btrfs (maybe I went too far?). IO speed is really suffering: it reads around 50 MB/s and writes around 2 MB/s, and I have a feeling that I messed something up.

One of the things I noted is that I set up mdadm on the whole block devices (sdb) and not on partitions (sdb1) - would that affect anything?
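For anyone who wants to double-check this on a similar setup, something like the following should show whether the RAID members are whole disks or partitions (plain mdadm/lsblk, output omitted here):

mdadm --examine /dev/sdb                   # md superblock sits directly on the whole disk
lsblk -o NAME,TYPE,FSTYPE,SIZE /dev/sdb    # no partition table, just linux_raid_member
cat /proc/mdstat                           # members listed as sdb/sdc/sdd/sde, not sdb1 etc.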

Here you can see the output of fio --name=randwrite --rw=randwrite --direct=1 --bs=16k --numjobs=128 --size=200M --runtime=60 --group_reporting when the machine is almost idle.

randwrite: (groupid=0, jobs=128): err= 0: pid=54290: Tue Oct 26 16:21:50 2021
  write: IOPS=137, BW=2193KiB/s (2246kB/s)(131MiB/61080msec); 0 zone resets
    clat (msec): min=180, max=2784, avg=924.48, stdev=318.02
     lat (msec): min=180, max=2784, avg=924.48, stdev=318.02
    clat percentiles (msec):
     |  1.00th=[  405],  5.00th=[  542], 10.00th=[  600], 20.00th=[  693],
     | 30.00th=[  760], 40.00th=[  818], 50.00th=[  860], 60.00th=[  927],
     | 70.00th=[ 1011], 80.00th=[ 1133], 90.00th=[ 1267], 95.00th=[ 1452],
     | 99.00th=[ 2165], 99.50th=[ 2232], 99.90th=[ 2635], 99.95th=[ 2769],
     | 99.99th=[ 2769]
   bw (  KiB/s): min= 3972, max= 4735, per=100.00%, avg=4097.79, stdev= 1.58, samples=8224
   iops        : min=  132, max=  295, avg=248.40, stdev= 0.26, samples=8224
  lat (msec)   : 250=0.04%, 500=2.82%, 750=25.96%, 1000=40.58%, 2000=28.67%
  lat (msec)   : >=2000=1.95%
  cpu          : usr=0.00%, sys=0.01%, ctx=18166, majf=0, minf=1412
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8372,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=2193KiB/s (2246kB/s), 2193KiB/s-2193KiB/s (2246kB/s-2246kB/s), io=131MiB (137MB), run=61080-61080msec

Update 1: sequential writes with dd

root@hp-proliant-dl160-g6-1:~# dd if=/dev/zero of=disk-test oflag=direct bs=512k count=100
100+0 records in
100+0 records out
52428800 bytes (52 MB, 50 MiB) copied, 5.81511 s, 9.0 MB/s

Kernel: 5.4.0-89-generic

OS: Ubuntu 20.04.3

mdadm: 4.1-5ubuntu1.2

lvm2: 2.03.07-1ubuntu1

blkid output

/dev/mapper/dm_crypt-0: UUID="r7TBdk-1GZ4-zbUh-007u-BfuP-dtis-bTllYi" TYPE="LVM2_member"
/dev/sda2: UUID="64528d97-f05c-4f34-a238-f7b844b3bb58" UUID_SUB="263ae70e-d2b8-4dfe-bc6b-bbc2251a9f32" TYPE="btrfs" PARTUUID="494be592-3dad-4600-b954-e2912e410b8b"
/dev/sdb: UUID="478e8132-7783-1fb1-936a-358d06dbd871" UUID_SUB="4aeb4804-6380-5421-6aea-d090e6aea8a0" LABEL="ubuntu-server:0" TYPE="linux_raid_member"
/dev/sdc: UUID="478e8132-7783-1fb1-936a-358d06dbd871" UUID_SUB="9d5a4ddd-bb9e-bb40-9b21-90f4151a5875" LABEL="ubuntu-server:0" TYPE="linux_raid_member"
/dev/sdd: UUID="478e8132-7783-1fb1-936a-358d06dbd871" UUID_SUB="f08b5e6d-f971-c622-cd37-50af8ff4b308" LABEL="ubuntu-server:0" TYPE="linux_raid_member"
/dev/sde: UUID="478e8132-7783-1fb1-936a-358d06dbd871" UUID_SUB="362025d4-a4d2-8727-6853-e503c540c4f7" LABEL="ubuntu-server:0" TYPE="linux_raid_member"
/dev/md0: UUID="a5b5bf95-1ff1-47f9-b3f6-059356e3af41" TYPE="crypto_LUKS"
/dev/mapper/vg0-lv--0: UUID="6db4e233-5d97-46d2-ac11-1ce6c72f5352" TYPE="swap"
/dev/mapper/vg0-lv--1: UUID="4e1a5131-cb91-48c4-8266-5b165d9f5071" UUID_SUB="e5fc407e-57c2-43eb-9b66-b00207ea6d91" TYPE="btrfs"
/dev/loop0: TYPE="squashfs"
/dev/loop1: TYPE="squashfs"
/dev/loop2: TYPE="squashfs"
/dev/loop3: TYPE="squashfs"
/dev/loop4: TYPE="squashfs"
/dev/loop5: TYPE="squashfs"
/dev/loop6: TYPE="squashfs"
/dev/loop7: TYPE="squashfs"
/dev/loop8: TYPE="squashfs"
/dev/loop9: TYPE="squashfs"
/dev/loop10: TYPE="squashfs"
/dev/sda1: PARTUUID="fa30c3f5-6952-45f0-b844-9bfb46fa0224"

cat /proc/mdstat

Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sdb[0] sdc[1] sdd[2] sde[4]
      5860147200 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 2/15 pages [8KB], 65536KB chunk

unused devices: <none>

lshw -c disk

  *-disk
       description: SCSI Disk
       product: DT 101 G2
       vendor: Kingston
       physical id: 0.0.0
       bus info: scsi@0:0.0.0
       logical name: /dev/sda
       version: 1.00
       serial: xxxxxxxxxxxxxxxxxxxx
       size: 7643MiB (8015MB)
       capabilities: removable
       configuration: ansiversion=4 logicalsectorsize=512 sectorsize=512
     *-medium
          physical id: 0
          logical name: /dev/sda
          size: 7643MiB (8015MB)
          capabilities: gpt-1.00 partitioned partitioned:gpt
          configuration: guid=6c166e3e-27c9-4edf-9b0d-e21892cbce41
  *-disk
       description: ATA Disk
       product: ST2000DM008-2FR1
       physical id: 0.0.0
       bus info: scsi@1:0.0.0
       logical name: /dev/sdb
       version: 0001
       serial: xxxxxxxxxxxxxxxxxxxx
       size: 1863GiB (2TB)
       capabilities: removable
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096
     *-medium
          physical id: 0
          logical name: /dev/sdb
          size: 1863GiB (2TB)
  *-disk
       description: ATA Disk
       product: ST2000DM008-2FR1
       physical id: 0.0.0
       bus info: scsi@2:0.0.0
       logical name: /dev/sdc
       version: 0001
       serial: xxxxxxxxxxxxxxxxxxxx
       size: 1863GiB (2TB)
       capabilities: removable
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096
     *-medium
          physical id: 0
          logical name: /dev/sdc
          size: 1863GiB (2TB)
  *-disk
       description: ATA Disk
       product: WDC WD20EZBX-00A
       vendor: Western Digital
       physical id: 0.0.0
       bus info: scsi@3:0.0.0
       logical name: /dev/sdd
       version: 1A01
       serial: xxxxxxxxxxxxxxxxxxxx
       size: 1863GiB (2TB)
       capabilities: removable
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096
     *-medium
          physical id: 0
          logical name: /dev/sdd
          size: 1863GiB (2TB)
  *-disk
       description: ATA Disk
       product: WDC WD20EZBX-00A
       vendor: Western Digital
       physical id: 0.0.0
       bus info: scsi@4:0.0.0
       logical name: /dev/sde
       version: 1A01
       serial: xxxxxxxxxxxxxxxxxxxx
       size: 1863GiB (2TB)
       capabilities: removable
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096
     *-medium
          physical id: 0
          logical name: /dev/sde
          size: 1863GiB (2TB)

Do you see anything that could be wrong in the setup?
Do you think that adding an NVMe drive on a PCIe card and using it for caching would help?

Best Answer

The poor recorded performance stems from several factors:

  • mechanical disks are simply very bad at random read/write IO. To discover just how bad they can be, simply append --sync=1 to your fio command, as shown after this list (short story: they are incredibly bad, at least when compared to proper BBU RAID controllers or power-loss-protected SSDs);

  • RAID5 has an inherent write penalty due to stripe read/modify/write: each small write means reading the old data and old parity, then writing the new data and new parity - up to four disk IOs per application write. Moreover, it is strongly suggested to avoid RAID5 on multi-TB mechanical disks for data-safety reasons (rebuilds take a very long time, with a real chance of hitting an unrecoverable read error before they finish). Having 4 disks, please seriously consider using RAID10 instead;

  • LUKS, providing software-based full-disk encryption, inevitably takes its (significant) toll on both reads and writes (cryptsetup benchmark, shown after this list, gives a rough ceiling for the encryption throughput);

  • when using BTRFS, LVM is totally unnecessary. While a fat (non-thin) LVM volume will not impair performance in any meaningful way by itself, you are nonetheless inserting another IO layer and exposing yourself to (more) alignment issues;

  • finally, BTRFS itself is not particularly fast. In particular, your slow sequential reads can be traced to BTRFS's severe fragmentation (due to it being CoW and enforcing 4K granularity - as a comparison, to obtain good performance from ZFS on mechanical disks one should generally select 64K-128K records).
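As a quick way to put numbers on the first and third points above, here is the fio command from the question with --sync=1 appended, plus cryptsetup benchmark to gauge the raw encryption throughput of the CPU. This is only a sketch; run it on an otherwise idle machine:

# same random-write test, but with synchronous writes - expect dramatically lower numbers on spinning disks
fio --name=randwrite --rw=randwrite --direct=1 --sync=1 --bs=16k --numjobs=128 --size=200M --runtime=60 --group_reporting

# in-memory benchmark of the ciphers LUKS can use, to estimate the encryption overhead ceiling
cryptsetup benchmark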

To get a baseline performance comparison, I strongly suggest rebuilding your IO stack and measuring random & sequential read/write speed at each step. In other words:

  • create a RAID10 array and run dd and fio on the raw array (without a filesystem) - see the command sketch after this list;

  • if full-disk encryption is really needed, use LUKS to create an encrypted device and re-run dd + fio on the raw encrypted device (again, with no filesystem). Compare with the previous results to get an idea of the performance cost;

  • try both XFS and BTRFS (running the usual quick dd + fio bench) to understand how the two filesystems behave. If BTRFS is too slow, try replacing it with lvmthin and XFS (but remember that in this case you will lose user-data checksumming, for which you would need yet another layer, dm-integrity, which itself carries a significant performance hit).
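A minimal command sketch of the three steps above, assuming the same four disks (/dev/sdb through /dev/sde) and reusing the fio/dd invocations from the question; testcrypt is just an arbitrary mapper name. All of this wipes the disks, so it is meant for an empty test array only:

# 1) RAID10 array on the raw disks (destructive)
mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# benchmark the bare array
fio --name=randwrite --filename=/dev/md0 --rw=randwrite --direct=1 --bs=16k --numjobs=128 --size=200M --runtime=60 --group_reporting
dd if=/dev/zero of=/dev/md0 oflag=direct bs=512k count=100

# 2) optional LUKS layer, then repeat the same fio/dd against /dev/mapper/testcrypt
cryptsetup luksFormat /dev/md0
cryptsetup open /dev/md0 testcrypt

# 3) filesystem comparison: format, mount, and benchmark a file under the mount point,
#    once with XFS and once with BTRFS
mkfs.xfs /dev/mapper/testcrypt
mount /dev/mapper/testcrypt /mnt
fio --name=randwrite --directory=/mnt --rw=randwrite --direct=1 --bs=16k --numjobs=128 --size=200M --runtime=60 --group_reporting
dd if=/dev/zero of=/mnt/disk-test oflag=direct bs=512k count=100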

If all this seems a mess, well, it really is. By doing all the above you are only scratching the surface of storage performance: one also has to consider real application behavior (rather than purely sequential dd or purely random fio results), cache hit ratio, IO pattern alignment, etc. But hey - a little is much better than nothing, so let's start with something basic.
