I think I am kinda lost with my current server setup.
It is an HP ProLiant DL160 Gen6, and I put in 4 spinning disks with a stack of mdadm + LUKS + LVM and, on top of that, btrfs (maybe I went too far?). It is really suffering on IO speed: it reads around 50 MB/s and writes around 2 MB/s, and I have a feeling that I messed something up.
One of the things I noticed is that I set up mdadm on the whole block device (sdb) and not on a partition (sdb1); would that affect anything?
Here you can see the output of fio --name=randwrite --rw=randwrite --direct=1 --bs=16k --numjobs=128 --size=200M --runtime=60 --group_reporting
when there is almost no load on the machine.
randwrite: (groupid=0, jobs=128): err= 0: pid=54290: Tue Oct 26 16:21:50 2021
write: IOPS=137, BW=2193KiB/s (2246kB/s)(131MiB/61080msec); 0 zone resets
clat (msec): min=180, max=2784, avg=924.48, stdev=318.02
lat (msec): min=180, max=2784, avg=924.48, stdev=318.02
clat percentiles (msec):
| 1.00th=[ 405], 5.00th=[ 542], 10.00th=[ 600], 20.00th=[ 693],
| 30.00th=[ 760], 40.00th=[ 818], 50.00th=[ 860], 60.00th=[ 927],
| 70.00th=[ 1011], 80.00th=[ 1133], 90.00th=[ 1267], 95.00th=[ 1452],
| 99.00th=[ 2165], 99.50th=[ 2232], 99.90th=[ 2635], 99.95th=[ 2769],
| 99.99th=[ 2769]
bw ( KiB/s): min= 3972, max= 4735, per=100.00%, avg=4097.79, stdev= 1.58, samples=8224
iops : min= 132, max= 295, avg=248.40, stdev= 0.26, samples=8224
lat (msec) : 250=0.04%, 500=2.82%, 750=25.96%, 1000=40.58%, 2000=28.67%
lat (msec) : >=2000=1.95%
cpu : usr=0.00%, sys=0.01%, ctx=18166, majf=0, minf=1412
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,8372,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=2193KiB/s (2246kB/s), 2193KiB/s-2193KiB/s (2246kB/s-2246kB/s), io=131MiB (137MB), run=61080-61080msec
Update 1: sequential writes with dd
root@hp-proliant-dl160-g6-1:~# dd if=/dev/zero of=disk-test oflag=direct bs=512k count=100
100+0 records in
100+0 records out
52428800 bytes (52 MB, 50 MiB) copied, 5.81511 s, 9.0 MB/s
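For comparison, a larger-block sequential run with an end-of-run flush (conv=fdatasync makes dd wait for the data to hit the disk before reporting the rate) helps separate per-request latency from raw throughput. This is only a sketch; the scratch file name is arbitrary:

```shell
# Sequential write test: 1 MiB blocks, flushed to disk before the
# rate is reported. Writes a 100 MiB scratch file, then removes it.
dd if=/dev/zero of=disk-test bs=1M count=100 conv=fdatasync
rm -f disk-test
```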
Kernel: 5.4.0-89-generic
OS: Ubuntu 20.04.3
mdadm: 4.1-5ubuntu1.2
lvm2: 2.03.07-1ubuntu1
blkid output
/dev/mapper/dm_crypt-0: UUID="r7TBdk-1GZ4-zbUh-007u-BfuP-dtis-bTllYi" TYPE="LVM2_member"
/dev/sda2: UUID="64528d97-f05c-4f34-a238-f7b844b3bb58" UUID_SUB="263ae70e-d2b8-4dfe-bc6b-bbc2251a9f32" TYPE="btrfs" PARTUUID="494be592-3dad-4600-b954-e2912e410b8b"
/dev/sdb: UUID="478e8132-7783-1fb1-936a-358d06dbd871" UUID_SUB="4aeb4804-6380-5421-6aea-d090e6aea8a0" LABEL="ubuntu-server:0" TYPE="linux_raid_member"
/dev/sdc: UUID="478e8132-7783-1fb1-936a-358d06dbd871" UUID_SUB="9d5a4ddd-bb9e-bb40-9b21-90f4151a5875" LABEL="ubuntu-server:0" TYPE="linux_raid_member"
/dev/sdd: UUID="478e8132-7783-1fb1-936a-358d06dbd871" UUID_SUB="f08b5e6d-f971-c622-cd37-50af8ff4b308" LABEL="ubuntu-server:0" TYPE="linux_raid_member"
/dev/sde: UUID="478e8132-7783-1fb1-936a-358d06dbd871" UUID_SUB="362025d4-a4d2-8727-6853-e503c540c4f7" LABEL="ubuntu-server:0" TYPE="linux_raid_member"
/dev/md0: UUID="a5b5bf95-1ff1-47f9-b3f6-059356e3af41" TYPE="crypto_LUKS"
/dev/mapper/vg0-lv--0: UUID="6db4e233-5d97-46d2-ac11-1ce6c72f5352" TYPE="swap"
/dev/mapper/vg0-lv--1: UUID="4e1a5131-cb91-48c4-8266-5b165d9f5071" UUID_SUB="e5fc407e-57c2-43eb-9b66-b00207ea6d91" TYPE="btrfs"
/dev/loop0: TYPE="squashfs"
/dev/loop1: TYPE="squashfs"
/dev/loop2: TYPE="squashfs"
/dev/loop3: TYPE="squashfs"
/dev/loop4: TYPE="squashfs"
/dev/loop5: TYPE="squashfs"
/dev/loop6: TYPE="squashfs"
/dev/loop7: TYPE="squashfs"
/dev/loop8: TYPE="squashfs"
/dev/loop9: TYPE="squashfs"
/dev/loop10: TYPE="squashfs"
/dev/sda1: PARTUUID="fa30c3f5-6952-45f0-b844-9bfb46fa0224"
cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sdb[0] sdc[1] sdd[2] sde[4]
5860147200 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
bitmap: 2/15 pages [8KB], 65536KB chunk
unused devices: <none>
lshw -c disk
*-disk
description: SCSI Disk
product: DT 101 G2
vendor: Kingston
physical id: 0.0.0
bus info: scsi@0:0.0.0
logical name: /dev/sda
version: 1.00
serial: xxxxxxxxxxxxxxxxxxxx
size: 7643MiB (8015MB)
capabilities: removable
configuration: ansiversion=4 logicalsectorsize=512 sectorsize=512
*-medium
physical id: 0
logical name: /dev/sda
size: 7643MiB (8015MB)
capabilities: gpt-1.00 partitioned partitioned:gpt
configuration: guid=6c166e3e-27c9-4edf-9b0d-e21892cbce41
*-disk
description: ATA Disk
product: ST2000DM008-2FR1
physical id: 0.0.0
bus info: scsi@1:0.0.0
logical name: /dev/sdb
version: 0001
serial: xxxxxxxxxxxxxxxxxxxx
size: 1863GiB (2TB)
capabilities: removable
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096
*-medium
physical id: 0
logical name: /dev/sdb
size: 1863GiB (2TB)
*-disk
description: ATA Disk
product: ST2000DM008-2FR1
physical id: 0.0.0
bus info: scsi@2:0.0.0
logical name: /dev/sdc
version: 0001
serial: xxxxxxxxxxxxxxxxxxxx
size: 1863GiB (2TB)
capabilities: removable
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096
*-medium
physical id: 0
logical name: /dev/sdc
size: 1863GiB (2TB)
*-disk
description: ATA Disk
product: WDC WD20EZBX-00A
vendor: Western Digital
physical id: 0.0.0
bus info: scsi@3:0.0.0
logical name: /dev/sdd
version: 1A01
serial: xxxxxxxxxxxxxxxxxxxx
size: 1863GiB (2TB)
capabilities: removable
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096
*-medium
physical id: 0
logical name: /dev/sdd
size: 1863GiB (2TB)
*-disk
description: ATA Disk
product: WDC WD20EZBX-00A
vendor: Western Digital
physical id: 0.0.0
bus info: scsi@4:0.0.0
logical name: /dev/sde
version: 1A01
serial: xxxxxxxxxxxxxxxxxxxx
size: 1863GiB (2TB)
capabilities: removable
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096
*-medium
physical id: 0
logical name: /dev/sde
size: 1863GiB (2TB)
Do you see anything that could be wrong in this setup?
Do you think that adding an NVMe drive on a PCIe card and using it for caching would help?
Best Answer
The poor recorded performance stems from several factors:
- Mechanical disks are simply very bad at random read/write IO. To discover just how bad they can be, append --sync=1 to your fio command (short story: they are incredibly bad, at least when compared to proper BBU RAID controllers or powerloss-protected SSDs).
- RAID5 has an inherent write penalty due to stripe read/modify/write. Moreover, it is strongly suggested to avoid it on multi-TB mechanical disks for safety reasons. Having 4 disks, please seriously consider using RAID10 instead.
- LUKS, providing software-based full-disk encryption, inevitably takes its (significant) toll on both reads and writes.
- When using BTRFS, LVM is totally unnecessary. While a fat LVM-based volume will not impair performance in any meaningful way by itself, you are nonetheless inserting another IO layer and exposing yourself to (more) alignment issues.
- Finally, BTRFS itself is not particularly fast. Your slow sequential reads in particular can be traced to BTRFS's horrible fragmentation (due to it being CoW and enforcing 4K granularity; as a comparison, to obtain good performance from ZFS on mechanical disks one should generally select 64K-128K records).
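A sketch of rebuilding the four data disks as RAID10, using the device names from this setup. This destroys the existing array and its data, so the commands are echoed rather than executed by default; set RUN to empty and run as root to actually execute:

```shell
# Recreate the array as RAID10 instead of RAID5 (DESTROYS DATA).
# Dry run by default: commands are printed, not executed.
RUN=${RUN:-echo}

$RUN mdadm --stop /dev/md0
$RUN mdadm --zero-superblock /dev/sdb /dev/sdc /dev/sdd /dev/sde
$RUN mdadm --create /dev/md0 --level=10 --raid-devices=4 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde
```

With 4 x 2 TB disks this yields ~4 TB usable instead of ~6 TB, trading capacity for much better random-write behavior and rebuild safety.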
To get a baseline performance comparison, I strongly suggest rebuilding your IO stack, measuring random and sequential read/write speed at each step. In other words:
1. Create a RAID10 array and run dd and fio on the raw array (without a filesystem).
2. If full-disk encryption is really needed, use LUKS to create an encrypted device and re-run dd + fio on the raw encrypted device (again, with no filesystem). Compare to the previous results to get an idea of what encryption means performance-wise.
3. Try both XFS and BTRFS (running the usual dd + fio quick bench) to understand how the two filesystems behave. If BTRFS is too slow, try replacing it with lvmthin and XFS (but remember that in this case you will lose user-data checksums, for which you would need yet another layer, dm-integrity, which itself commands a significant performance hit).
If all this seems a mess, well, it really is. By doing the above you are only scratching the surface of storage performance: one also has to consider real application behavior (rather than totally sequential dd or purely random fio results), cache hit ratio, IO pattern alignment, etc. But hey, a little is much better than nothing, so let's start with something basic.
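The layer-by-layer benchmark described above can be sketched as the following script. It is a dry run by default (commands are echoed, not executed), since the fio runs against raw devices are destructive; the mapper name and mount point are arbitrary examples. Set RUN to empty and run as root to execute for real:

```shell
# Benchmark each layer of the IO stack in turn (DESTRUCTIVE when run).
# Dry run by default: commands are printed, not executed.
RUN=${RUN:-echo}

FIO_ARGS="--rw=randwrite --direct=1 --sync=1 --bs=16k --numjobs=8 --size=200M --runtime=60 --group_reporting"

# 1) raw RAID array, no filesystem
$RUN fio --name=raw-array --filename=/dev/md0 $FIO_ARGS

# 2) LUKS on top of the array, still no filesystem
$RUN cryptsetup luksFormat /dev/md0
$RUN cryptsetup open /dev/md0 bench_crypt
$RUN fio --name=raw-luks --filename=/dev/mapper/bench_crypt $FIO_ARGS

# 3) a filesystem on top (XFS shown; repeat with mkfs.btrfs to compare)
$RUN mkfs.xfs /dev/mapper/bench_crypt
$RUN mount /dev/mapper/bench_crypt /mnt
$RUN fio --name=fs-test --directory=/mnt $FIO_ARGS
```

Comparing the fio numbers between steps 1, 2 and 3 shows how much each layer (RAID, encryption, filesystem) costs on its own.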