We have just received two brand-new Supermicro 1028U-TN10RT+ servers with 10 NVMe slots, two of them populated with Intel DC P3600 800GB NVMe SSDs.
We were eager to test the performance of the drives, as the specification promises very good read (up to 2.6GB/s) and write (up to 1GB/s) throughput. We put the two drives in a software RAID 1 configuration, as this is what we want to use in production. We ran tests using fio, and the results are somewhat confusing.
Full results are below, but the recap is: the two drives in a RAID 1 array achieve random-write speeds of ~550MB/s (and this was one of the better runs), whereas a single drive (no RAID) writes at ~920MB/s.
Does software RAID really add that much overhead? Is there some additional tuning we can do?
The system has 128GB of RAM and runs CentOS 7.1, with the kernel upgraded to 4.2.4.
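For reference, a two-device md RAID 1 array like the one shown in the mdadm --detail output further down would typically be created along these lines (the exact invocation used here is an assumption; the device names are taken from that output):

# create the mirror; for members this large, mdadm adds an internal
# write-intent bitmap by default ("Intent Bitmap : Internal" below)
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
    /dev/nvme0n1p1 /dev/nvme1n1p1

The fio command line used for the tests: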
fio --name=randwrite --ioengine=libaio --iodepth=64 --rw=randwrite \
--bs=64k --direct=1 --size=32G --numjobs=8 --runtime=240 \
--group_reporting
Results on a single drive, xfs filesystem, directly mounted:
randwrite: (groupid=0, jobs=8): err= 0: pid=9307: Tue Oct 27 14:36:35 2015
write: io=217971MB, bw=929843KB/s, iops=14528, runt=240043msec
slat (usec): min=5, max=933, avg=24.10, stdev= 9.29
clat (usec): min=32, max=135283, avg=35212.65, stdev=27746.71
lat (usec): min=49, max=135300, avg=35237.02, stdev=27746.76
clat percentiles (usec):
| 1.00th=[ 215], 5.00th=[ 2224], 10.00th=[ 5600], 20.00th=[12992],
| 30.00th=[16768], 40.00th=[19328], 50.00th=[23168], 60.00th=[33536],
| 70.00th=[47872], 80.00th=[63232], 90.00th=[79360], 95.00th=[88576],
| 99.00th=[102912], 99.50th=[107008], 99.90th=[116224], 99.95th=[119296],
| 99.99th=[125440]
bw (KB /s): min=42411, max=298624, per=12.51%, avg=116326.24, stdev=24050.53
lat (usec) : 50=0.01%, 100=0.27%, 250=0.87%, 500=0.77%, 750=0.55%
lat (usec) : 1000=0.47%
lat (msec) : 2=1.67%, 4=3.43%, 10=7.17%, 20=27.37%, 50=28.86%
lat (msec) : 100=26.99%, 250=1.55%
cpu : usr=1.75%, sys=4.98%, ctx=3056950, majf=0, minf=56673
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=0/w=3487535/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: io=217971MB, aggrb=929842KB/s, minb=929842KB/s, maxb=929842KB/s, mint=240043msec, maxt=240043msec
Disk stats (read/write):
nvme2n1: ios=0/4691372, merge=0/0, ticks=0/154695600, in_queue=155446639, util=100.00%
Results when using md RAID 1:
randwrite: (groupid=0, jobs=8): err= 0: pid=8553: Tue Oct 27 14:32:03 2015
write: io=130141MB, bw=555110KB/s, iops=8673, runt=240069msec
slat (usec): min=20, max=349051, avg=130.51, stdev=2000.03
clat (usec): min=59, max=912669, avg=58782.87, stdev=50750.42
lat (usec): min=95, max=927440, avg=58913.81, stdev=51010.14
clat percentiles (usec):
| 1.00th=[ 668], 5.00th=[ 3472], 10.00th=[ 8512], 20.00th=[21888],
| 30.00th=[32640], 40.00th=[41728], 50.00th=[48896], 60.00th=[58112],
| 70.00th=[71168], 80.00th=[86528], 90.00th=[114176], 95.00th=[142336],
| 99.00th=[216064], 99.50th=[250880], 99.90th=[577536], 99.95th=[716800],
| 99.99th=[872448]
bw (KB /s): min= 70, max=175104, per=12.56%, avg=69708.68, stdev=20589.85
lat (usec) : 100=0.02%, 250=0.29%, 500=0.43%, 750=0.38%, 1000=0.36%
lat (msec) : 2=1.22%, 4=2.98%, 10=5.56%, 20=7.47%, 50=32.45%
lat (msec) : 100=34.50%, 250=13.81%, 500=0.39%, 750=0.08%, 1000=0.05%
cpu : usr=1.28%, sys=6.46%, ctx=1727469, majf=0, minf=69488
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=0/w=2082262/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: io=130141MB, aggrb=555110KB/s, minb=555110KB/s, maxb=555110KB/s, mint=240069msec, maxt=240069msec
Disk stats (read/write):
md0: ios=0/2615652, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=11136/2630386, aggrmerge=0/0, aggrticks=10763/72152582, aggrin_queue=72527830, aggrutil=99.40%
nvme0n1: ios=22273/2619265, merge=0/0, ticks=21526/14920779, in_queue=14979917, util=49.15%
nvme1n1: ios=0/2641508, merge=0/0, ticks=0/129384385, in_queue=130075743, util=99.40%
mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Tue Oct 27 13:12:34 2015
Raid Level : raid1
Array Size : 781278208 (745.08 GiB 800.03 GB)
Used Dev Size : 781278208 (745.08 GiB 800.03 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Tue Oct 27 14:54:24 2015
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : localhost.localdomain:0 (local to host localhost.localdomain)
UUID : cf2ce291:0c52f361:bc40dffa:918595d9
Events : 706
Number Major Minor RaidDevice State
0 259 3 0 active sync /dev/nvme0n1p1
1 259 1 1 active sync /dev/nvme1n1p1
Best Answer
It can be a side effect of the internal write-intent bitmap. Remove it with

mdadm --grow /dev/md0 --bitmap=none

and re-try with fio. That said, I strongly advise against going into production without a bitmap-enabled array: a crash or power outage would otherwise force a full byte-by-byte scan/compare of the whole array, while a write-intent bitmap guarantees much faster recovery.
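If the bitmap does turn out to be the bottleneck, it can also be re-added with a coarser granularity rather than dropped entirely: a larger bitmap chunk means fewer bitmap updates per write, at the cost of a somewhat longer post-crash resync. A minimal sketch (the 128M chunk size is an assumption to benchmark, not a tested recommendation):

# re-enable the internal write-intent bitmap with a larger chunk,
# trading recovery granularity for lower per-write overhead
mdadm --grow /dev/md0 --bitmap=internal --bitmap-chunk=128M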