We have just received two brand-new Supermicro 1028U-TN10RT+ servers with 10 NVMe slots, two of them populated with Intel DC P3600 800GB NVMe SSDs.
We were eager to test the performance of the drives, as the specification promises very good read (up to 2.6GB/s) and write (up to 1GB/s) throughput. We put the two drives in a software RAID 1 configuration, as this is what we want to use in production. We ran tests using fio, and the results are somewhat confusing.
Full results are below, but the recap is: the two drives in a RAID 1 array achieve random-write speeds of ~550MB/s (and this was one of the better runs), whereas a single drive (no RAID) writes at ~920MB/s.
Does software RAID really add that much overhead? Is there some additional tuning we can do?
The system has 128GB of RAM and runs CentOS 7.1, with the kernel upgraded to 4.2.4.
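For reference, a two-device md RAID 1 array like the one shown in the mdadm --detail output further down would typically be created along these lines (the exact invocation used here is an assumption; the device names are taken from that output):

# create the mirror; for members this large, mdadm adds an internal
# write-intent bitmap by default ("Intent Bitmap : Internal" below)
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
    /dev/nvme0n1p1 /dev/nvme1n1p1

The fio command line used for the tests: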
fio --name=randwrite --ioengine=libaio --iodepth=64 --rw=randwrite \
--bs=64k --direct=1 --size=32G --numjobs=8 --runtime=240 \
--group_reporting
Results on a single drive, xfs filesystem, directly mounted:
randwrite: (groupid=0, jobs=8): err= 0: pid=9307: Tue Oct 27 14:36:35 2015
write: io=217971MB, bw=929843KB/s, iops=14528, runt=240043msec
slat (usec): min=5, max=933, avg=24.10, stdev= 9.29
clat (usec): min=32, max=135283, avg=35212.65, stdev=27746.71
lat (usec): min=49, max=135300, avg=35237.02, stdev=27746.76
clat percentiles (usec):
| 1.00th=[ 215], 5.00th=[ 2224], 10.00th=[ 5600], 20.00th=[12992],
| 30.00th=[16768], 40.00th=[19328], 50.00th=[23168], 60.00th=[33536],
| 70.00th=[47872], 80.00th=[63232], 90.00th=[79360], 95.00th=[88576],
| 99.00th=[102912], 99.50th=[107008], 99.90th=[116224], 99.95th=[119296],
| 99.99th=[125440]
bw (KB /s): min=42411, max=298624, per=12.51%, avg=116326.24, stdev=24050.53
lat (usec) : 50=0.01%, 100=0.27%, 250=0.87%, 500=0.77%, 750=0.55%
lat (usec) : 1000=0.47%
lat (msec) : 2=1.67%, 4=3.43%, 10=7.17%, 20=27.37%, 50=28.86%
lat (msec) : 100=26.99%, 250=1.55%
cpu : usr=1.75%, sys=4.98%, ctx=3056950, majf=0, minf=56673
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=0/w=3487535/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: io=217971MB, aggrb=929842KB/s, minb=929842KB/s, maxb=929842KB/s, mint=240043msec, maxt=240043msec
Disk stats (read/write):
nvme2n1: ios=0/4691372, merge=0/0, ticks=0/154695600, in_queue=155446639, util=100.00%
Results when using md RAID 1:
randwrite: (groupid=0, jobs=8): err= 0: pid=8553: Tue Oct 27 14:32:03 2015
write: io=130141MB, bw=555110KB/s, iops=8673, runt=240069msec
slat (usec): min=20, max=349051, avg=130.51, stdev=2000.03
clat (usec): min=59, max=912669, avg=58782.87, stdev=50750.42
lat (usec): min=95, max=927440, avg=58913.81, stdev=51010.14
clat percentiles (usec):
| 1.00th=[ 668], 5.00th=[ 3472], 10.00th=[ 8512], 20.00th=[21888],
| 30.00th=[32640], 40.00th=[41728], 50.00th=[48896], 60.00th=[58112],
| 70.00th=[71168], 80.00th=[86528], 90.00th=[114176], 95.00th=[142336],
| 99.00th=[216064], 99.50th=[250880], 99.90th=[577536], 99.95th=[716800],
| 99.99th=[872448]
bw (KB /s): min= 70, max=175104, per=12.56%, avg=69708.68, stdev=20589.85
lat (usec) : 100=0.02%, 250=0.29%, 500=0.43%, 750=0.38%, 1000=0.36%
lat (msec) : 2=1.22%, 4=2.98%, 10=5.56%, 20=7.47%, 50=32.45%
lat (msec) : 100=34.50%, 250=13.81%, 500=0.39%, 750=0.08%, 1000=0.05%
cpu : usr=1.28%, sys=6.46%, ctx=1727469, majf=0, minf=69488
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=0/w=2082262/d=0, short=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: io=130141MB, aggrb=555110KB/s, minb=555110KB/s, maxb=555110KB/s, mint=240069msec, maxt=240069msec
Disk stats (read/write):
md0: ios=0/2615652, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=11136/2630386, aggrmerge=0/0, aggrticks=10763/72152582, aggrin_queue=72527830, aggrutil=99.40%
nvme0n1: ios=22273/2619265, merge=0/0, ticks=21526/14920779, in_queue=14979917, util=49.15%
nvme1n1: ios=0/2641508, merge=0/0, ticks=0/129384385, in_queue=130075743, util=99.40%
mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Tue Oct 27 13:12:34 2015
Raid Level : raid1
Array Size : 781278208 (745.08 GiB 800.03 GB)
Used Dev Size : 781278208 (745.08 GiB 800.03 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Tue Oct 27 14:54:24 2015
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : localhost.localdomain:0 (local to host localhost.localdomain)
UUID : cf2ce291:0c52f361:bc40dffa:918595d9
Events : 706
Number Major Minor RaidDevice State
0 259 3 0 active sync /dev/nvme0n1p1
1 259 1 1 active sync /dev/nvme1n1p1
Best Answer
It can be a side effect of the internal write-intent bitmap. Remove it with

mdadm --grow /dev/md0 --bitmap=none

and re-try with fio. That said, I strongly advise against going into production without a bitmap-enabled array: a crash or power outage would otherwise force a full byte-by-byte scan/compare of the whole array, while a write-intent bitmap guarantees much faster recovery.
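If the bitmap does turn out to be the bottleneck, it can also be re-added with a coarser granularity rather than dropped entirely: a larger bitmap chunk means fewer bitmap updates per write, at the cost of a somewhat longer post-crash resync. A minimal sketch (the 128M chunk size is an assumption to benchmark, not a tested recommendation):

# re-enable the internal write-intent bitmap with a larger chunk,
# trading recovery granularity for lower per-write overhead
mdadm --grow /dev/md0 --bitmap=internal --bitmap-chunk=128M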