Linux – fio config to measure IOPS against provider SLA

benchmark, linux, performance

A provider has given us 500 IOPS/TB as their SLA for disk performance in a VMware & RAID5-SAN environment. This is apparently measured with:

  • 16kB average transfer block size
  • 3:1 read:write ratio
  • Multithreaded IO operations
  • 80% random IO modelling
  • Read cache hit of 20%

What I want to do is determine whether any particular Linux VM is getting that performance, and then run the same benchmark with other providers so I can compare.

From looking around, fio seems to be the most configurable tool for measuring the above. The config I've got so far is:

[global]
blocksize=16k
rwmixread=75     # 3:1 read:write ratio
ramp_time=30
runtime=600
time_based
buffered=1
# size = free-ram * 80% / 5
# so we get a ~20% cache hit across the 5x processes
# this is for an 8GB ram host with 7.3GB free after buffers/cache
size=1180m

# create a mix to get to ~80% random IO
# also means we'll be doing at least 5x IO operations in parallel
[sla-0]
readwrite=randrw:2

[sla-1]
readwrite=randrw:2

[sla-2]
readwrite=randrw

[sla-3]
readwrite=randrw

[sla-4]
readwrite=randrw

Suggestions for improvements? Is using buffered and the default ioengine the best way to go?
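For comparison, I'm also considering a variant that bypasses the guest page cache and drives a deeper queue, so the numbers reflect what the SAN delivers rather than Linux caching. This is just my own sketch (it drops the 20% read-cache-hit part of their model, and the iodepth/numjobs values are guesses on my part):

[global]
blocksize=16k
rwmixread=75      # 3:1 read:write ratio
ramp_time=30
runtime=600
time_based
direct=1          # bypass the page cache and hit the SAN directly
ioengine=libaio   # async engine, so iodepth actually queues IO
iodepth=8         # guess at what "multithreaded IO operations" means
size=1g

[sla-direct]
readwrite=randrw
numjobs=5         # keep five parallel workers, as before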

If I run the original config above on an otherwise-idle machine with 4 virtual cores, 8GB of RAM and 470GB of allocated storage, the SLA implies I should get at least 235 IOPS (500 * 0.47). The results I get are:

sla-0: (g=0): rw=randrw, bs=16K-16K/16K-16K, ioengine=sync, iodepth=2
sla-1: (g=0): rw=randrw, bs=16K-16K/16K-16K, ioengine=sync, iodepth=2
sla-2: (g=0): rw=randrw, bs=16K-16K/16K-16K, ioengine=sync, iodepth=2
sla-3: (g=0): rw=randrw, bs=16K-16K/16K-16K, ioengine=sync, iodepth=2
sla-4: (g=0): rw=randrw, bs=16K-16K/16K-16K, ioengine=sync, iodepth=2
Starting 5 processes
sla-0: Laying out IO file(s) (1 file(s) / 1180MB)
sla-1: Laying out IO file(s) (1 file(s) / 1180MB)
sla-2: Laying out IO file(s) (1 file(s) / 1180MB)
sla-3: Laying out IO file(s) (1 file(s) / 1180MB)
sla-4: Laying out IO file(s) (1 file(s) / 1180MB)
Jobs: 5 (f=5): [mmmmm] [100.0% done] [5931K/1966K /s] [362/120 iops] [eta 00m:00s] 
sla-0: (groupid=0, jobs=1): err= 0: pid=16701
  read : io=1086MB, bw=1853KB/s, iops=115, runt=600003msec
    clat (usec): min=4, max=1771K, avg=8607.53, stdev=22114.44
    bw (KB/s) : min=    0, max= 4087, per=24.44%, avg=1914.96, stdev=1130.29
  write: io=372416KB, bw=635586B/s, iops=38, runt=600003msec
    clat (usec): min=6, max=2574, avg=57.38, stdev=79.65
    bw (KB/s) : min=    0, max=11119, per=26.07%, avg=679.63, stdev=517.84
  cpu          : usr=0.08%, sys=0.63%, ctx=64513, majf=0, minf=109
  IO depths    : 1=107.4%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=69474/23276, short=0/0
     lat (usec): 10=10.23%, 20=8.89%, 50=4.15%, 100=11.66%, 250=0.83%
     lat (usec): 500=1.48%, 750=1.41%, 1000=0.82%
     lat (msec): 2=0.83%, 4=1.56%, 10=47.07%, 20=5.91%, 50=4.24%
     lat (msec): 100=0.55%, 250=0.29%, 500=0.06%, 750=0.01%, 1000=0.01%
     lat (msec): 2000=0.01%
sla-1: (groupid=0, jobs=1): err= 0: pid=16702
  read : io=963360KB, bw=1605KB/s, iops=100, runt=600180msec
    clat (usec): min=4, max=2396K, avg=9934.23, stdev=30986.37
    bw (KB/s) : min=    0, max= 4657, per=21.64%, avg=1695.89, stdev=1273.00
  write: io=326000KB, bw=556206B/s, iops=33, runt=600180msec
    clat (usec): min=6, max=3882, avg=55.07, stdev=77.92
    bw (KB/s) : min=    0, max=10708, per=23.74%, avg=618.92, stdev=559.01
  cpu          : usr=0.08%, sys=0.53%, ctx=55500, majf=0, minf=129
  IO depths    : 1=108.5%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=60210/20375, short=0/0
     lat (usec): 10=11.36%, 20=9.63%, 50=3.56%, 100=11.97%, 250=0.81%
     lat (usec): 500=0.66%, 750=0.50%, 1000=0.37%
     lat (msec): 2=0.33%, 4=0.74%, 10=49.56%, 20=3.78%, 50=5.48%
     lat (msec): 100=0.60%, 250=0.43%, 500=0.16%, 750=0.04%, 1000=0.01%
     lat (msec): 2000=0.01%, >=2000=0.01%
sla-2: (groupid=0, jobs=1): err= 0: pid=16703
  read : io=827584KB, bw=1379KB/s, iops=86, runt=600012msec
    clat (usec): min=397, max=2396K, avg=11569.59, stdev=31237.03
    bw (KB/s) : min=    0, max= 4237, per=18.60%, avg=1457.59, stdev=1113.89
  write: io=276192KB, bw=471358B/s, iops=28, runt=600012msec
    clat (usec): min=8, max=8339, avg=63.95, stdev=121.52
    bw (KB/s) : min=    0, max= 8531, per=20.52%, avg=534.85, stdev=478.91
  cpu          : usr=0.07%, sys=0.54%, ctx=57019, majf=0, minf=89
  IO depths    : 1=109.9%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=51724/17262, short=0/0
     lat (usec): 10=0.98%, 20=5.38%, 50=3.53%, 100=13.68%, 250=0.92%
     lat (usec): 500=0.60%, 750=0.39%, 1000=0.22%
     lat (msec): 2=0.24%, 4=2.26%, 10=59.15%, 20=4.90%, 50=6.28%
     lat (msec): 100=0.78%, 250=0.48%, 500=0.18%, 750=0.03%, 1000=0.01%
     lat (msec): 2000=0.01%, >=2000=0.01%
sla-3: (groupid=0, jobs=1): err= 0: pid=16704
  read : io=865920KB, bw=1443KB/s, iops=90, runt=600005msec
    clat (usec): min=369, max=2396K, avg=11052.97, stdev=32396.85
    bw (KB/s) : min=    0, max= 5984, per=19.47%, avg=1525.97, stdev=1164.42
  write: io=285568KB, bw=487365B/s, iops=29, runt=600005msec
    clat (usec): min=7, max=11910, avg=65.72, stdev=154.09
    bw (KB/s) : min=    0, max=11064, per=21.38%, avg=557.30, stdev=534.59
  cpu          : usr=0.07%, sys=0.57%, ctx=59458, majf=0, minf=109
  IO depths    : 1=109.5%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=54120/17848, short=0/0
     lat (usec): 10=0.99%, 20=5.11%, 50=3.58%, 100=13.64%, 250=0.89%
     lat (usec): 500=0.71%, 750=0.48%, 1000=0.30%
     lat (msec): 2=0.70%, 4=4.00%, 10=57.63%, 20=5.21%, 50=5.40%
     lat (msec): 100=0.70%, 250=0.43%, 500=0.16%, 750=0.03%, 1000=0.01%
     lat (msec): 2000=0.01%, >=2000=0.01%
sla-4: (groupid=0, jobs=1): err= 0: pid=16705
  read : io=934752KB, bw=1558KB/s, iops=97, runt=600007msec
    clat (usec): min=187, max=2396K, avg=10236.87, stdev=26080.98
    bw (KB/s) : min=    0, max=11419, per=20.74%, avg=1625.28, stdev=1338.26
  write: io=304528KB, bw=519721B/s, iops=31, runt=600007msec
    clat (usec): min=7, max=7572, avg=67.29, stdev=117.27
    bw (KB/s) : min=    0, max=10772, per=22.06%, avg=575.17, stdev=560.68
  cpu          : usr=0.08%, sys=0.60%, ctx=63685, majf=0, minf=129
  IO depths    : 1=108.7%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=58422/19033, short=0/0
     lat (usec): 10=0.81%, 20=4.77%, 50=3.62%, 100=13.77%, 250=0.97%
     lat (usec): 500=1.45%, 750=0.64%, 1000=0.53%
     lat (msec): 2=1.75%, 4=4.71%, 10=53.48%, 20=6.92%, 50=5.53%
     lat (msec): 100=0.56%, 250=0.37%, 500=0.08%, 750=0.02%, 1000=0.01%
     lat (msec): 2000=0.01%, >=2000=0.01%

Run status group 0 (all jobs):
   READ: io=4593MB, aggrb=7836KB/s, minb=1412KB/s, maxb=1897KB/s, mint=600003msec, maxt=600180msec
  WRITE: io=1528MB, aggrb=2607KB/s, minb=471KB/s, maxb=635KB/s, mint=600003msec, maxt=600180msec

Disk stats (read/write):
  dm-0: ios=298995/596154, merge=0/0, ticks=3107720/433061790, in_queue=436170340, util=99.68%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
    sdb: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=-nan%

Summing the read and write IOPS across the jobs (why doesn't fio include this in its summary?) I get 647, which seems to comfortably exceed their specified service level. Is there anything obvious I'm missing, or are their metrics massively skewed for some workloads? (Specifically, I'm interested in PostgreSQL with data-warehouse workloads.)
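A side note on the summing: it looks like fio will aggregate the per-job numbers itself if group_reporting is set, e.g. the five sections could collapse into one job:

[sla]
readwrite=randrw
numjobs=5
group_reporting   # one combined read/write summary instead of per-job stats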

Best Answer

SQL and data-warehouse workloads are more like 8:1 reads to writes, all small-block, all random. In any case, anything other than random reads is easy to cache and is unlikely to be causing your disk performance issues. Without knowing how they lay out their disks it's hard to help much, but consider asking them exactly what they mean by "RAID5-SAN environment".
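If you want to sanity-check against something closer to that profile, a rough fio sketch (my assumption of the workload, not a measured PostgreSQL trace) would use the 8kB PostgreSQL page size, mostly random IO and roughly 8:1 reads to writes:

[global]
blocksize=8k      # PostgreSQL page size
readwrite=randrw
rwmixread=89      # roughly 8:1 read:write
direct=1          # keep the filesystem cache out of the measurement
ioengine=libaio
iodepth=16        # guess at the number of concurrent backends/queries
runtime=600
time_based
size=4g           # working set per job; total well above RAM

[dw-like]
numjobs=4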

Since they specify the SLA as IOPS per TB, I'd hazard a guess that each volume they provide to you is supposed to sit on a separate RAID-5 set, allowing for more IOPS as they add volumes. Poor performance could easily be caused by bad RAID neighbours: volumes on the same RAID set as yours that take more than their fair share of storage resources. The flip side is that sometimes you'll exceed your SLA, and sometimes you'll have to put up with high latency.

Start by telling them you're unhappy with the performance; they might simply move you to a less-utilized RAID set, which could solve all your problems. Also ask whether they have any RAID-10 storage available, and consider requesting a volume there instead of the RAID-5. If the problem comes back, then consider getting your own storage or finding another host that can provide better performance.
