I have a 30 TB hardware RAID-6 array (LSI 9280-8e) of 10 Intel DC S4500 SSDs that is used for database purposes. The OS is Debian 7.11 with a 3.2 kernel. The filesystem is XFS mounted with the nobarrier option.
Seeing random-I/O performance that was sluggish compared to my expectations, I started to investigate what was going on by running fio benchmarks.
To my surprise, when I ran fio against a 1 TB file in a random-read configuration
(iodepth=32 and ioengine=libaio), I got ~3000 IOPS, which is much lower than I was expecting.
random-read: (groupid=0, jobs=1): err= 0: pid=128531
read : io=233364KB, bw=19149KB/s, iops=4787 , runt= 12187msec
...
cpu : usr=1.94%, sys=5.81%, ctx=58484, majf=0, minf=53
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=58341/w=0/d=0, short=r=0/w=0/d=0
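For reference, a fio job roughly matching this run might look like the one below. The filename is a placeholder and the 4k block size is an inference (the iostat avgrq-sz of 8 sectors, i.e. 4 KiB, shown later is consistent with it), since the question doesn't include the exact job file:

```ini
; sketch of the buffered random-read job (filename and bs assumed)
[random-read]
rw=randread
bs=4k
size=1T
ioengine=libaio
iodepth=32
direct=0
filename=/path/to/testfile
```

Switching this same job to direct=1 is what produces the second set of results below.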
However, if I use the direct=1 option (i.e. bypassing Linux's page cache), I get ~40000 IOPS, which is what I'd like to see.
random-read: (groupid=0, jobs=1): err= 0: pid=130252
read : io=2063.7MB, bw=182028KB/s, iops=45507 , runt= 11609msec
....
cpu : usr=6.93%, sys=23.29%, ctx=56503, majf=0, minf=54
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=528291/w=0/d=0, short=r=0/w=0/d=0
I seem to have all the right settings for the SSD partition:
the scheduler, read-ahead, and rotational flag.
root@XX:~# cat /sys/block/sdd/queue/scheduler
[noop] deadline cfq
root@XX:~# cat /sys/block/sdd/queue/rotational
0
root@XX:~# blockdev --getra /dev/sdd
0
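For completeness, these settings could be applied (e.g. from a boot script) with something like the following sketch; the device name is taken from the question and would need adjusting:

```shell
# Sketch: apply the SSD-friendly settings shown above (device name assumed)
echo noop > /sys/block/sdd/queue/scheduler   # no request reordering for SSD/RAID
echo 0    > /sys/block/sdd/queue/rotational  # mark device as non-rotational
blockdev --setra 0 /dev/sdd                  # disable read-ahead
```

These do not persist across reboots on their own; a udev rule or init script is the usual way to make them stick.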
Am I still missing something that lowers the buffered performance so much? Or is such a difference between direct and buffered I/O expected?
I also looked at iostat output during the two runs.
This is when direct=1 was used:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdd 0.00 0.00 48110.00 0.00 192544.00 0.00 8.00 27.83 0.58 0.58 0.00 0.02 99.60
This is the buffered run:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdd 0.00 0.00 4863.00 0.00 19780.00 0.00 8.13 0.89 0.18 0.18 0.00 0.18 85.60
So it looks like the key difference is the queue size (avgqu-sz), which is small when using buffered I/O.
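The two avgqu-sz figures are in fact consistent with Little's law (average number in flight ≈ arrival rate × mean time in the system), so iostat is internally coherent here. A quick sanity check using the r/s and r_await columns from the two runs above:

```python
# Sanity-check iostat's avgqu-sz via Little's law:
#   avg queue size ≈ IOPS × mean wait time.
# Figures are taken from the two iostat runs in the question.

def avg_queue_size(iops: float, await_ms: float) -> float:
    """Estimate avgqu-sz from r/s and r_await (milliseconds)."""
    return iops * (await_ms / 1000.0)

direct = avg_queue_size(48110.0, 0.58)    # direct=1 run
buffered = avg_queue_size(4863.0, 0.18)   # buffered run

print(f"direct=1: ~{direct:.1f} in flight (iostat reported 27.83)")
print(f"buffered: ~{buffered:.2f} in flight (iostat reported 0.89)")
```

In other words, the buffered run really is submitting roughly one request at a time, despite iodepth=32 in the job file.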
I find that weird given that nr_requests and queue_depth are both high:
root@XX:~# cat /sys/block/sdd/queue/nr_requests
128
root@XX:~# cat /sys/block/sda/device/queue_depth
256
Any advice here?
Best Answer
First, upgrade if at all possible: not only do you get kernel improvements, but Wheezy (Debian 7) is end of life.
Yes, you see higher utilization and queue depth when direct=1, and the fio documentation calls out this case in particular: with ioengine=libaio, Linux only supports queued (truly asynchronous) behavior for non-buffered I/O, i.e. with direct=1 or buffered=0. So libaio requires O_DIRECT to actually be asynchronous; without it, submissions effectively complete synchronously through the page cache, one at a time. That is an important implementation detail to know, and it explains the avgqu-sz of ~0.89 in your buffered run.
Also try a control test with ioengine=psync and direct=0. Even synchronous I/O through the page cache can deliver a lot of IOPS.
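Such a control job might look like the following sketch (the filename and block size are assumptions, mirroring the earlier job):

```ini
; sketch of a synchronous buffered control test
[control-read]
rw=randread
bs=4k
size=1T
ioengine=psync
direct=0
filename=/path/to/testfile
```

If this gets similar IOPS to the buffered libaio run, that confirms libaio was degrading to synchronous submission rather than the cache itself being slow.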
All of this sidesteps the real question: what was the problem with the database workload you were running? Include the problem symptoms, software versions, configuration, and performance metrics (iostat). The DBMS's I/O implementation may be wildly different from what you simulated: the system calls used, multiple files and jobs doing I/O, any number of things. This is worth its own question if you want to investigate further.