Linux – Why is the I/O scheduler not merging requests?


Some time (or some amount of data written) after a server reboot, the write speed drops to less than 1 MB/s. This happens regardless of the filesystem (also on a raw partition) and regardless of HDD (HW RAID) or SSD (SW RAID, with the SSDs connected to the motherboard AHCI ports, not the RAID board). We are testing with the command dd if=/dev/zero of=tst bs=1M oflag=dsync (I have also tried bs=1k and running without dsync, but performance was no better).

The only odd thing I noticed is that avgrq-sz is only 8 in the iostat output (on other tested servers it was more than 600) and that the request rate is about 100 req/s (also on the SSDs). Running several dd processes in parallel gave each of them about 1 MB/s and about 100 req/s.
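For reference, the parallel test was run along these lines (a minimal sketch; the tst1..tst8 file names are just placeholders):

for i in $(seq 8); do
    dd if=/dev/zero of=tst$i bs=1M oflag=dsync &   # one dsync writer per file
done
wait                                               # wait for all writers to finish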

Sample iostat -xN 1 output:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00    0.00  125.00     0.00   500.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00  124.00     0.00   496.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     3.00    0.00  128.00     0.00   524.00     8.19     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     6.00    0.00  124.00     0.00   728.00    11.74     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00  125.00     0.00   500.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     3.00    0.00  128.00     0.00   524.00     8.19     0.00    0.00    0.00    0.00   0.00   0.00

Iostat output with 8x dd running:

sdc               0.00    64.00    0.00  959.00     0.00  7560.00    15.77     0.00    0.00    0.00    0.00   0.00   0.00

The lsblk -O output is consistent with other servers that do not have this problem (e.g. MIN-IO, RQ-SIZE, LOG-SEC). The current kernel is 4.9.16-gentoo, but the issue started with an older kernel. Running dd with oflag=direct is fast.

EDIT: Based on shodanshok's answer I now see that the requests are indeed small, but the question is why the I/O scheduler does not merge them into larger requests. I have tried both the cfq and deadline schedulers. Is there anything I can check (or compare against the other servers)?
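To compare the relevant tunables against a healthy server, I am collecting them roughly like this (a sketch; otherserver is a placeholder for a host that does not show the problem):

grep . /sys/block/sdc/queue/* 2>/dev/null | sort > local-queue.txt                    # dump all queue tunables, one key:value per line
ssh otherserver 'grep . /sys/block/sdc/queue/* 2>/dev/null | sort' > remote-queue.txt
diff local-queue.txt remote-queue.txt                                                 # show only the settings that differ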

Output when running with oflag=direct (speed is OK):

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00    2.00  649.00     8.00 141312.00   434.16     2.78    4.26    2.00    4.27   1.53  99.60
sdc               0.00     0.00    0.00  681.00     0.00 143088.00   420.23     2.71    3.99    0.00    3.99   1.46  99.20
sdc               0.00     0.00    2.00  693.00     8.00 146160.00   420.63     2.58    3.71    0.00    3.72   1.42  98.80
sdc               0.00    49.00    2.00  710.00     8.00 146928.00   412.74     2.68    3.76   22.00    3.71   1.39  99.20
sdc               0.00     0.00    1.00  642.00     4.00 136696.00   425.19     2.43    3.79   60.00    3.71   1.42  91.60

The server is a Dell PowerEdge R330 with 32 GB RAM and an Intel E3-1270 CPU; the HDDs are behind an LSI MegaRAID 3108 controller and the SSDs are connected to the onboard SATA ports. The filesystem is ext3, but the same happens when running dd against a raw partition.

Output of lsblk (sdc is the HW RAID HDD volume, sda/sdb are the SW RAID SSDs):

NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sdb       8:16   0 223.6G  0 disk
|-sdb4    8:20   0     1K  0 part
|-sdb2    8:18   0   140G  0 part
| `-md1   9:1    0   140G  0 raid1 /
|-sdb5    8:21   0    25G  0 part
| `-md2   9:2    0    25G  0 raid1 /var/log
|-sdb3    8:19   0     1G  0 part
|-sdb1    8:17   0    70M  0 part
| `-md0   9:0    0    70M  0 raid1 /boot
`-sdb6    8:22   0  57.5G  0 part
  `-md3   9:3    0  57.5G  0 raid1 /tmp
sr0      11:0    1  1024M  0 rom
sdc       8:32   0   3.7T  0 disk
`-sdc1    8:33   0   3.7T  0 part  /home
sda       8:0    0 223.6G  0 disk
|-sda4    8:4    0     1K  0 part
|-sda2    8:2    0   140G  0 part
| `-md1   9:1    0   140G  0 raid1 /
|-sda5    8:5    0    25G  0 part
| `-md2   9:2    0    25G  0 raid1 /var/log
|-sda3    8:3    0     1G  0 part
|-sda1    8:1    0    70M  0 part
| `-md0   9:0    0    70M  0 raid1 /boot
`-sda6    8:6    0  57.5G  0 part
  `-md3   9:3    0  57.5G  0 raid1 /tmp

With oflag=direct the speed is OK, but the issue is that applications do not use direct I/O, so even a plain cp is slow.
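To verify that a given application really does not open files with O_DIRECT, its open flags can be traced, for example (a rough sketch; the source and destination paths are placeholders):

strace -f -e trace=open,openat cp /home/src/bigfile /home/dst/bigfile 2>&1 | grep O_DIRECT   # no output means no direct-I/O opens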

/sys/block/sdc/queue/hw_sector_size : 512
/sys/block/sdc/queue/max_segment_size : 65536
/sys/block/sdc/queue/physical_block_size : 512
/sys/block/sdc/queue/discard_max_bytes : 0
/sys/block/sdc/queue/rotational : 1
/sys/block/sdc/queue/iosched/fifo_batch : 16
/sys/block/sdc/queue/iosched/read_expire : 500
/sys/block/sdc/queue/iosched/writes_starved : 2
/sys/block/sdc/queue/iosched/write_expire : 5000
/sys/block/sdc/queue/iosched/front_merges : 1
/sys/block/sdc/queue/write_same_max_bytes : 0
/sys/block/sdc/queue/max_sectors_kb : 256
/sys/block/sdc/queue/discard_zeroes_data : 0
/sys/block/sdc/queue/read_ahead_kb : 128
/sys/block/sdc/queue/discard_max_hw_bytes : 0
/sys/block/sdc/queue/nomerges : 0
/sys/block/sdc/queue/max_segments : 64
/sys/block/sdc/queue/rq_affinity : 1
/sys/block/sdc/queue/iostats : 1
/sys/block/sdc/queue/dax : 0
/sys/block/sdc/queue/minimum_io_size : 512
/sys/block/sdc/queue/io_poll : 0
/sys/block/sdc/queue/max_hw_sectors_kb : 256
/sys/block/sdc/queue/add_random : 1
/sys/block/sdc/queue/optimal_io_size : 0
/sys/block/sdc/queue/nr_requests : 128
/sys/block/sdc/queue/scheduler : noop [deadline] cfq
/sys/block/sdc/queue/discard_granularity : 0
/sys/block/sdc/queue/logical_block_size : 512
/sys/block/sdc/queue/max_integrity_segments : 0
/sys/block/sdc/queue/write_cache : write through

Best Answer

The request size (avgrq-sz) is small because you are issuing small write requests. Your dd command, although specifying a 1 MB block size, is hitting the pagecache, so each 1 MB write is really submitted as a collection of 256 4 KB requests. This is reflected by avgrq-sz: expressed in 512-byte units, its value of 8 aligns perfectly with a 4 KB page (8 × 512 bytes = 4 KB). Moreover, even SSDs can perform poorly with synchronized writes, as requested by oflag=dsync. That said, please note that the I/O scheduler should merge these small 4 KB requests into bigger ones, but this is not happening.
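If you want to see what actually reaches the device, and whether any merging happens on the way, you can trace the block layer, for example (a minimal sketch, assuming the blktrace package is installed):

blktrace -d /dev/sdc -o - | blkparse -i - | grep -E ' [DM] '   # D = request issued to the driver (size in sectors), M = back-merge events

With pagecache writes and a working scheduler you would expect to see M (merge) events and D lines much larger than 8 sectors; if every D line is 8 sectors and no merges show up, the scheduler really is not coalescing the requests.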

Some things to check:

  • what do you see when issuing cat /sys/block/sdc/queue/scheduler? If noop is the selected scheduler, try switching to deadline (see the sketch after this list)
  • is your /sys/block/sdc/queue/max_sectors_kb at least 1024? (also covered in the sketch below)
  • try executing dd if=/dev/zero of=tst bs=1M oflag=direct: I/O performance should be much higher, right?
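A minimal sketch of the corresponding commands (run as root; note that on this particular system max_sectors_kb is already capped at 256 by max_hw_sectors_kb, so the last write will be rejected there):

cat /sys/block/sdc/queue/scheduler                 # currently: noop [deadline] cfq
echo deadline > /sys/block/sdc/queue/scheduler     # select the deadline scheduler
cat /sys/block/sdc/queue/max_sectors_kb            # 256 on this controller
echo 1024 > /sys/block/sdc/queue/max_sectors_kb    # attempt to raise it; fails if above max_hw_sectors_kb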