Linux Performance – Resolve Dirty Pages Blocking Synchronous Writes

filesystems, linux, memory, performance, sles11

We have processes doing background writes of large files. We would like those writes to have minimal impact on other processes.

Here is a test run on SLES11 SP4. The server has a large amount of memory, which allows it to accumulate 4GB of dirty pages.

> dd if=/dev/zero of=todel bs=1048576 count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 3.72657 s, 1.2 GB/s
> dd if=/dev/zero of=zer oflag=sync bs=512 count=1  
1+0 records in
1+0 records out
512 bytes (512 B) copied, 16.6997 s, 0.0 kB/s

real    0m16.701s
user    0m0.000s
sys     0m0.000s
> grep Dirty /proc/meminfo
Dirty:           4199704 kB

This is my investigation so far:

  • SLES11 SP4 (3.0.101-63)
  • type ext3 (rw,nosuid,nodev,noatime)
  • deadline scheduler
  • over 120GB reclaimable memory at the time
  • dirty_ratio set to 40%, dirty_background_ratio 10%, 30s expire, 5s writeback (see the sysctl check after this list)
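
One way to check and temporarily adjust those tunables (standard sysctl names; the two time values are in centiseconds, and the lowered values below are only illustrative):

> sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs
> sysctl -w vm.dirty_background_ratio=5 vm.dirty_ratio=10   # illustrative lower values for a test run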

Here are my questions:

  • since 4GB of memory is still dirty at the end of the test, I conclude that the IO scheduler has not been called in the test above. Is that right?
  • since the slowness persists after the first dd finishes, I conclude this issue also has nothing to do with the kernel allocating memory or any "copy on write" happening when dd fills its buffer (dd always writes from the same buffer).
  • is there a way to investigate more deeply what is blocked? Any interesting counters to watch (see the sketch after this list)? Any idea on the source of the contention?
  • we are thinking of either reducing the dirty_ratio values or performing the first dd in synchronous mode. Any other directions to investigate? Is there a drawback to making the first dd synchronous? I'm afraid it will be prioritized over other legitimate processes doing asynchronous writes.
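
For anyone reproducing this, a simple way to watch the writeback state during the test and to see where the blocked writer is stuck (standard procfs interfaces only):

> watch -n1 'grep -E "Dirty|Writeback" /proc/meminfo'   # dirty and under-writeback page counters
> echo w > /proc/sysrq-trigger                          # dump kernel stacks of blocked (D state) tasks
> dmesg | tail -50                                      # the stack of the stuck dd should show up here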

see also

https://www.novell.com/support/kb/doc.php?id=7010287

limit linux background flush (dirty pages)

https://stackoverflow.com/questions/3755765/what-posix-fadvise-args-for-sequential-file-write/3756466?sgp=2#3756466

http://yarchive.net/comp/linux/dirty_limits.html


EDIT:

There is an ext2 file system on the same device. On that file system, there is no freeze at all! The only performance impact occurs while the dirty pages are being flushed, where a synchronous call can take up to 0.3s, very far from what we experience with our ext3 file system.
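
For reference, the comparison was just the same pair of commands run on the ext2 mount (the mount point below is hypothetical):

> cd /mnt/ext2                                       # hypothetical mount point of the ext2 file system
> dd if=/dev/zero of=todel bs=1048576 count=4096     # build up dirty pages again
> dd if=/dev/zero of=zer oflag=sync bs=512 count=1   # small synchronous write: at worst ~0.3s here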


EDIT2:

Following @Matthew Ife's comment, I tried doing the synchronous write while opening the file without O_TRUNC, and you won't believe the result!

> dd if=/dev/zero of=zer oflag=sync bs=512 count=1
> dd if=/dev/zero of=todel bs=1048576 count=4096
> dd if=/dev/zero of=zer oflag=sync bs=512 count=1 conv=notrunc
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000185427 s, 2.8 MB/s

dd was opening the file with these flags:

open("zer", O_WRONLY|O_CREAT|O_TRUNC|O_SYNC, 0666) = 3

with the notrunc option, it becomes

open("zer", O_WRONLY|O_CREAT|O_SYNC, 0666) = 3

and the synchronous write completes instantly!
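
The open() lines above are strace output; a minimal way to capture just them is something like:

> strace -e trace=open dd if=/dev/zero of=zer oflag=sync bs=512 count=1 conv=notrunc 2>&1 | grep '"zer"'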

Well, it is not completely satisfying for my use case (I'm doing an msync, which behaves in this fashion). However, I am now able to trace what write and msync are doing differently!


final EDIT: I can't believe I hit this:
https://www.novell.com/support/kb/doc.php?id=7016100

In fact under SLES11 dd is opening the file with

open("zer", O_WRONLY|O_CREAT|O_DSYNC, 0666) = 3

and O_DSYNC == O_SYNC!

Conclusion:

For my use case I should probably use

dd if=/dev/zero of=zer oflag=dsync bs=512 count=1 conv=notrunc

Under SLES11, running with oflag=sync will really be running with oflag=dsync, no matter what strace says.
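
One way to confirm the flag aliasing on the installed glibc (the header location may differ between distributions):

> grep -nE 'define O_SYNC|define O_DSYNC' /usr/include/bits/fcntl.h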

Best Answer

A couple of things I'd be interested to know the result of:

  1. Initially creating the large file with fallocate and then writing into it.

  2. Setting dirty_background_bytes much, much lower (say 1GiB) and using CFQ as the scheduler. Note that in this test it might be more representative to run the small dd in the middle of the big run.

For option 1, you might find you avoid all the data=ordered semantics, since the block allocation is already done (and done quickly) because the space was pre-allocated via fallocate and the metadata is set up prior to the write. It would be useful to test whether this really is the case. I have some confidence, though, that it will improve performance.
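
A sketch of that test with the fallocate(1) utility (this has to be on an ext4 file system, as noted below):

> fallocate -l 4294967296 todel                                   # pre-allocate 4 GiB of extents and metadata up front
> dd if=/dev/zero of=todel bs=1048576 count=4096 conv=notrunc     # write into the pre-allocated file without truncating it
> dd if=/dev/zero of=zer oflag=sync bs=512 count=1 conv=notrunc   # repeat the small synchronous write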

For option 2, you can use ionice a bit more. Deadline is demonstrably faster than CFQ, although CFQ attempts to organize IO per process, so that each process gets a fairer share of the IO.
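
A sketch of that combination, assuming the file system sits on sda (adjust the device name and the ionice class as needed):

> echo 1073741824 > /proc/sys/vm/dirty_background_bytes             # 1 GiB: start background writeback much earlier
> echo cfq > /sys/block/sda/queue/scheduler                         # assumed device name; switch scheduler for the test
> ionice -c2 -n7 dd if=/dev/zero of=todel bs=1048576 count=4096 &   # big write at lowest best-effort priority
> sleep 2
> dd if=/dev/zero of=zer oflag=sync bs=512 count=1 conv=notrunc     # small sync write in the middle of the big run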

I read somewhere (can't find the source now) that dirty_background_ratio will block writes against the individual committing process (effectively making the big process slower) to prevent one process from starving all the others. Given how little information I can find on that behaviour now, I have less confidence this will work.

Oh: I should point out that fallocate relies on extents, so you'll need to use ext4.
