Linux Performance – Resolve Dirty Pages Blocking Synchronous Writes

filesystems, linux, memory, performance, sles11

We have processes doing background writes of large files. We would like those writes to have minimal impact on other processes.

Here is a test run on SLES11 SP4. The server has a large amount of memory, which allows it to accumulate 4GB of dirty pages.

> dd if=/dev/zero of=todel bs=1048576 count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 3.72657 s, 1.2 GB/s
> dd if=/dev/zero of=zer oflag=sync bs=512 count=1  
1+0 records in
1+0 records out
512 bytes (512 B) copied, 16.6997 s, 0.0 kB/s

real    0m16.701s
user    0m0.000s
sys     0m0.000s
> grep Dirty /proc/meminfo
Dirty:           4199704 kB

This is my investigation so far:

  • SLES11 SP4 (3.0.101-63)
  • type ext3 (rw,nosuid,nodev,noatime)
  • deadline scheduler
  • over 120GB reclaimable memory at the time
  • dirty_ratio set to 40%, dirty_background_ratio 10%, 30s expire, 5s writeback (see the sysctl check after this list)
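
One way to check and temporarily adjust those tunables (standard sysctl names; the two time values are in centiseconds, and the lowered values below are only illustrative):

> sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs
> sysctl -w vm.dirty_background_ratio=5 vm.dirty_ratio=10   # illustrative lower values for a test run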

Here are my questions:

  • since 4GB of memory is still dirty at the end of the test, I conclude that the IO scheduler has not been called in the test above. Is that right?
  • since the slowness persists after the first dd finishes, I conclude this issue also has nothing to do with the kernel allocating memory or any "copy on write" happening when dd fills its buffer (dd always writes from the same buffer).
  • is there a way to investigate more deeply what is blocked? Any interesting counters to watch (see the sketch after this list)? Any idea on the source of the contention?
  • we are thinking of either reducing the dirty_ratio values or performing the first dd in synchronous mode. Any other directions to investigate? Is there a drawback to making the first dd synchronous? I'm afraid it will be prioritized over other legitimate processes doing asynchronous writes.
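
For anyone reproducing this, a simple way to watch the writeback state during the test and to see where the blocked writer is stuck (standard procfs interfaces only):

> watch -n1 'grep -E "Dirty|Writeback" /proc/meminfo'   # dirty and under-writeback page counters
> echo w > /proc/sysrq-trigger                          # dump kernel stacks of blocked (D state) tasks
> dmesg | tail -50                                      # the stack of the stuck dd should show up here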

see also

https://www.novell.com/support/kb/doc.php?id=7010287

limit linux background flush (dirty pages)

https://stackoverflow.com/questions/3755765/what-posix-fadvise-args-for-sequential-file-write/3756466?sgp=2#3756466

http://yarchive.net/comp/linux/dirty_limits.html


EDIT:

There is an ext2 file system on the same device. On that file system, there is no freeze at all! The only performance impact occurs while the dirty pages are being flushed, where a synchronous call can take up to 0.3s, very far from what we experience with our ext3 file system.
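
For reference, the comparison was just the same pair of commands run on the ext2 mount (the mount point below is hypothetical):

> cd /mnt/ext2                                       # hypothetical mount point of the ext2 file system
> dd if=/dev/zero of=todel bs=1048576 count=4096     # build up dirty pages again
> dd if=/dev/zero of=zer oflag=sync bs=512 count=1   # small synchronous write: at worst ~0.3s here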


EDIT2:

Following @Matthew Ife's comment, I tried doing the synchronous write while opening the file without O_TRUNC, and you won't believe the result!

> dd if=/dev/zero of=zer oflag=sync bs=512 count=1
> dd if=/dev/zero of=todel bs=1048576 count=4096
> dd if=/dev/zero of=zer oflag=sync bs=512 count=1 conv=notrunc
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000185427 s, 2.8 MB/s

dd was opening the file with these flags:

open("zer", O_WRONLY|O_CREAT|O_TRUNC|O_SYNC, 0666) = 3

with the notrunc option, it becomes

open("zer", O_WRONLY|O_CREAT|O_SYNC, 0666) = 3

and the synchronous write completes instantly!
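
The open() lines above are strace output; a minimal way to capture just them is something like:

> strace -e trace=open dd if=/dev/zero of=zer oflag=sync bs=512 count=1 conv=notrunc 2>&1 | grep '"zer"'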

Well, it is not completely satisfying for my use case (I'm doing an msync, which behaves in this fashion). However, I am now able to trace what write and msync are doing differently!


final EDIT: I can't believe I hit this:
https://www.novell.com/support/kb/doc.php?id=7016100

In fact under SLES11 dd is opening the file with

open("zer", O_WRONLY|O_CREAT|O_DSYNC, 0666) = 3

and O_DSYNC == O_SYNC!

Conclusion:

For my use case I should probably use

dd if=/dev/zero of=zer oflag=dsync bs=512 count=1 conv=notrunc

Under SLES11, running with oflag=sync will really be running with oflag=dsync, no matter what strace says.
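
One way to confirm the flag aliasing on the installed glibc (the header location may differ between distributions):

> grep -nE 'define O_SYNC|define O_DSYNC' /usr/include/bits/fcntl.h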

Best Answer

A couple of things I'd be interested to know the result of:

  1. Initially creating the large file with fallocate and then writing into it.

  2. Setting dirty_background_bytes much, much lower (say 1GiB) and using CFQ as the scheduler. Note that in this test it might be more representative to run the small dd in the middle of the big run.

For option 1, you might find you avoid all the data=ordered semantics, since the block allocation is already done (and done quickly) because the space was pre-allocated via fallocate and the metadata is set up prior to the write. It would be useful to test whether this really is the case. I have some confidence, though, that it will improve performance.
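
A sketch of that test with the fallocate(1) utility (this has to be on an ext4 file system, as noted below):

> fallocate -l 4294967296 todel                                   # pre-allocate 4 GiB of extents and metadata up front
> dd if=/dev/zero of=todel bs=1048576 count=4096 conv=notrunc     # write into the pre-allocated file without truncating it
> dd if=/dev/zero of=zer oflag=sync bs=512 count=1 conv=notrunc   # repeat the small synchronous write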

For option 2, you can use ionice a bit more. Deadline is demonstrably faster than CFQ, although CFQ attempts to organize IO per process, so that each process gets a fairer share of the IO.
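
A sketch of that combination, assuming the file system sits on sda (adjust the device name and the ionice class as needed):

> echo 1073741824 > /proc/sys/vm/dirty_background_bytes             # 1 GiB: start background writeback much earlier
> echo cfq > /sys/block/sda/queue/scheduler                         # assumed device name; switch scheduler for the test
> ionice -c2 -n7 dd if=/dev/zero of=todel bs=1048576 count=4096 &   # big write at lowest best-effort priority
> sleep 2
> dd if=/dev/zero of=zer oflag=sync bs=512 count=1 conv=notrunc     # small sync write in the middle of the big run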

I read somewhere (can't find the source now) that dirty_background_ratio will block writes against the individual committing process (effectively making the big process slower) to prevent one process from starving all the others. Given how little information I can find on that behaviour now, I have less confidence this will work.

Oh: I should point out that fallocate relies on extents, so you'll need to use ext4.
