Linux Performance – Limiting Background Flush (Dirty Pages)

hard drivelinuxperformance

Background flushing on Linux happens when either too much written data is pending (adjustable via /proc/sys/vm/dirty_background_ratio) or a timeout for pending writes is reached (/proc/sys/vm/dirty_expire_centisecs). Unless another limit is being hit (/proc/sys/vm/dirty_ratio), more written data may be cached. Further writes will block.

In theory, this should create a background process writing out dirty pages without disturbing other processes. In practice, it does disturb any process doing uncached reading or synchronous writing. Badly. This is because the background flush actually writes at 100% device speed and any other device requests at this time will be delayed (because all queues and write-caches on the road are filled).

Is there a way to limit the amount of requests per second the flushing process performs, or otherwise effectively prioritize other device I/O?

Best Answer

After lots of benchmarking with sysbench, I come to this conclusion:

To survive (performance-wise) a situation where

an evil copy process floods dirty pages
and hardware write-cache is present (possibly also without that)
and synchronous reads or writes per second (IOPS) are critical

just dump all elevators, queues and dirty page caches. The correct place for dirty pages is in the RAM of that hardware write-cache.

Adjust dirty_ratio (or new dirty_bytes) as low as possible, but keep an eye on sequential throughput. In my particular case, 15 MB were optimum (echo 15000000 > dirty_bytes).

This is more a hack than a solution because gigabytes of RAM are now used for read caching only instead of dirty cache. For dirty cache to work out well in this situation, the Linux kernel background flusher would need to average at what speed the underlying device accepts requests and adjust background flushing accordingly. Not easy.

Specifications and benchmarks for comparison:

Tested while dd'ing zeros to disk, sysbench showed huge success, boosting 10 threads fsync writes at 16 kB from 33 to 700 IOPS (idle limit: 1500 IOPS) and single thread from 8 to 400 IOPS.

Without load, IOPS were unaffected (~1500) and throughput slightly reduced (from 251 MB/s to 216 MB/s).

dd call:

dd if=/dev/zero of=dumpfile bs=1024 count=20485672

for sysbench, the test_file.0 was prepared to be unsparse with:

dd if=/dev/zero of=test_file.0 bs=1024 count=10485672

sysbench call for 10 threads:

sysbench --test=fileio --file-num=1 --num-threads=10 --file-total-size=10G --file-fsync-all=on --file-test-mode=rndwr --max-time=30 --file-block-size=16384 --max-requests=0 run

sysbench call for one thread:

sysbench --test=fileio --file-num=1 --num-threads=1 --file-total-size=10G --file-fsync-all=on --file-test-mode=rndwr --max-time=30 --file-block-size=16384 --max-requests=0 run

Smaller block sizes showed even more drastic numbers.

--file-block-size=4096 with 1 GB dirty_bytes:

sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Extra file open flags: 0
1 files, 10Gb each
10Gb total file size
Block size 4Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Calling fsync() after each write operation.
Using synchronous I/O mode
Doing random write test
Threads started!
Time limit exceeded, exiting...
Done.

Operations performed:  0 Read, 30 Write, 30 Other = 60 Total
Read 0b  Written 120Kb  Total transferred 120Kb  (3.939Kb/sec)
      0.98 Requests/sec executed

Test execution summary:
      total time:                          30.4642s
      total number of events:              30
      total time taken by event execution: 30.4639
      per-request statistics:
           min:                                 94.36ms
           avg:                               1015.46ms
           max:                               1591.95ms
           approx.  95 percentile:            1591.30ms

Threads fairness:
      events (avg/stddev):           30.0000/0.00
      execution time (avg/stddev):   30.4639/0.00

--file-block-size=4096 with 15 MB dirty_bytes:

sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Extra file open flags: 0
1 files, 10Gb each
10Gb total file size
Block size 4Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Calling fsync() after each write operation.
Using synchronous I/O mode
Doing random write test
Threads started!
Time limit exceeded, exiting...
Done.

Operations performed:  0 Read, 13524 Write, 13524 Other = 27048 Total
Read 0b  Written 52.828Mb  Total transferred 52.828Mb  (1.7608Mb/sec)
    450.75 Requests/sec executed

Test execution summary:
      total time:                          30.0032s
      total number of events:              13524
      total time taken by event execution: 29.9921
      per-request statistics:
           min:                                  0.10ms
           avg:                                  2.22ms
           max:                                145.75ms
           approx.  95 percentile:              12.35ms

Threads fairness:
      events (avg/stddev):           13524.0000/0.00
      execution time (avg/stddev):   29.9921/0.00

--file-block-size=4096 with 15 MB dirty_bytes on idle system:

sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Extra file open flags: 0
1 files, 10Gb each
10Gb total file size
Block size 4Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Calling fsync() after each write operation.
Using synchronous I/O mode
Doing random write test
Threads started!
Time limit exceeded, exiting...
Done.

Operations performed:  0 Read, 43801 Write, 43801 Other = 87602 Total
Read 0b  Written 171.1Mb  Total transferred 171.1Mb  (5.7032Mb/sec)
 1460.02 Requests/sec executed

Test execution summary:
      total time:                          30.0004s
      total number of events:              43801
      total time taken by event execution: 29.9662
      per-request statistics:
           min:                                  0.10ms
           avg:                                  0.68ms
           max:                                275.50ms
           approx.  95 percentile:               3.28ms

Threads fairness:
      events (avg/stddev):           43801.0000/0.00
      execution time (avg/stddev):   29.9662/0.00

Test-System:

Adaptec 5405Z (that's 512 MB write-cache with protection)
Intel Xeon L5520
6 GiB RAM @ 1066 MHz
Motherboard Supermicro X8DTN (5520 chipset)
12 Seagate Barracuda 1 TB disks
- 10 in Linux software RAID 10
Kernel 2.6.32
Filesystem xfs
Debian unstable

In summary, I am now sure this configuration will perform well in idle, high load and even full load situations for database traffic that otherwise would have been starved by sequential traffic. Sequential throughput is higher than two gigabit links can deliver anyway, so no problem reducing it a bit.

Related Solutions

Linux Hard Disk Load – How to Monitor Hard Disk Load on Linux

You can get a pretty good measure of this using the iostat tool.

% iostat -dx /dev/sda 5

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.78    11.03    1.19    2.82    72.98   111.07    45.80     0.13   32.78   1.60   0.64

The disk utilisation is listed in the last column. This is defined as

Percentage of CPU time during which I/O requests were issued to the device (band-width utilization for the device). Device saturation occurs when this value is close to 100%.

Linux – How to run a server on port 80 as a normal user on Linux

Short answer: you can't. Ports below 1024 can be opened only by root. As per comment - well, you can, using CAP_NET_BIND_SERVICE, but that approach, applied to java bin will make any java program to be run with this setting, which is undesirable, if not a security risk.

The long answer: you can redirect connections on port 80 to some other port you can open as normal user.

Run as root:

# iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-port 8080

As loopback devices (like localhost) do not use the prerouting rules, if you need to use localhost, etc., add this rule as well (thanks @Francesco):

# iptables -t nat -I OUTPUT -p tcp -d 127.0.0.1 --dport 80 -j REDIRECT --to-ports 8080

NOTE: The above solution is not well suited for multi-user systems, as any user can open port 8080 (or any other high port you decide to use), thus intercepting the traffic. (Credits to CesarB).

EDIT: as per comment question - to delete the above rule:

# iptables -t nat --line-numbers -n -L

This will output something like:

Chain PREROUTING (policy ACCEPT)
num  target     prot opt source               destination         
1    REDIRECT   tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:8080 redir ports 8088
2    REDIRECT   tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:80 redir ports 8080

The rule you are interested in is nr. 2, so to delete it:

# iptables -t nat -D PREROUTING 2

Best Answer

Related Solutions

Linux Hard Disk Load – How to Monitor Hard Disk Load on Linux

Linux – How to run a server on port 80 as a normal user on Linux

Related Topic