Linux – Tuning Linux disk caching behaviour for maximum throughput

cache, ftp, linux, storage

I'm running into a maximum-throughput issue here and need some advice on which way to tune my knobs. We're running a 10 Gbit fileserver for backup distribution. It's a two-disk SATA2 setup on an LSI MegaRAID controller, and the server also has 24 GB of memory.

We have a need to mirror our last uploaded backup with maximum throughput.

The RAID0 for our "hot" backups gives us around 260 MB/sec write and 275 MB/sec read. A 20 GB tmpfs we tested gives us around 1 GB/sec. That is the kind of throughput we need.

Now how can I tune the virtual memory subsystem of Linux to cache the last uploaded files for as long as possible in memory without writing them out to disk (or even better: writing to disk AND keeping them in memory)?

I set up the following sysctls, but they don't give us the throughput we expect:

# VM pressure fixes
vm.swappiness = 20
vm.dirty_ratio = 70
vm.dirty_background_ratio = 30
vm.dirty_writeback_centisecs = 60000

In theory this should give us 16 GB for caching I/O and wait several minutes before writing to disk. Still, when I benchmark the server, I see no effect on writes; the throughput doesn't increase.
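For reference, whether the settings actually took effect, and whether dirty pages really accumulate during a benchmark, can be checked through procfs (standard Linux paths):

```shell
# Confirm the values the kernel is actually using
cat /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_background_ratio

# While the benchmark runs, watch dirty pages accumulate; if "Dirty:"
# stays near zero, writes are being flushed immediately (or are synchronous)
grep -E '^(Dirty|Writeback):' /proc/meminfo
```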

Help or advice needed.

Best Answer

By the look of the variables you've set, it seems you are mostly concerned with write performance and do not care about possible data loss due to power outages.

You will only ever get lazy writes and a writeback cache with asynchronous write operations. Synchronous writes require a commit to disk and will never be lazy-written. Your filesystem might be causing frequent page flushes and synchronous writes (typically due to journalling, especially with ext3 in data=journal mode). Additionally, even "background" page flushes will interfere with uncached reads and synchronous writes, slowing them down.

In general, you should collect some metrics to see what is happening: is your copy process put into the "D" state, waiting for I/O work to be done by pdflush? Do you see heavy synchronous write activity on your disks?
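Both checks can be done with standard tools (a diagnostic sketch; note that on modern kernels the flusher shows up as kworker/flush threads rather than pdflush):

```shell
# Any process in uninterruptible sleep ("D") is blocked waiting on I/O
ps -eo pid,stat,comm | awk '$2 ~ /^D/'

# Raw per-device I/O counters; sample twice and diff the
# "sectors written" column to see ongoing write activity
cat /proc/diskstats
```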

If all else fails, you might choose to set up an explicit tmpfs filesystem where you copy your backups to, and just synchronize the data with your disks after the fact, even automatically using inotify.
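A minimal sketch of that setup (mount point, size, and destination path are illustrative; the mount requires root, and the rsync would typically be triggered by a timer or an inotify watch):

```shell
# RAM-backed staging area for incoming backups, sized to hold one backup set
mount -t tmpfs -o size=20g tmpfs /mnt/backup-staging

# After an upload completes, drain the staged files to the RAID at leisure
rsync -a --remove-source-files /mnt/backup-staging/ /data/backups/
```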

For read caching, things are significantly simpler: the fcoretools fadvise utility has the --willneed parameter to advise the kernel to load a file's contents into the buffer cache.
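If that utility is not packaged for your distribution, the same POSIX_FADV_WILLNEED advice can be issued from Python's standard library (the path below is illustrative):

```shell
python3 - <<'EOF'
import os
# Ask the kernel to read the whole file into the page cache ahead of time
fd = os.open("/data/backups/latest.tar", os.O_RDONLY)  # illustrative path
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)     # length 0 = to EOF
os.close(fd)
EOF
```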

Edit:

vm.dirty_ratio = 70

This should in theory give us 16GB for caching I/O and wait some minutes until its writing to disk.

This would not have greatly influenced your testing scenario, but there is a misconception in your understanding. The dirty_ratio parameter is not a percentage of your system's total memory but of the memory available to the page cache (free plus reclaimable pages).
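A quick back-of-the-envelope comparison (the 20 GB "available" figure is illustrative; it is whatever free plus reclaimable memory the kernel sees at the time):

```shell
total_gb=24
available_gb=20   # illustrative: free + reclaimable pages, not total RAM
echo "naive expectation (70% of total):     $((total_gb * 70 / 100)) GB"
echo "closer to reality (70% of available): $((available_gb * 70 / 100)) GB"
```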

There is an article about Tuning for Write-Heavy loads with more in-depth information.