After a lot of benchmarking with sysbench, I came to this conclusion:
To survive (performance-wise) a situation where
- an evil copy process floods dirty pages
- and a hardware write-cache is present (possibly also without one)
- and synchronous reads or writes per second (IOPS) are critical
just dump all elevators, queues and dirty page caches. The correct place for dirty pages is in the RAM of that hardware write-cache.
Adjust dirty_ratio (or the newer dirty_bytes) as low as possible, but keep an eye on sequential throughput. In my particular case, 15 MB was the optimum (echo 15000000 > dirty_bytes).
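Roughly, that boils down to something like the following sketch; the device name (sda) is just a placeholder, and the values are the ones from my case:
# sketch only: adjust the device name and values to your setup
echo noop > /sys/block/sda/queue/scheduler    # drop the I/O elevator for this device
echo 15000000 > /proc/sys/vm/dirty_bytes      # cap the dirty cache at ~15 MB; setting dirty_bytes overrides dirty_ratio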
This is more of a hack than a solution, because gigabytes of RAM are now used only for read caching instead of for dirty cache. For the dirty cache to work out well in this situation, the Linux kernel background flusher would need to average the speed at which the underlying device accepts requests and adjust background flushing accordingly. Not easy.
Specifications and benchmarks for comparison:
Tested while dd'ing zeros to disk, sysbench showed a huge improvement, boosting 10-thread fsync writes at 16 kB from 33 to 700 IOPS (idle limit: 1500 IOPS) and single-thread writes from 8 to 400 IOPS.
Without load, IOPS were unaffected (~1500) and throughput was slightly reduced (from 251 MB/s to 216 MB/s).
The dd call:
dd if=/dev/zero of=dumpfile bs=1024 count=20485672
For sysbench, test_file.0 was prepared to be non-sparse with:
dd if=/dev/zero of=test_file.0 bs=1024 count=10485672
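For completeness: the file can normally also be created with sysbench's own prepare stage (a sketch, assuming I remember the 0.4 syntax correctly); the dd above was used instead to make sure the file is fully allocated rather than sparse.
# assumed sysbench 0.4 syntax for the prepare stage
sysbench --test=fileio --file-num=1 --file-total-size=10G prepare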
sysbench call for 10 threads:
sysbench --test=fileio --file-num=1 --num-threads=10 --file-total-size=10G --file-fsync-all=on --file-test-mode=rndwr --max-time=30 --file-block-size=16384 --max-requests=0 run
sysbench call for one thread:
sysbench --test=fileio --file-num=1 --num-threads=1 --file-total-size=10G --file-fsync-all=on --file-test-mode=rndwr --max-time=30 --file-block-size=16384 --max-requests=0 run
Smaller block sizes showed even more drastic numbers.
--file-block-size=4096 with 1 GB dirty_bytes:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Extra file open flags: 0
1 files, 10Gb each
10Gb total file size
Block size 4Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Calling fsync() after each write operation.
Using synchronous I/O mode
Doing random write test
Threads started!
Time limit exceeded, exiting...
Done.
Operations performed: 0 Read, 30 Write, 30 Other = 60 Total
Read 0b Written 120Kb Total transferred 120Kb (3.939Kb/sec)
0.98 Requests/sec executed
Test execution summary:
total time: 30.4642s
total number of events: 30
total time taken by event execution: 30.4639
per-request statistics:
min: 94.36ms
avg: 1015.46ms
max: 1591.95ms
approx. 95 percentile: 1591.30ms
Threads fairness:
events (avg/stddev): 30.0000/0.00
execution time (avg/stddev): 30.4639/0.00
--file-block-size=4096 with 15 MB dirty_bytes:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Extra file open flags: 0
1 files, 10Gb each
10Gb total file size
Block size 4Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Calling fsync() after each write operation.
Using synchronous I/O mode
Doing random write test
Threads started!
Time limit exceeded, exiting...
Done.
Operations performed: 0 Read, 13524 Write, 13524 Other = 27048 Total
Read 0b Written 52.828Mb Total transferred 52.828Mb (1.7608Mb/sec)
450.75 Requests/sec executed
Test execution summary:
total time: 30.0032s
total number of events: 13524
total time taken by event execution: 29.9921
per-request statistics:
min: 0.10ms
avg: 2.22ms
max: 145.75ms
approx. 95 percentile: 12.35ms
Threads fairness:
events (avg/stddev): 13524.0000/0.00
execution time (avg/stddev): 29.9921/0.00
--file-block-size=4096 with 15 MB dirty_bytes on idle system:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Extra file open flags: 0
1 files, 10Gb each
10Gb total file size
Block size 4Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Calling fsync() after each write operation.
Using synchronous I/O mode
Doing random write test
Threads started!
Time limit exceeded, exiting...
Done.
Operations performed: 0 Read, 43801 Write, 43801 Other = 87602 Total
Read 0b Written 171.1Mb Total transferred 171.1Mb (5.7032Mb/sec)
1460.02 Requests/sec executed
Test execution summary:
total time: 30.0004s
total number of events: 43801
total time taken by event execution: 29.9662
per-request statistics:
min: 0.10ms
avg: 0.68ms
max: 275.50ms
approx. 95 percentile: 3.28ms
Threads fairness:
events (avg/stddev): 43801.0000/0.00
execution time (avg/stddev): 29.9662/0.00
Test system:
- Adaptec 5405Z (that's 512 MB write-cache with protection)
- Intel Xeon L5520
- 6 GiB RAM @ 1066 MHz
- Motherboard Supermicro X8DTN (5520 chipset)
- 12 Seagate Barracuda 1 TB disks
- 10 in Linux software RAID 10
- Kernel 2.6.32
- Filesystem xfs
- Debian unstable
In summary, I am now sure this configuration will perform well in idle, high-load and even full-load situations for database traffic that would otherwise have been starved by sequential traffic. Sequential throughput is higher than two gigabit links can deliver anyway, so reducing it a bit is no problem.
Best Answer
I am answering with respect to the linux tag; my answer is specific to Linux only.
Yes, huge pages are more prone to fragmentation. There are two views of memory, the one your process gets (virtual) and the one the kernel manages (real). The larger any page, the more difficult it's going to be to group (and keep it with) its neighbors, especially when your service is running on a system that also has to support others that by default allocate and write to way more memory than they actually end up using.
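A quick way to see how fragmented physical memory already is, is /proc/buddyinfo (an inspection aid only, not a fix):
cat /proc/buddyinfo
# one row per memory zone; column n is the number of free blocks of 2^n contiguous pages,
# so empty high-order columns mean there is little contiguous memory left for large pages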
The kernel's mapping of (real) granted addresses is private. There's a very good reason why userspace sees them as the kernel presents them, because the kernel needs to be able to overcommit without confusing userspace. Your process gets a nice, contiguous "Disneyfied" address space in which to work, oblivious of what the kernel is actually doing with that memory behind the scenes.
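The overcommit policy itself is visible (and tunable) from userspace, for example:
cat /proc/sys/vm/overcommit_memory   # 0 = heuristic overcommit (default), 1 = always overcommit, 2 = strict accounting
cat /proc/sys/vm/overcommit_ratio    # percent of RAM counted towards the commit limit in mode 2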
The reason you see degraded performance on long-running servers is most likely because allocated blocks that have not been explicitly locked (e.g. mlock()/mlockall() or posix_madvise()) and not modified in a while have been paged out, which means your service skids to disk when it has to read them. Modifying this behavior makes your process a bad neighbor, which is why many people put their RDBMS on a completely different server than web/php/python/ruby/whatever. The only way to fix that, sanely, is to reduce the competition for contiguous blocks.
Fragmentation is only really noticeable (in most cases) when page A is in memory and page B has moved to swap. Naturally, re-starting your service would seem to 'cure' this, but only because the kernel has not yet had an opportunity to page out the process' (now) newly allocated blocks within the confines of its overcommit ratio.
In fact, re-starting (let's say) 'apache' under a high load is likely going to send blocks owned by other services straight to disk. So yes, 'apache' would improve for a short time, but 'mysql' might suffer... at least until the kernel makes them suffer equally when there is simply a lack of ample physical memory.
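On reasonably recent kernels you can check how much of a given process has already been pushed to swap (mysqld here is only an example process name):
grep -E 'VmRSS|VmSwap' /proc/$(pidof mysqld)/status
# VmRSS = resident in RAM, VmSwap = paged out to swap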
Add more memory, or split up demanding malloc() consumers :) It's not just fragmentation that you need to be looking at. Try vmstat to get an overview of what's actually being stored where.
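For example (the interval and sample count are arbitrary):
vmstat 1 5
# watch the si/so columns: sustained swap-in/swap-out (KiB/s) means processes are actively
# being paged, no matter how much memory appears to be free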