I've found that when I've had to tune for lower latency rather than throughput, I've tuned nr_requests down from its default (to as low as 32), the idea being that smaller batches mean lower latency.
As for read_ahead_kb, increasing it gives better throughput for sequential reads/writes, but the right value really depends on your workload and IO pattern. For example, on a database system I recently tuned I set it to match a single db page size, which helped reduce read latency; moving it above or below that value hurt performance in my case.
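Both of these live under the device's queue directory in sysfs. A rough sketch of how you'd set them (sda and the 16KB page size are just placeholders, substitute your own device and db page size):

    # lower nr_requests for latency-sensitive workloads (the default is usually 128)
    echo 32 > /sys/block/sda/queue/nr_requests
    # read_ahead_kb is in KB; 16 here would match a hypothetical 16KB db page
    echo 16 > /sys/block/sda/queue/read_ahead_kb

Keep in mind these settings don't survive a reboot, so you'd normally re-apply them from a udev rule or an init script once you've settled on values.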
As for other options or settings for block device queues (there's a combined sysfs sketch after the deadline options below):
max_sectors_kb = I've set this value to match what the hardware allows for a single transfer (check the value of the max_hw_sectors_kb (RO) file in sysfs to see what's allowed)
nomerges = this lets you disable or adjust the lookup logic used to merge IO requests (disabling merge lookups can save some CPU cycles, but I haven't seen any benefit from changing it on my systems, so I left it at the default)
rq_affinity = I haven't tried this yet, but here is the explanation behind it from the kernel docs:

    If this option is '1', the block layer will migrate request completions to the
    cpu "group" that originally submitted the request. For some workloads this
    provides a significant reduction in CPU cycles due to caching effects.
    For storage configurations that need to maximize distribution of completion
    processing setting this option to '2' forces the completion to run on the
    requesting cpu (bypassing the "group" aggregation logic).
scheduler = you said that you tried deadline and noop. I've tested both, and deadline wins out in the testing I've done most recently for a database server. NOOP performed well, but I was still able to get better performance out of our database server by adjusting the deadline scheduler.
Options for the deadline scheduler are located under /sys/block/{sd,cciss,dm-}*/queue/iosched/:
fifo_batch = controls the batch size of read and write requests; kind of like nr_requests, but specific to the scheduler. The rule of thumb is to tune it down for lower latency or up for throughput.
write_expire = sets the expire time for write batches; the default is 5000ms. Once again, decreasing this value lowers write latency, while increasing it improves throughput.
read_expire = sets the expire time for read batches; the default is 500ms. The same rules apply here.
front_merges = I tend to turn this off, and it's on by default. I don't see the need for the scheduler to waste cpu cycles trying to front merge IO requests.
writes_starved = since deadline is geared toward reads, the default is to process 2 read batches before a write batch is processed. I found the default of 2 to be good for my workload.
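To tie the list above together, here's a rough sketch of how I'd apply these settings from a shell (sda is a placeholder, and the values are just the examples and defaults discussed above, not universal recommendations):

    # let single transfers be as large as the hardware allows
    cat /sys/block/sda/queue/max_hw_sectors_kb > /sys/block/sda/queue/max_sectors_kb
    # nomerges: 1 disables the more expensive merge lookups (I left it at the default of 0)
    #echo 1 > /sys/block/sda/queue/nomerges
    # rq_affinity: untested by me, see the kernel doc excerpt above
    #echo 2 > /sys/block/sda/queue/rq_affinity
    # pick the deadline scheduler and tune it
    echo deadline > /sys/block/sda/queue/scheduler
    echo 16 > /sys/block/sda/queue/iosched/fifo_batch        # default 16; lower for latency, raise for throughput
    echo 500 > /sys/block/sda/queue/iosched/read_expire      # default 500ms
    echo 5000 > /sys/block/sda/queue/iosched/write_expire    # default 5000ms
    echo 0 > /sys/block/sda/queue/iosched/front_merges       # I turn front merging off
    echo 2 > /sys/block/sda/queue/iosched/writes_starved     # default: 2 read batches per write batch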
If you use a starting sector of 2048 (512-byte sectors), then your partition will start 1MB into the drive. This is the default used by most newer installers, and it divides evenly by 64k and most other common chunk/block sizes.
If you are partitioning with fdisk, pass the -u flag so it reports values in 512-byte sectors instead of cylinders.
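For example (sdb is a placeholder; newer fdisk versions already default to sector units and a 2048-sector start):

    # show the partition table in 512-byte sectors rather than cylinders
    fdisk -lu /dev/sdb
    # create partitions interactively with sector units; enter 2048 as the first sector
    fdisk -u /dev/sdb

A start of 2048 sectors * 512 bytes = 1MB, which is a multiple of 64KB, 256KB, and the other common stripe/chunk sizes.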
Since you are using ext*, you can use this calculator to determine the stride and stripe-width for the filesystem. It shows that you would want to create your filesystem with these options: mkfs.ext3 -b 4096 -E stride=16,stripe-width=48. You might also want to try creating the filesystem without passing any options and seeing what mkfs detects and uses (check with tune2fs -l /dev/sdnn); these days it does a pretty good job of detecting the stride/stripe-width automatically.
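For reference, the arithmetic behind those numbers (they imply a 64KB RAID chunk and three data disks, e.g. a 4-disk RAID5, so adjust for your actual array): stride = chunk size / block size = 64KB / 4KB = 16, and stripe-width = stride * number of data disks = 16 * 3 = 48. A sketch of the whole sequence, with /dev/sdb1 as a placeholder:

    # hypothetical geometry: 4KB blocks, 64KB chunk, 3 data disks
    mkfs.ext3 -b 4096 -E stride=16,stripe-width=48 /dev/sdb1
    # confirm what the filesystem actually recorded
    tune2fs -l /dev/sdb1 | grep -i raid

tune2fs reports these as "RAID stride" and "RAID stripe width", so the grep is a quick way to compare against what you expected.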
Your filesystem block size should divide your RAID stripe size evenly, so 4K filesystem blocks with a 256K stripe size are fine (256K / 4K = 64 blocks per stripe).
For most efficient operation, especially with striped RAID setups like RAID5, you need to tell the filesystem what the controller's stripe size is so it can align its structures in a way that avoids them overlapping the edge of RAID stripes: for XFS this is done by giving the sunit= or su= options when you create the filesystem (see man mkfs.xfs for more details and options).

You also need to make sure that the filesystem itself starts on a block at the start of a RAID stripe, otherwise the filesystem's attempts to align things efficiently will be thwarted. Because the RAID volume presented to the BIOS will have the boot sector at the start, this will mean starting the first partition 256KB into the array (unless you are creating the filesystem on the raw device, i.e. on sdb rather than partitioning the array into sdb1 and so forth, in which case the filesystem will start at block 0 - but you can't do this if you plan to boot off the array).
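As a rough sketch of what that looks like for the 256KB stripe discussed here, assuming three data disks (that count is an assumption for illustration, as is the sdb1 device name):

    # su = stripe unit in bytes, sw = number of data disks in the stripe
    # (the sunit=/swidth= variants take 512-byte sectors instead - see man mkfs.xfs)
    mkfs.xfs -d su=256k,sw=3 /dev/sdb1
    # mkfs prints the geometry it used; after mounting, xfs_info <mountpoint> shows it too

If you do partition the array first, make sure the partition's starting offset is a multiple of the 256KB stripe size - the common 2048-sector (1MB) start works, since 1MB is a multiple of 256KB.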