I've found that when I've had to tune for lower latency at the expense of throughput, I've tuned nr_requests down from its default (to as low as 32). The idea being that smaller batches mean lower latency.
As for read_ahead_kb, I've found that increasing this value offers better throughput for sequential reads/writes, but this option really depends on your workload and IO pattern. For example, on a database system I recently tuned, I changed this value to match a single db page size, which helped reduce read latency. Increasing or decreasing it beyond that value hurt performance in my case.
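As a minimal sketch of how I apply those two tunables through sysfs (assuming /dev/sda and an 8kb database page size - substitute your own device and page size):

    # show the current values
    cat /sys/block/sda/queue/nr_requests
    cat /sys/block/sda/queue/read_ahead_kb
    # smaller batches for lower latency
    echo 32 > /sys/block/sda/queue/nr_requests
    # match readahead to a single 8kb db page
    echo 8 > /sys/block/sda/queue/read_ahead_kb

Keep in mind these settings don't survive a reboot, so persist them via rc.local, a udev rule, or your distro's tuning mechanism.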
As for other options or settings for block device queues:
max_sectors_kb = I've set this value to match what the hardware allows for a single transfer (check the value of the read-only max_hw_sectors_kb file in sysfs to see what's allowed; a shell sketch of these settings follows this list).
nomerges = this lets you disable or restrict the lookup logic for merging IO requests. (Disabling merges can save you some CPU cycles, but I haven't seen any benefit from changing this on my systems, so I left it at the default.)
rq_affinity = I haven't tried this yet, but here is the explanation from the kernel docs:

    If this option is '1', the block layer will migrate request completions to the
    cpu "group" that originally submitted the request. For some workloads this
    provides a significant reduction in CPU cycles due to caching effects.
    For storage configurations that need to maximize distribution of completion
    processing setting this option to '2' forces the completion to run on the
    requesting cpu (bypassing the "group" aggregation logic).
scheduler = you said that you tried deadline and noop. I've tested both, and deadline wins out for the testing I've done most recently on a database server. NOOP performed well, but I was still able to achieve better performance by adjusting the deadline scheduler's tunables.
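To tie those four together, here's a rough shell sketch of how I'd apply them (again assuming /dev/sda; the values are illustrations, not recommendations):

    # allow single transfers as large as the hardware supports
    cat /sys/block/sda/queue/max_hw_sectors_kb > /sys/block/sda/queue/max_sectors_kb
    # 0 = merge lookups enabled (the default); 1 or 2 restrict/disable them
    echo 0 > /sys/block/sda/queue/nomerges
    # 1 = complete requests on the submitting cpu group (see the kernel docs quoted above)
    echo 1 > /sys/block/sda/queue/rq_affinity
    # switch the IO scheduler
    echo deadline > /sys/block/sda/queue/scheduler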
Options for the deadline scheduler are located under /sys/block/{sd,cciss,dm-}*/queue/iosched/ (a shell sketch of setting them follows the list):
fifo_batch = kind of like nr_requests, but specific to the scheduler; it controls the batch size of read and write requests. The rule of thumb is to tune this down for lower latency or up for throughput.
write_expire = sets the expire time for write batches; the default is 5000ms. Once again, decreasing this value lowers your write latency, while increasing it favors throughput.
read_expire = sets the expire time for read batches; the default is 500ms. The same rules apply here.
front_merges = on by default, but I tend to turn this off. I don't see the need for the scheduler to waste CPU cycles trying to front-merge IO requests.
writes_starved = since deadline is geared toward reads, the default is to process 2 read batches before a write batch is processed. I found the default of 2 to be good for my workload.
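And a matching sketch for the deadline tunables (same caveats: /dev/sda assumed, and the values are just the defaults and tweaks discussed above):

    cd /sys/block/sda/queue/iosched
    echo 16 > fifo_batch      # batch size; lower = latency, higher = throughput (default 16)
    echo 500 > read_expire    # read deadline in ms (default 500)
    echo 5000 > write_expire  # write deadline in ms (default 5000)
    echo 0 > front_merges     # disable front-merge attempts (default 1)
    echo 2 > writes_starved   # read batches served per write batch (default 2)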
See, here you err. You ask: "let's say both are 64kb, what happens when I run a write operation that's only a few bytes on a section of the drive that's full? (perhaps a large SQL Server database)"
This is not possible.
Besides the fact that NTFS actually has clusters of 4kb or more (and it is STRONGLY advised to set that to 64kb for SQL Server), SQL Server manages 8kb pages and always reads/writes 8 pages as an extent - 64kb.
http://msdn.microsoft.com/en-us/library/ms190969(v=sql.105).aspx
As a result, for SQL Server there is no such thing as writing a few bytes - it will write out 64kb.
As such, for SQL Server it is recommended to use a 64kb NTFS cluster size (so the extents do not cause split IO) and, obviously, a RAID stripe size that is a multiple of 64kb (as the Enterprise edition of SQL Server loves reading ahead extents).
For software other than SQL Server things are similar: it depends on the intelligence of the programmer and the access patterns of the specific software.
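If you want to check or set the cluster size on a Windows box, here's a quick sketch (the drive letter D: is just an example, and format of course wipes the volume):

    rem look for "Bytes Per Cluster" in the output
    fsutil fsinfo ntfsinfo D:
    rem reformat a dedicated SQL Server data volume with a 64kb allocation unit
    format D: /FS:NTFS /A:64K /Q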
Best Answer
If I interpret "Cache-Mode: Off" correctly, it's completely understandable that write performance sucks. Check whether copying/reading from the RAID (to network or NUL) is a problem, or only copying/writing to the RAID - if my guess is correct, only writing to the RAID is a pain.
RAID5 uses distributed parity - each stripe consists of (in your case) three segments: data1, data2 and parity12. Now, when some data is written to the array it can't just be written to a data segment, because the parity wouldn't match any more.
If data1 is written to/changed, the controller needs to either:
- read data2, recalculate parity12 from the new data1 and data2, then write out data1 and parity12, or
- read the old data1 and the old parity12, compute the new parity12 from the difference, then write out data1 and parity12.
So, whenever there's a change, the controller's operations are amplified by three: one logical write becomes one read plus two writes. If these cannot be cached, each write results in three physical operations your application needs to wait for. With cache, many read and write operations can be omitted and the performance hit is much smaller.
The only exception to this write amplification is when you write a whole stripe at once: just take data1 and data2 from the buffer, calculate parity12 and write all three segments. That's an amplification of just 1.5 (three physical writes for two logical ones). However, to be able to combine all incoming data into full stripes, you need to be able to queue the data. Guess what - you need cache again.
In a nutshell: if you use RAID5 or RAID6 you absolutely require cache - it's not a luxury. Too little or even no cache at all will kill your performance. If it's a software or hosted RAID with configurable cache, set aside at least 512 MB, better 1 or 2 GB and it'll "fly". RAID5 with three drives will be no performance wonder but it can work fine.
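As one hedged, concrete example: if this ends up as Linux md software RAID (Intel RST host RAID is typically driven by mdadm under Linux), the RAID5 stripe cache is tunable via sysfs - assuming the array is md0:

    # stripe_cache_size is in 4 KiB pages per device (raid5/raid6 arrays only);
    # memory used = value x 4 KiB x number of drives,
    # e.g. 32768 x 4 KiB x 3 drives = 384 MiB
    echo 32768 > /sys/block/md0/md/stripe_cache_size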
Edit: the HP ML10 G9 has a chipset-integrated Intel RST SATA RAID controller - host RAID. Depending on which exact model and controller is used, cache should be configurable somewhere.