Linux Hardware RAID Controller Tuning – Real-World Guide

Tags: hardware-raid, hp, hp-smart-array, performance-tuning, storage

Most of the Linux systems I manage feature hardware RAID controllers (mostly HP Smart Array). They're all running RHEL or CentOS.

I'm looking for real-world tunables to help optimize performance for setups that incorporate hardware RAID controllers with SAS disks (Smart Array, Perc, LSI, etc.) and battery-backed or flash-backed cache. Assume RAID 1+0 and multiple spindles (4+ disks).

I spend a considerable amount of time tuning Linux network settings for low-latency and financial trading applications. But many of those options are well-documented (changing send/receive buffers, modifying TCP window settings, etc.). What are engineers doing on the storage side?

Historically, I've made changes to the I/O scheduling elevator, recently opting for the deadline and noop schedulers to improve performance within my applications. As RHEL versions have progressed, I've also noticed that the compiled-in defaults for SCSI and CCISS block devices have changed. This has affected the recommended storage subsystem settings over time. However, it's been a while since I've seen any clear recommendations, and I know that the OS defaults aren't optimal. For example, the default read-ahead buffer of 128 KB seems extremely small for a deployment on server-class hardware.

The following articles explore the performance impact of changing read-ahead cache and nr_requests values on the block queues.

http://zackreed.me/articles/54-hp-smart-array-p410-controller-tuning

http://www.overclock.net/t/515068/tuning-a-hp-smart-array-p400-with-linux-why-tuning-really-matters

http://yoshinorimatsunobu.blogspot.com/2009/04/linux-io-scheduler-queue-size-and.html

For example, these are suggested changes for an HP Smart Array RAID controller:

echo "noop" > /sys/block/cciss\!c0d0/queue/scheduler 
blockdev --setra 65536 /dev/cciss/c0d0
echo 512 > /sys/block/cciss\!c0d0/queue/nr_requests
echo 2048 > /sys/block/cciss\!c0d0/queue/read_ahead_kb
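
One detail worth keeping in mind with the commands above: blockdev --setra takes its value in 512-byte sectors, while read_ahead_kb is in kilobytes, and as far as I know both write the same underlying read-ahead setting, so the later command wins. A minimal sketch of the unit conversion (the helper names sectors_to_kb and kb_to_sectors are made up for illustration):

```shell
#!/bin/sh
# blockdev --setra counts 512-byte sectors; read_ahead_kb counts kilobytes.
# Two sectors make one kilobyte.
sectors_to_kb() { echo $(( $1 / 2 )); }
kb_to_sectors() { echo $(( $1 * 2 )); }

sectors_to_kb 65536   # prints 32768, i.e. --setra 65536 asks for 32 MB
kb_to_sectors 2048    # prints 4096, i.e. read_ahead_kb 2048 equals --setra 4096
```

Run in the order shown in the question, the final read-ahead would therefore be 2048 KB rather than 32 MB.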

What else can be reliably tuned to improve storage performance?

I'm specifically looking for sysctl and sysfs options in production scenarios.

Best Answer

When I've had to tune for lower latency rather than throughput, I've turned nr_requests down from its default (to as low as 32). The idea is that smaller batches mean lower latency.

For read_ahead_kb, I've found that increasing the value gives better throughput for sequential reads/writes, but the right setting really depends on your workload and I/O pattern. For example, on a database system I recently tuned, I changed this value to match a single DB page size, which helped reduce read latency. Going above or below that value hurt performance in my case.
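
As a rough sketch of how these two settings can be read and changed, assuming a device named sda and a 16 KB database page (both are placeholders, not values from my systems). The SYSFS_ROOT variable and the queue_get/queue_set helpers are hypothetical names, added only so the functions can be tried against a scratch directory before touching /sys:

```shell
#!/bin/sh
# SYSFS_ROOT is a hypothetical override; it defaults to the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# queue_get <dev> <tunable>: print the current value of a queue tunable.
queue_get() {
    cat "$SYSFS_ROOT/block/$1/queue/$2"
}

# queue_set <dev> <tunable> <value>: write a new value (needs root on /sys).
queue_set() {
    f="$SYSFS_ROOT/block/$1/queue/$2"
    [ -w "$f" ] || { echo "cannot write $f" >&2; return 1; }
    echo "$3" > "$f"
}

# Example, run as root (values are assumptions, benchmark before keeping them):
#   queue_set sda nr_requests 32      # smaller batches, aimed at latency
#   queue_set sda read_ahead_kb 16    # match an assumed 16 KB DB page
#   queue_get sda nr_requests
```

For cciss devices the name contains a "!", for example queue_get 'cciss!c0d0' nr_requests. Values written this way do not survive a reboot.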

As for other options or settings for block device queues:

max_sectors_kb = I've set this value to match what the hardware allows for a single transfer (check the value of the max_hw_sectors_kb (RO) file in sysfs to see what's allowed)

nomerges = this lets you disable or adjust lookup logic for merging io requests. (turning this off can save you some cpu cycles, but I haven't seen any benefit when changing this for my systems, so I left it default)

rq_affinity = I haven't tried this yet, but here is the explanation behind it from the kernel docs

If this option is '1', the block layer will migrate request completions to the cpu "group" that originally submitted the request. For some workloads this provides a significant reduction in CPU cycles due to caching effects.
For storage configurations that need to maximize distribution of completion processing setting this option to '2' forces the completion to run on the requesting cpu (bypassing the "group" aggregation logic).

scheduler = you said that you tried deadline and noop. I've tested both, and deadline wins out in the testing I've done most recently for a database server.

NOOP performed well, but for our database server I was still able to achieve better performance adjusting the deadline scheduler.
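
A minimal sketch of the max_sectors_kb and scheduler checks described above. The function names and the SYSFS_ROOT override are hypothetical, introduced only for illustration and so the logic can be tried outside /sys; sda is a placeholder device name:

```shell
#!/bin/sh
# SYSFS_ROOT is a hypothetical override; it defaults to the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# raise_max_sectors <dev>: set max_sectors_kb to the read-only hardware limit.
raise_max_sectors() {
    q="$SYSFS_ROOT/block/$1/queue"
    hw=$(cat "$q/max_hw_sectors_kb") || return 1
    echo "$hw" > "$q/max_sectors_kb" || return 1
    echo "$1: max_sectors_kb=$(cat "$q/max_sectors_kb") (hardware limit $hw)"
}

# active_scheduler <dev>: the scheduler file lists every available elevator
# and marks the active one in square brackets, e.g. "noop [deadline] cfq".
active_scheduler() {
    sed 's/.*\[\(.*\)\].*/\1/' "$SYSFS_ROOT/block/$1/queue/scheduler"
}

# Example, run as root:
#   raise_max_sectors sda
#   echo deadline > /sys/block/sda/queue/scheduler
#   active_scheduler sda
```

The same pattern works for nomerges and rq_affinity, which are plain integer files in the same queue directory.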

Options for deadline scheduler located under /sys/block/{sd,cciss,dm-}*/queue/iosched/ :

fifo_batch = kind of like nr_requests, but specific to the scheduler. Rule of thumb is tune this down for lower latency or up for throughput. Controls the batch size of read and write requests.

write_expire = sets the expire time for write batches; the default is 5000ms. Once again, decreasing this value decreases your write latency, while increasing it increases throughput.

read_expire = sets the expire time for read batches; the default is 500ms. The same rules apply here.

front_merges = I tend to turn this off, and it's on by default. I don't see the need for the scheduler to waste cpu cycles trying to front merge IO requests.

writes_starved = since deadline is geared toward reads the default here is to process 2 read batches before a write batch is processed. I found the default of 2 to be good for my workload.
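
To make the list above concrete, here is a small sketch that prints and sets the deadline tunables for one device. The helper names, the SYSFS_ROOT override, and the example values are all assumptions for illustration; they are not the settings from my database server:

```shell
#!/bin/sh
# SYSFS_ROOT is a hypothetical override; it defaults to the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# iosched_show <dev>: print every scheduler tunable as name=value.
iosched_show() {
    for f in "$SYSFS_ROOT/block/$1/queue/iosched"/*; do
        if [ -f "$f" ]; then
            echo "$(basename "$f")=$(cat "$f")"
        fi
    done
}

# iosched_apply <dev> name=value ...: write each tunable (needs root on /sys).
iosched_apply() {
    dev=$1
    shift
    for kv in "$@"; do
        # ${kv%%=*} is the tunable name, ${kv#*=} is the value.
        echo "${kv#*=}" > "$SYSFS_ROOT/block/$dev/queue/iosched/${kv%%=*}" || return 1
    done
}

# Example, run as root with deadline already selected (values are placeholders):
#   iosched_apply sda fifo_batch=8 read_expire=250 write_expire=2500 front_merges=0
#   iosched_show sda
```

The iosched directory only exposes these files while deadline is the active scheduler for that device, so select the scheduler first.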

Related Solutions

HP SmartArray – Differences Between cciss and hpsa Linux Drivers

HP has a good write-up of what the differences are here:

http://h20000.www2.hp.com/bc/docs/support/SupportManual/c02677069/c02677069.pdf (PDF)

High points:

Puts devices in the standard /dev namespace, which you already noticed.
Modernized interaction with the SCSI layer in newer kernels.
hpsa is a SCSI driver; cciss is a block driver.
- This will change device numbering, if that matters.
- This is why the /dev namespace changes
The /sys controls for the driver will change.
Older cards (before the P400-era cards) still require CCISS
It may be the case that cciss and hpsa will both load if cards requiring them are present.

The Windows side is untouched.

Slow DL360 Smart Array P400i RAID

This is several questions in one, so I'll try to address a few of them.

I typically set Smart Array controllers to leverage a higher write cache ratio. I prefer to have 75% write cache because the OS (using the XFS filesystem) caches aggressively. XFS will make a difference, but what are you tuning for? Are you tuning to simply achieve specific numbers, or is there an application requirement driving this?

ext3 isn't the fastest filesystem out there. But you have some mount options (e.g. noatime) and journal settings you could tweak.
I don't use LVM, especially with HP controllers that can provide many of the same benefits.
You have I/O scheduler and elevator settings (e.g. noop or deadline, in this case) that can be tuned, but that's a function of your application's actual needs.

If you do go with XFS, try a basic config then try some advanced configuration settings. Over time, I've ended up with mount parameters very similar to the one in the original link.

I just ran the following iozone command line on an XFS partition contained within a DL380 G5 with P400i, 12GB RAM and 8 x 146GB 10k drives. The elevator is set to deadline:

Command line used: iozone -t1 -i0 -i1 -i2 -r1m -s24g

initial writers  =  348957.75 KB/sec
rewriters        =  335130.03 KB/sec
readers          =  132851.70 KB/sec
re-readers       =  137116.27 KB/sec
random readers   =   35774.41 KB/sec
random writers   =  250618.38 KB/sec
