I've found that when I've had to tune for lower latency rather than throughput, I've tuned nr_requests down from its default (to as low as 32), the idea being that smaller batches mean lower latency.
Also, for read_ahead_kb I've found that increasing this value offers better throughput for sequential reads/writes, but this option really depends on your workload and I/O pattern. For example, on a database system I recently tuned, I changed this value to match a single DB page size, which helped reduce read latency. Increasing or decreasing it beyond that value hurt performance in my case.
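Both knobs live in sysfs and can be changed at runtime. As a minimal sketch (the device name and the numbers are just placeholders, not recommendations; match them to your own hardware and DB page size):

    DEV=sda   # placeholder device name

    # Smaller request queue for lower latency (the default is typically 128)
    echo 32 > /sys/block/$DEV/queue/nr_requests

    # Match read-ahead to a single DB page, e.g. 16 KB pages
    echo 16 > /sys/block/$DEV/queue/read_ahead_kb

    # Verify
    cat /sys/block/$DEV/queue/nr_requests /sys/block/$DEV/queue/read_ahead_kb

Keep in mind these settings don't survive a reboot, so you'd reapply them from a udev rule or an init/rc.local script.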
As for other options or settings for block device queues:
max_sectors_kb = I've set this value to match what the hardware allows for a single transfer (check the read-only max_hw_sectors_kb file in sysfs to see what's allowed).
nomerges = this lets you disable or adjust the lookup logic for merging I/O requests. (Disabling merges can save you some CPU cycles, but I haven't seen any benefit from changing this on my systems, so I left it at the default.)
rq_affinity = I haven't tried this yet, but here is the explanation behind it from the kernel docs:
"If this option is '1', the block layer will migrate request completions to the cpu "group" that originally submitted the request. For some workloads this provides a significant reduction in CPU cycles due to caching effects.
For storage configurations that need to maximize distribution of completion processing setting this option to '2' forces the completion to run on the requesting cpu (bypassing the "group" aggregation logic)."
scheduler = you said that you tried deadline and noop. I've tested both, and deadline wins out in my most recent testing for a database server. NOOP performed well, but I was still able to get better performance for our database server by adjusting the deadline scheduler. (A quick sysfs sketch for applying the queue settings above follows just below.)
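A rough sketch of applying those queue-level settings (the device name and values are illustrative placeholders only; check what your own hardware reports):

    DEV=sda                       # placeholder device name
    Q=/sys/block/$DEV/queue

    # Cap single transfers at what the hardware reports it can do
    cat $Q/max_hw_sectors_kb      # read-only hardware limit
    cat $Q/max_hw_sectors_kb > $Q/max_sectors_kb

    # 0 = leave merging enabled (default); 2 = disable merge lookups entirely
    echo 0 > $Q/nomerges

    # 1 = complete on the submitting CPU group; 2 = complete on the submitting CPU itself
    echo 1 > $Q/rq_affinity

    # Switch the I/O scheduler; the active one is shown in [brackets]
    echo deadline > $Q/scheduler
    cat $Q/scheduler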
Options for the deadline scheduler are located under /sys/block/{sd,cciss,dm-}*/queue/iosched/ (a short sketch setting them follows this list):
fifo_batch = kind of like nr_requests, but specific to the scheduler; it controls the batch size of read and write requests. The rule of thumb is to tune this down for lower latency or up for throughput.
write_expire = sets the expiry time for write requests; the default is 5000 ms. Again, decreasing this value lowers write latency, while increasing it favors throughput.
read_expire = sets the expiry time for read requests; the default is 500 ms. The same rules apply here.
front_merges = on by default, but I tend to turn it off. I don't see the need for the scheduler to waste CPU cycles trying to front-merge I/O requests.
writes_starved = since deadline is geared toward reads, the default is to process 2 read batches before a write batch is processed. I found the default of 2 to be good for my workload.
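As a minimal sketch of tuning those (the device is a placeholder and the numbers are examples, not recommendations):

    IOSCHED=/sys/block/sda/queue/iosched   # placeholder device

    echo 16   > $IOSCHED/fifo_batch        # requests per batch; lower for latency, higher for throughput
    echo 500  > $IOSCHED/read_expire       # read deadline in ms
    echo 5000 > $IOSCHED/write_expire      # write deadline in ms
    echo 0    > $IOSCHED/front_merges      # disable front merges
    echo 2    > $IOSCHED/writes_starved    # read batches dispatched per write batch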
You asked about two different targets (hardware card or software mdadm) so I'll answer them each separately.
As for moving between cards, going from LSI to LSI usually works great. In my experience, I've transplanted an entire array from an older-series LSI card to a newer 9620 with absolutely no hiccups. The controllers understand the metadata well enough to import the correct configuration. If it doesn't import the configuration correctly, just back out of the BIOS tool without making any changes, and plug the old card back in.
If you have 50% redundancy in your array (e.g. a 2 disk RAID1), it can't hurt to take one disk out and plug it into the new card. The system should pick up the configuration from this disk alone. If you are able to boot into it and see data, you are all set. Just add the other disk to the new card too, and let it rebuild.
As for software RAID, depending on the number of disks and the type of RAID, LSI's setup allows you to mount the disks directly from the command line in Linux. I've taken apart a RAID1 from an LSI 9620 (identical to your SMC2108), plugged one disk right into the motherboard, and booted. If you have a RAID5 or RAID10, obviously that would not work quite so well.
The best course of action would be to use extra hard disks to build your mdadm RAID in the correct size and configuration, then copy the data over.
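If you go that route, a rough sketch might look like the following. The device names, RAID level, mount points, and config path are placeholders for illustration; adjust them to your setup:

    # Build the new array from the spare disks (RAID1 here as an example)
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc /dev/sdd

    # Put a filesystem on it and mount it
    mkfs.ext4 /dev/md0
    mount /dev/md0 /mnt/newarray

    # Copy the data off the old (hardware RAID) volume, preserving permissions and attributes
    rsync -aHAX /mnt/oldarray/ /mnt/newarray/

    # Record the array so it assembles at boot (config file path varies by distro)
    mdadm --detail --scan >> /etc/mdadm/mdadm.conf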
Best Answer
Probably not. You've added a foreign component to the HP system. It can't be expected to work the same way as native parts.
My experience with LSI controllers running the internal disks of a ProLiant server is that the disk LED activity is random or may not correlate to what's actually happening. But in the end, it really doesn't matter.
It's worth noting that your (SATA) SSDs are probably running at a lower link speed than you expect. The G5 backplane (and the rest of the hardware) is old enough that it likely tops out at 3.0 Gbps regardless of what the drives are capable of, so yeah...
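If you want to confirm the negotiated speed, something along these lines should show it (the device name is a placeholder; smartctl is from the smartmontools package, and drives sitting behind a MegaRAID controller may need its -d megaraid,N option):

    # Negotiated vs. maximum SATA link speed for one drive
    smartctl -i /dev/sda | grep -i 'sata version'

    # Or check the kernel log for the speed each link came up at
    dmesg | grep -i 'sata link up'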