Write caching in Linux

disk-cache hardware-raid hp hp-proliant hp-smart-array

I have read a lot on Server Fault and searched Google about write caching, but I still can't find the answer.

I have an HP ProLiant DL380 G5 with hardware RAID and a 512 MB battery-backed cache. I use Debian Linux.

There are 3 write caches:
1. OS write cache
2. HW RAID cache (512 MB, battery-backed)
3. HDD caches

My question is: how do I configure them properly so that no data is lost during a power failure?

I was thinking that disabling the OS write cache and the HDD caches would solve the problem, and that the system would still perform well because of the HW RAID cache. Am I right?

The second question is about the HW RAID cache read/write ratio. Since the OS RAM is already used as a read cache, I was thinking it would be better to change the HW RAID cache ratio to 0/100 or maybe 0/80 (read/write), so that it is utilized better than at 50/50, which I think is the default read/write cache ratio. What are the optimal values for this ratio?

Thank you

Best Answer

These systems are designed to just plug in and go. Here's how each tier handles I/O.

OS

Writes are cached briefly (as dirty pages) in RAM while the I/O subsystem actually commits them. Once a write is committed, the page stays cached in case it is immediately read again. The OS cache does not maintain a pool of uncommitted writes; it maintains a pool of already committed writes that may need to be read again. It is, in effect, a 100% read cache.
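For an application that cannot tolerate even that brief dirty-page window, the usual approach is to ask the kernel for an explicit flush. The following is a minimal Python sketch, not something specific to HP hardware; `durable_write` and the temporary file path are hypothetical names used only for illustration. On a setup like the one in the question, I would expect the flush to return once the controller acknowledges the write, but that depends on the volume's cache policy.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data and block until the kernel reports it flushed to storage."""
    with open(path, "wb") as f:
        f.write(data)          # lands in Python's userspace buffer
        f.flush()              # moves it into the kernel page cache (dirty pages)
        os.fsync(f.fileno())   # asks the kernel to flush those pages to the device


if __name__ == "__main__":
    # Hypothetical demo file in a fresh temporary directory.
    path = os.path.join(tempfile.mkdtemp(), "demo.bin")
    durable_write(path, b"payload")
    with open(path, "rb") as f:
        print(f.read())
```

Databases and mail servers already do this for you, which is part of why the OS cache rarely needs tuning.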

RAID Controller

The battery-backed cache (BBC) of the RAID controller receives the write from the OS. Depending on the cache policy of the volume being written to (write-through vs. write-back), the RAID controller may report the write as committed at this point. It then queues the write for committing to the actual disks.
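The difference between the two policies is only about when the acknowledgement is sent. Here is a toy Python model of that idea; `ToyController` and its methods are made up for illustration and say nothing about how HP's firmware is actually implemented.

```python
from collections import deque


class ToyController:
    """Toy model of a RAID controller cache; not real firmware behavior."""

    def __init__(self, policy: str):
        if policy not in ("write-through", "write-back"):
            raise ValueError(f"unknown policy: {policy}")
        self.policy = policy
        self.cache = deque()   # writes held in the (battery-backed) cache
        self.disk = []         # writes that have reached the disks

    def write(self, block: str) -> str:
        """Accept a write and report where it was when acknowledged."""
        if self.policy == "write-through":
            self.disk.append(block)    # commit to disk before acknowledging
            return "acked after disk"
        self.cache.append(block)       # acknowledge from cache, commit later
        return "acked from cache"

    def destage(self) -> None:
        """Drain queued writes from cache to disk, oldest first."""
        while self.cache:
            self.disk.append(self.cache.popleft())


if __name__ == "__main__":
    wb = ToyController("write-back")
    print(wb.write("A"), wb.disk)   # acked from cache []
    wb.destage()
    print(wb.disk)                  # ['A']
```

In write-back mode the window between `write` and `destage` is exactly what the battery protects.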

Disk

Some RAID cards actually do disable the HD cache; others don't. I don't remember how HP handles this, but I would not be surprised if the HD cache is disabled and the write-optimization logic is pushed up into the RAID controller itself; there is a reason HP uses custom firmware on their drives.


Operating systems, and the filesystems they support, know very well that sudden power-loss is a failure mode that can kill writes between the time the OS determines that it needs to happen and when the storage system reports it is done. We've been doing this a while now, and we're pretty good at defending against it.

The XFS filesystem has a bad reputation for survivability in sudden power-loss situations because of how it handles metadata writes. But then, its intended environment is one where power is presumed to be adequately redundant. Other filesystems, such as the ext series, btrfs, and of course zfs, survive sudden power loss just fine.


If you're operating in an environment with known bad power, to ensure no data loss during power outages:

  • Use a filesystem known to be robust for sudden power loss (basically, anything but XFS)

And that's it. The BBC on the RAID card ensures the RAID cache is preserved until power is restored. The disk caches are likely disabled. No need to change the RAID card's read/write cache ratio. No need to disable the OS block caches.

Really.
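If you are not sure which filesystem each of your mounts uses, the kernel lists them in `/proc/mounts`. A small Python sketch for reading it follows; `filesystem_types` is a hypothetical helper name, and it assumes the standard whitespace-separated layout of device, mount point, and filesystem type.

```python
def filesystem_types(mounts_text: str) -> dict:
    """Map mount point -> filesystem type from /proc/mounts-style text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            # Fields are: device, mount point, filesystem type, options, ...
            _device, mount_point, fs_type = fields[:3]
            result[mount_point] = fs_type
    return result


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        for mount_point, fs_type in filesystem_types(f.read()).items():
            print(f"{mount_point}\t{fs_type}")
```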