BBWC – Has It Ever Saved Your Data?

bbwc disaster-recovery hardware-raid storage

I'm familiar with what a BBWC (battery-backed write cache) is intended to do, and I have previously used them in my servers even with a good UPS. There are obviously failures it does not protect against. I'm curious whether it actually offers any real benefit in practice.

(NB I'm specifically looking for responses from people who have BBWC and had crashes/failures and whether the BBWC helped recovery or not)

Update

After the feedback here, I'm increasingly skeptical as to whether a BBWC adds any value.

To have any confidence about data integrity, the filesystem MUST know when data has been committed to non-volatile storage (not necessarily the disk – a point I'll come back to). It's worth noting that a lot of disks lie about when data has been committed to the disk (http://brad.livejournal.com/2116715.html). While it seems reasonable to assume that disabling the on-disk cache might make the disks more honest, there's still no guarantee that this is the case either.
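To make the "filesystem MUST know" point concrete, here is a minimal sketch of how an application asks for that guarantee on a POSIX system, using Python's `os.fsync`. The helper name `durable_write` is my own, and the directory fsync assumes a POSIX platform. Note that `fsync` only asks the kernel to push the data down to the device; whether the disk or controller then tells the truth about it is exactly the problem described above.

```python
import os


def durable_write(path, data):
    """Write bytes to path and request a flush to the storage device.

    Hypothetical helper for illustration only. Assumes a POSIX system:
    opening a directory with os.open() does not work on Windows.
    Returns the number of bytes written.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to flush this file's data and metadata to the device.
        # This cannot verify that the device itself is honest about it.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Also flush the containing directory so the new entry itself is durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return len(data)
```

If the call returns without raising, the OS has done its part; everything below it (controller cache, on-disk cache) still has to keep its promise.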

Due to the typically large buffers in a BBWC, a barrier can require significantly more data to be committed to disk, thereby delaying writes: the general advice is to disable barriers when using a non-volatile write-back cache (and to disable on-disk caching). However, this would appear to undermine the integrity of the write operation – just because more data is held in non-volatile storage does not mean that it will be more consistent. Indeed, arguably, without demarcation between logical transactions there is less opportunity to ensure consistency than otherwise.

If the BBWC were to acknowledge barriers at the point the data enters its non-volatile storage (rather than when it is committed to disk) then it would appear to satisfy the data integrity requirement without a performance penalty – implying that barriers should still be enabled. However, since these devices generally exhibit behaviour consistent with flushing the data to the physical device (significantly slower with barriers), and given the widespread advice to disable barriers, they cannot be behaving in this way. WHY NOT?

If the I/O in the OS is modelled as a series of streams then there is some scope to minimise the blocking effect of a write barrier when write caching is managed by the OS – since at this level only the logical transaction (a single stream) needs to be committed. On the other hand, a BBWC with no knowledge of which bits of data make up the transaction would have to commit its entire cache to disk. Finding out whether the kernel/filesystems actually implement this in practice would require a lot more effort than I'm willing to invest at the moment.
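The difference in how much has to be flushed can be sketched with a toy model. This is purely illustrative: the class names `StreamAwareCache` and `OpaqueCache` are made up, and nothing here reflects how any real kernel or RAID controller is implemented. It only counts the bytes a barrier would force out under each assumption.

```python
from collections import defaultdict


class StreamAwareCache:
    """Toy cache that tracks which stream each buffered write belongs to."""

    def __init__(self):
        self.pending = defaultdict(list)  # stream id -> buffered byte strings

    def write(self, stream, data):
        self.pending[stream].append(data)

    def barrier(self, stream):
        """Flush only the named stream; return the number of bytes flushed."""
        blocks = self.pending.pop(stream, [])
        return sum(len(b) for b in blocks)


class OpaqueCache:
    """Toy cache with no notion of streams, like the BBWC described above."""

    def __init__(self):
        self.pending = []

    def write(self, stream, data):
        # The stream id is ignored: the cache cannot tell writes apart.
        self.pending.append(data)

    def barrier(self, stream):
        """Flush everything buffered; return the number of bytes flushed."""
        flushed = sum(len(b) for b in self.pending)
        self.pending.clear()
        return flushed


def bytes_flushed_on_barrier(cache):
    """Buffer a small 'db' write and a large 'log' write, then barrier on 'db'."""
    cache.write("db", b"x" * 4096)
    cache.write("log", b"y" * (1024 * 1024))
    return cache.barrier("db")
```

With these made-up sizes, `bytes_flushed_on_barrier(StreamAwareCache())` returns 4096, while `bytes_flushed_on_barrier(OpaqueCache())` returns 1052672, because the unrelated 1 MiB log write has to go out as well.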

A combination of disks telling fibs about what has been committed and sudden loss of power undoubtedly leads to corruption – and with a journalling or log-structured filesystem, which doesn't do a full fsck after an outage, it's unlikely that the corruption will be detected, let alone that an attempt will be made to repair it.

In terms of the modes of failure, in my experience most sudden power outages occur because of loss of mains power (easily mitigated with a UPS and managed shutdown). People pulling the wrong cable out of a rack implies poor datacentre hygiene (labelling and cable management). There are some types of sudden power loss which are not prevented by a UPS, such as a failure in the PSU or VRM. A BBWC with barriers would provide data integrity in the event of a failure here, but how common are such events? Very rare, judging by the lack of responses here.

Certainly, moving the fault tolerance higher in the stack is significantly more expensive than a BBWC – however, implementing a server as a cluster has lots of other benefits for performance and availability.

An alternative way to mitigate the impact of sudden power loss would be to implement a SAN – AoE makes this a practical proposition (I don't really see the point in iSCSI) but again there's a higher cost.

Best Answer

Sure. I've had battery-backed cache (BBWC) and later flash-backed write cache (FBWC) protect in-flight data following crashes and sudden power loss.

On HP ProLiant servers, the typical message is:

POST Error: 1792-Drive Array Reports Valid Data Found in Array Accelerator

Which means, "Hey, there's data in the write cache that survived the reboot/power-loss!! I'm going to write that back to disk now!!"

An interesting case was my post-mortem of a system that lost power during a tornado. The array's POST sequence was:

POST Error: 1793-Drive Array - Array Accelerator Battery Depleted - Data Loss
POST Error: 1779-Drive Array Controller Detects Replacement Drives
POST Error: 1792-Drive Array Reports Valid Data Found in Array Accelerator

The 1793 POST error is unique: power was interrupted while the system was in use and data was in the Array Accelerator memory. However, because this was a tornado, power was not restored for four days, so the array batteries were depleted and the data held in the cache was lost. The server had two RAID controllers. The other controller had an FBWC unit, which retains data far longer than a battery. That drive recovered properly. Some data corruption resulted on the array backed by the depleted battery.


Despite plenty of battery runtime at the facility, four days without power and hazardous conditions made it impossible for anyone to shut the servers down safely.