Hardware RAID controller cache battery failure frequency/lifetime

battery cache hardware hardware-raid

I'm in an environment that contains many Supermicro servers equipped with Adaptec and LSI MegaRAID hardware RAID controllers. These controllers contain battery-backed cache modules to help boost write performance and protect data in transit.

A frequent support issue is RAID controller battery failure. This shifts the array from write-back to write-through mode. There's clearly a negative performance impact, as the system runs with degraded write speed. This persists until a downtime window can be established to power the system down and replace the battery.

This is a very routine operation for us; almost weekly across several thousand physical servers… We even have charging stations in place to prep replacement batteries so that they can be swapped in without a charge cycle.

Perhaps I'm spoiled by a long history with HP ProLiant servers and Smart Array RAID controllers, but HP systems typically had battery lifetimes of 4-6 years. HP eventually eliminated the use of RAID batteries around 2009. The batteries were replaced with supercapacitor-backed memory modules (flash-backed write cache, or FBWC), which don't require replacement, disposal or a lengthy initial charge cycle.

Since I see the Adaptec and LSI controller battery failures sometimes occurring on systems that have been in service for less than 12 months, I wonder if this is common in other environments.

If this is common, how do other large server environments handle this?

  • Any tips or tricks for handling RAID battery replacements?
  • Are there any configuration parameters that can help?
  • How disruptive is this to operations in your environment?
  • Could poor chassis cooling and temperature be a factor?
  • Are we doing something wrong?
  • Dell PERC controllers are made by LSI. Do Dell environments experience the same short battery lifetimes?

LSI product literature outlines a new-generation battery that can last longer than 1 year in service.

HP ProLiant DL585 G2 server with 1000+ day uptime and a happy RAID battery…

# uptime 
 05:38:08 up 1031 days, 44 min, 31 users,  load average: 0.49, 0.64, 0.99

# hpacucli
   Cache Board Present: True
   Cache Status: OK
   Accelerator Ratio: 50% Read / 50% Write
   Total Cache Size: 512 MB
   Battery Pack Count: 1
   Battery Status: OK

Best Answer

I suspect your Supermicros are broken in one way or another; possibly the battery packs are overheating. Most recent LSI controllers report the battery temperature through MegaCLI, so you might want to monitor this value on the servers that needed replacements.

root@host:~/SOLARIS# ./MegaCli -AdpBbuCmd -GetBbuStatus -aALL

BBU status for Adapter: 0

BatteryType: BBU
[...]
Temperature: 41 C
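
To monitor this value across many hosts, the output can be filtered down to one temperature reading per adapter. The following is only a minimal sketch: the function names bbu_temps and bbu_temp_check and the 45 C default threshold are hypothetical, and the parsing assumes the "BBU status for Adapter:" and "Temperature: NN C" lines look like the sample output above.

```shell
#!/bin/sh
# Read `MegaCli -AdpBbuCmd -GetBbuStatus -aALL` output on stdin and print
# one "<adapter> <temperature>" pair per BBU.
# Assumption: field layout matches the sample output shown above.
bbu_temps() {
    awk '
        /^BBU status for Adapter:/ { adapter = $NF }
        # Only accept lines whose value is numeric, e.g. "Temperature: 41 C".
        /^ *Temperature:/ && $2 ~ /^[0-9]+$/ { print adapter, $2 }
    '
}

# Print a warning and return non-zero when any BBU is at or above the limit.
# The default limit of 45 C is a placeholder, not a vendor recommendation.
bbu_temp_check() {
    limit="${1:-45}"
    bbu_temps | awk -v limit="$limit" '
        $2 >= limit { printf "WARNING adapter %s BBU at %s C\n", $1, $2; bad = 1 }
        END { exit bad + 0 }
    '
}
```

Something like `./MegaCli -AdpBbuCmd -GetBbuStatus -aALL | bbu_temp_check 45` could then run from cron or a monitoring agent, with the threshold tuned to what the healthy servers in the fleet report.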

I have seen a couple of Dell and Fujitsu systems with LSI BBU controllers, and none of them needed yearly battery pack replacement (unless the pack had been ruined by a deep discharge). The typical lifetime has been around 3 to 5 years.