Ubuntu – LSI 3Ware tw_cli and 3dm2 segfault with Debian Linux kernels after 3.8

I have notified LSI support, twice, but so far they are unable to reproduce the problem. I wanted to post here to get some unbiased expert thoughts about it and see if anyone else has seen a similar problem.

We manage a number of servers that supply Internet services with very heavy disk IO. All run Debian testing (amd64) and use 3ware RAID cards from the 85xx – 96xx series. With the Debian kernel updates to 3.9.x-amd64 we started getting a segfault with tw_cli. We tested 3dm2 and it also segfaults.

To reproduce the problem (you don't need a RAID card in your system to do this):
1. Fresh install of Debian testing. ISO is http://cdimage.debian.org/cdimage/weekly-builds/amd64/iso-cd/
2. Install tw_cli and try to run it.

We ran tw_cli as root with strace under 3.2 and 3.9.6/3.9.8-amd64 and the segfault is happening right after tw_cli calls uname as you can see below.

Good Run:

execve("/usr/local/sbin/tw_cli", ["tw_cli", "/c0", "show", "all"], ["TERM=xterm", "SHELL=/bin/bash", "SSH_CLIENT=71.207.183.174 60609 "..., "SSH_TTY=/dev/pts/0", "USER=root", "MAIL=/var/mail/root", "PATH=/usr/local/sbin:/usr/local/"..., "PWD=/root", "LANG=C", "PS1=\\h:\\w\\$ ", "SHLVL=1", "HOME=/root", "LOGNAME=root", "SSH_CONNECTION=71.207.183.174 60"..., "_=/usr/bin/strace"]) = 0
uname({sysname="Linux", nodename="yorick.ironicdesign.com", release="3.2.0-4-amd64", version="#1 SMP Debian 3.2.46-1", machine="x86_64"}) = 0
brk(0)                                  = 0x2664000
brk(0x2685000)                          = 0x2685000
uname({sysname="Linux", nodename="yorick.ironicdesign.com", release="3.2.0-4-amd64", version="#1 SMP Debian 3.2.46-1", machine="x86_64"}) = 0
open("/proc/devices", O_RDONLY)         = 3
...

Bad run:

execve("/usr/local/sbin/tw_cli", ["tw_cli", "/c0", "show", "all"], ["SHELL=/bin/bash", "TERM=screen", "SSH_CLIENT=98.26.9.112 58271 22", "SSH_TTY=/dev/pts/0", "USER=root", "SSH_AUTH_SOCK=/tmp/ssh-595iwzIik"..., "TERMCAP=SC|screen|VT 100/ANSI X3"..., "PATH=/usr/local/sbin:/usr/local/"..., "MAIL=/var/mail/root", "STY=17473.mdorman", "PWD=/root", "LANG=C", "PS1=\\h:\\w\\$ ", "HOME=/root", "SHLVL=2", "LOGNAME=root", "WINDOW=0", "SSH_CONNECTION=98.26.9.112 58271"..., "_=/usr/bin/strace"]) = 0
uname({sysname="Linux", nodename="yorick.ironicdesign.com", release="3.10-1-amd64", version="#1 SMP Debian 3.10.1-1 (2013-07-16)", machine="x86_64"}) = 0
brk(0)                                  = 0x26ef000
brk(0x2710000)                          = 0x2710000
uname({sysname="Linux", nodename="yorick.ironicdesign.com", release="3.10-1-amd64", version="#1 SMP Debian 3.10.1-1 (2013-07-16)", machine="x86_64"}) = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
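
For anyone who wants to check their own box, a trace like the two above can be saved to a file and searched for the last call before the crash. This is only a rough sketch: last_call_before_segv is a made-up helper name, and the /tmp path is just an example.

```shell
#!/bin/sh
# last_call_before_segv is a hypothetical helper: it prints the line that
# strace logged immediately before the SIGSEGV marker in a saved trace.
last_call_before_segv() {
    awk '/^--- SIGSEGV/ { print prev; exit } { prev = $0 }' "$1"
}

# Capturing and checking a trace (commented out: needs strace and tw_cli):
#   strace -o /tmp/tw_cli.trace tw_cli /c0 show all
#   last_call_before_segv /tmp/tw_cli.trace
```

If the trace contains no SIGSEGV line, the helper prints nothing.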

In the good run above, the next call after uname opens /proc/devices, which DOES exist and should not be a problem. Something else we think is notable, and you can see it in the bad run above: the version string returned by uname on the 3.9/3.10 kernel includes a date.
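
Looking at the two traces side by side, the date is not the only difference: the release field also goes from three numeric components (3.2.0) to two (3.10). The sketch below just makes that visible. release_components is a made-up helper, and nothing here shows which difference, if either, is what trips tw_cli.

```shell
#!/bin/sh
# release_components is a hypothetical helper: it counts the dot-separated
# components before the first "-" in a kernel release string.
release_components() {
    printf '%s\n' "${1%%-*}" | awk -F. '{ print NF }'
}

# Release strings copied from the good and bad strace runs.
for rel in "3.2.0-4-amd64" "3.10-1-amd64"; do
    printf '%s -> %s components\n' "$rel" "$(release_components "$rel")"
done

# On a live system the two fields can be inspected separately:
#   uname -r   # release
#   uname -v   # version (this is the field carrying the date)
```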

We think these two strace runs may indicate tw_cli is crashing because it is getting an unexpected response from the uname call. LSI support says:

"3dm2 and tw_cli work fine even with Ubuntu latest kernels 3.10.x and Ubuntu usually pulls unstable kernels from Debian and use it for their releases."

FWIW, I am not sure what LSI support is talking about. We just tested with a fresh, up-to-date install of Ubuntu 13.04 (Raring Ringtail) and uname -a shows "Linux mac-workstation 3.8.0-26-generic #38-Ubuntu SMP Mon Jun 17 21:43:33 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux". So Ubuntu 13.04 is using the 3.8 kernel, not 3.10. And tw_cli & 3dm2 both work fine.

So any helpful thoughts? At the moment our options appear to be:
– pin our kernel version to 3.2 and hope whatever the problem is gets fixed soon
– stop monitoring our RAIDs (not really an option)
– compile custom kernels for all our servers because apparently the stock Debian Testing kernel has this problem
– switch to Ubuntu for all our servers (not feasible)
– switch our RAID cards to someone like Areca (also not feasible for existing servers, but is being considered for our next server generation)
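
For the first option, a package hold with apt-mark is one way to pin the kernel. This is only a sketch: hold_kernel_cmd is a made-up helper that prints the command rather than running it, and it assumes the linux-image-amd64 meta-package is what would otherwise pull in newer kernels.

```shell
#!/bin/sh
# hold_kernel_cmd is a hypothetical helper: it only PRINTS the apt-mark
# command for a given kernel release, so nothing is changed on the system.
hold_kernel_cmd() {
    # Hold the specific image and (assumption) the meta-package that
    # would otherwise pull in newer kernels.
    printf 'apt-mark hold linux-image-%s linux-image-amd64\n' "$1"
}

# Release string taken from the good strace run.
hold_kernel_cmd "3.2.0-4-amd64"
```

The printed command has to be run as root, and if a newer kernel is already installed the bootloader still needs to default to the held one.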

=================== followup ============================

I just received the following response from LSI/3ware support. I am afraid my response to them was not very nice, though I believe it summed up the situation accurately.

LSI/3ware said: We are able to reproduce the issue with Debian unstable kernel 3.9-1-amd64 but engineering does not release software for un-stable or un-released kernels. If possible, please wait until Debian officially releases the kernel. 3dm2 and tw_cli should work with Ubuntu official release 13.04 including updated kernels 3.8.x to 3.10.

My response:

So the end result is:

  • You will not do a fresh install of Debian Testing which will reproduce the problem. I even gave you the link to the "Official" Testing ISO which DOES have the problem.

Instead you first compile a custom kernel which somehow avoids the problem. Then you jump OVER Testing to Unstable to reproduce the problem. Except "engineering does not release software for un-stable or un-released kernels"…so once again you avoid having to take any action.

  • Then you have the nerve to suggest we are not using the Debian official release (we ARE) OR that we can just shut down our services running on all our servers and swap to a new distribution???

The good news for us is we are in the Debian community and will let everyone know how this has been handled by LSI. This is going to send a STRONG signal to the rest of the Linux community about the long term viability of your products.

Thank you

============= my conclusion =============

Yes, we DO use the official Debian Testing release in production and some think that is not wise.

Debating that does not address the problem here, though: eventually the kernel in Testing makes its way into Stable. And the time for any manufacturer to fix their proprietary software that is essential to the use of their product is with the Testing distribution…BEFORE it gets to Stable.

So while we wait for LSI/3ware to decide to load the official Debian Testing and fix their software, we will probably pin our kernel to 3.2. We may also find the time to compile a 3.10 kernel that does not output a date with uname -v to see if that is indeed the cause. If it is, we may be able to get that changed in the uname call for the kernel.

Best Answer

I had the same problem here on Debian Testing with Kernel 3.12.XXX. For me the command setarch (or linux64) worked:

web3:~# setarch x86_64 --uname-2.6 tw_cli /c0/u0 show all

or

web3:~# linux64 --uname-2.6 tw_cli /c0/u0 show all
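
For monitoring scripts it may be handy to wrap this so every tw_cli call goes through setarch. A minimal sketch, assuming an x86_64 host as in the commands above: tw_cli_compat, TW_CLI and TW_DRY_RUN are made-up names for illustration, not options of tw_cli or setarch.

```shell
#!/bin/sh
# tw_cli_compat is a hypothetical wrapper; TW_CLI and TW_DRY_RUN are
# illustrative variables, not part of tw_cli or setarch.
TW_CLI=${TW_CLI:-tw_cli}

tw_cli_compat() {
    if [ "${TW_DRY_RUN:-0}" = "1" ]; then
        # Show what would be executed without touching the controller.
        echo "setarch x86_64 --uname-2.6 $TW_CLI $*"
    else
        # Run tw_cli with the uname-2.6 personality, as in the answer.
        setarch x86_64 --uname-2.6 "$TW_CLI" "$@"
    fi
}

# Dry run: prints the command line instead of executing it.
TW_DRY_RUN=1
tw_cli_compat /c0/u0 show all
```

Unset TW_DRY_RUN (or set it to 0) to actually run the command.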