Linux – Anybody experiencing full system lockups with LUKS?

dmcrypt linux luks server-crashes

I've recently set up a couple of new servers. This time I'm encrypting most of my partitions using dmcrypt+LUKS. However, these new servers crash very often, every few days. These are full lockups: the kernel does not respond to the keyboard and the system does not answer pings.

According to Munin graphs and atop records, there was no increase in resource usage before the crashes. There are no relevant records in the local syslog logs or on our remote log host (which the new servers forward syslog to). There are no relevant netconsole messages either (the new servers forward all kernel messages to a log host using netconsole). The kernel didn't even print anything to the TTY.

I asked the hosting company to perform a full hardware test, and they found nothing. I suspect LUKS. Does anybody else experience full lockups with LUKS? The only reference I could find is http://ubuntuforums.org/showthread.php?t=2125287.

Best Answer

I had similar problems when trying to set up Arch and Debian systems on a dmcrypt+LUKS partition. The issue always surfaced while secure-erasing the LUKS partition using the dd if=/dev/zero of=/dev/mapper/crypt1 command, after overwriting around 6-7 GB of data. It turned out to be a faulty memory module, one of four 4 GB modules.

Point 4.3 on the cryptsetup FAQ page describes how faulty memory can cause drastic corruption while writing to encrypted devices, along with related symptoms such as freezes and lock-ups, which led me to suspect faulty memory.

If I were you, I would be suspicious about how that hosting company checked their systems. Ask them to forward you the results of at least one full cycle of Memtest86+ and memtester.
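While waiting for the memory test results, a write-then-verify pass on the encrypted volume can hint at the kind of silent corruption described above. The following is only a minimal sketch in Python, not a replacement for Memtest86+ or memtester: the function names and the block size are my own choices, and a clean run proves nothing, because data read back shortly after writing may be served from the page cache rather than from the disk.

```python
import hashlib
import os


def write_blocks(path, total_bytes, block_size=1024 * 1024):
    """Write random data to `path` and return the SHA-256 digest of each block."""
    digests = []
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(block_size, total_bytes - written)
            block = os.urandom(n)
            digests.append(hashlib.sha256(block).digest())
            f.write(block)
            written += n
        # Push the data out of Python's buffer and ask the kernel to flush it.
        f.flush()
        os.fsync(f.fileno())
    return digests


def verify_blocks(path, digests, block_size=1024 * 1024):
    """Re-read `path` and return the indices of blocks whose digest changed."""
    mismatches = []
    with open(path, "rb") as f:
        # Best effort (assumption: Linux): hint the kernel to drop cached pages
        # so that the reads are more likely to hit the device itself.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
        for index, expected in enumerate(digests):
            block = f.read(block_size)
            if hashlib.sha256(block).digest() != expected:
                mismatches.append(index)
    return mismatches


def run_check(path, total_bytes, block_size=1024 * 1024):
    """Write, verify, and return the list of corrupted block indices."""
    digests = write_blocks(path, total_bytes, block_size)
    return verify_blocks(path, digests, block_size)
```

For example, run_check("/mnt/crypt/stress.bin", 8 * 1024**3) would write and re-check 8 GB, where /mnt/crypt is a hypothetical mount point on the LUKS device. A non-empty result, or a lockup partway through as happened to me with dd, points towards hardware rather than LUKS itself.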

NOTES

Just for reference, here are some of the posts and discussions describing similar issues that I went through while searching for hints and a solution:

  • This guy had some CPU lock-ups reported by the watchdog processes. His issue does not seem to be related to encryption or faulty memory, but rather to a faulty CPU fan. Still, this was when I started to suspect hardware problems.
  • These guys seem to have similar symptoms, and the last sentence in the thread mentions "large amount of RAM".
  • This thread (also here) describes a soft lock-up issue with kernel version 2.6.24, a long time ago, for which a patch was submitted back then. The symptoms seem similar, but the root cause for me was different. This post seems to describe the same issue too.