Server freezes without kernel panic

hardware kernel kvm-virtualization server-crashes supermicro

We are running a KVM node that crashes at irregular intervals and behaves very strangely when it does. Interestingly, we already had this problem with another node, which crashed every 1-2 weeks. As we could not find a hardware issue, we began migrating the VMs to a new node. About one week after we had migrated 50% of the VMs, the new node crashed, while the "old" one has been running fine ever since (3 weeks of uptime, which we have not seen for months).

When a node crashes, we sometimes see these strange things on the Supermicro IPMI:

[Two screenshots of the IPMI console; images not available]

We also saw:

  • "No signal" like the server has been powered off (of course it was not, and it was also never shown as powered off on the IPMI main page)
  • The normal login screen or other normal output from the server, but frozen

What we never saw was a kernel panic, or even any messages in the logs before the crash. There is complete silence until the lights suddenly go out.

As the problem "moved" from one server to another (a brand-new machine), there are only a few options left in my opinion:

  • A specific VM is causing the issue
  • Kernel bug
  • Hardware issue regarding our setup

More information about the machines:

  • CentOS 7 with latest kernel (3.10.0-514.2.2.el7.x86_64)
  • Supermicro Case with redundant power supplies
  • Supermicro X10DRi / X10DRWi with latest BIOS version
  • Intel Xeon E5-2630 v3 / v4
  • 512 GB DDR4 ECC RAM (Samsung Server RAM)
  • 145 VMs running (RAM and CPU far away from being saturated, also thanks to KSM)
  • Software RAID-10 with 8 / 16 SSDs

Has anyone seen this behaviour, or can anyone explain the strange "messages" on the console? I have never seen anything like this and do not even know how to describe it for a Google search. At the moment we have no good idea of what to do next, as it could be anything.

Thanks in advance!

Best Answer

This might be a CPU bug. Intel published an erratum about this problem and also provides a microcode update for the E5 v3/v4 CPUs (datecode 20170707). CentOS 7.4 already ships a newer microcode version, 0xb000021 (in CentOS 7.3 it was 0xb00001e). It may help to update the microcode or upgrade to 7.4. I also had a lot of trouble with these system freezes. I replaced the mainboard (X10DRi), RAM, CPU and power supply without success. I can't say for sure whether this is the solution, because I have not had enough uptime since I updated the microcode. Supermicro still does not provide an updated BIOS with the current Intel microcode. You may be able to get an unofficial prerelease for the X10DRi from your distributor.
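
To see which microcode revision a node is actually running, something like the following Python sketch may help. It relies on the `microcode` field that Linux exposes per logical CPU in `/proc/cpuinfo` on x86. The helper names (`microcode_revisions`, `outdated`) are made up for this example, and the threshold is simply the 0xb000021 value quoted above; it is an assumption that this is the right minimum for your exact CPU model, so check it against Intel's release notes before relying on it.

```python
import re
from pathlib import Path

# Revision quoted in the answer for CentOS 7.4. ASSUMPTION: this is only a
# meaningful threshold for the CPU model that answer refers to.
ASSUMED_MIN_REVISION = 0xB000021

# Matches lines such as "microcode\t: 0xb00001e" in /proc/cpuinfo.
MICROCODE_LINE = re.compile(r"^microcode\s*:\s*(0x[0-9a-fA-F]+)\s*$", re.MULTILINE)


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions (as ints) found in cpuinfo text."""
    return {int(m.group(1), 16) for m in MICROCODE_LINE.finditer(cpuinfo_text)}


def outdated(revisions, minimum=ASSUMED_MIN_REVISION):
    """Return the revisions that are lower than the given minimum, sorted."""
    return sorted(r for r in revisions if r < minimum)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        revs = microcode_revisions(cpuinfo.read_text())
        print("running:", ", ".join(hex(r) for r in sorted(revs)) or "unknown")
        old = outdated(revs)
        if old:
            print("older than assumed minimum:", ", ".join(hex(r) for r in old))
    else:
        print("/proc/cpuinfo not found; this check only works on Linux")
```

Run it on each node after a microcode or BIOS update and again after a reboot. If it reports more than one revision on a single machine, that would be worth investigating in its own right.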
