Linux – How to diagnose a Linux server freeze without any kernel panics or logs

linux, ubuntu

Over the past few months, a small cluster of Ubuntu servers has been crashing on me. Most of the time when a server crashes, I can find some evidence of the reason in dmesg, syslog, or one of the various other log systems. For this particular freeze, however, nothing at all shows up in any of the logs. The system simply freezes: keyboard input stops working and the machine no longer responds to ping. It's offline but still drawing power.

If it were just one server, I would blame the hardware, but this affects multiple servers, with multiple generations of CPUs, RAM, and motherboards all showing the same issue.

I've tried upgrading to various kernels (currently on 4.15), but that did not stop the freezes either.

What I'm mostly looking for is a way to increase kernel logging, or some other way to find out what the frozen server was doing before it froze.

Best Answer

Hard lockups are often due to power management problems. Try to:

  1. in the BIOS, disable any power management settings (e.g. P-states and C-states) or, alternatively, select the "maximum performance" profile;

  2. inside the OS, use the "performance" power governor (e.g. cpupower frequency-set -g performance);

  3. make the kernel poll while idle by booting with the idle=poll parameter;

  4. disable any C-states by booting with the intel_idle.max_cstate=0 processor.max_cstate=0 kernel parameters.
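Before changing anything, it may help to record which governor and idle driver are currently in effect, so that each later change can be verified. The following is a minimal sketch, assuming the usual sysfs layout under /sys/devices/system/cpu. The function name show_pm_state and its optional root argument are illustrative and not part of any existing tool.

```shell
#!/bin/sh
# Print the cpufreq governor of every CPU and the active cpuidle driver.
# The optional first argument overrides the sysfs root; it exists only so
# the function can be tried against a copy or a fake directory tree.
show_pm_state() {
    root="${1:-/sys/devices/system/cpu}"

    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip CPUs without cpufreq support (or an unmatched glob).
        [ -r "$f" ] || continue
        cpu=${f#"$root"/}     # strip the root prefix  -> cpu0/cpufreq/...
        cpu=${cpu%%/*}        # keep the first element -> cpu0
        printf '%s governor: %s\n' "$cpu" "$(cat "$f")"
    done

    # The cpuidle driver file is absent when no idle driver is loaded.
    if [ -r "$root/cpuidle/current_driver" ]; then
        printf 'cpuidle driver: %s\n' "$(cat "$root/cpuidle/current_driver")"
    fi
}
```

Calling show_pm_state with no arguments reads the live system. Running it again after each change shows whether the governor or idle driver actually switched.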

In order to discover the cause of your lockup, you should apply one change at a time. Moreover, please note that steps #3 and #4 will have a significant impact on power consumption and efficiency, so you should use the suggested kernel command line for testing and diagnosis only.
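To apply the boot parameters one at a time, a small helper can build the new kernel command line without duplicating entries. This is only a sketch: add_kernel_param is a hypothetical helper, and the workflow described in the comments assumes a stock Ubuntu GRUB setup (/etc/default/grub and update-grub).

```shell
#!/bin/sh
# Append one kernel parameter to an existing command line, unless that
# exact parameter is already present. Prints the resulting command line.
add_kernel_param() {
    current="$1"
    param="$2"
    case " $current " in
        *" $param "*) printf '%s\n' "$current" ;;                  # already set
        *) printf '%s\n' "${current:+$current }$param" ;;          # append
    esac
}

# Assumed workflow on Ubuntu, one parameter per test round:
#   1. Put the printed value into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub.
#   2. Run: sudo update-grub
#   3. Reboot, then confirm the parameter is active with: cat /proc/cmdline
```

For example, add_kernel_param "quiet splash" "idle=poll" yields the line to test step #3 on its own. Once a test round is finished, remove the parameter again before trying the next one, so that a change in behavior can be attributed to a single setting.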