Linux – What can cause a kernel hang on redhat 4

kernel, kernel-panic, linux, redhat

I have to solve a nasty problem on a ten-machine "cluster": at random, one of the machines hangs during a heavy computation. Sometimes it still answers ping, sometimes not.

The problem was described to me over the phone and I have not yet seen or touched these machines, so I can't be more precise. It seems there is no (real) keyboard or monitor attached to them, so I have no information from keyboard LEDs or on-screen messages.

Don't worry about that: what I really need are suggestions on where to look for the problem, that is, on what can cause a kernel hang on a previously working machine.

I also saw this post, but it seems to describe the same need in a different situation.

My ideas so far:
– hardware problem (RAM, CPU, fan, etc.)
– bad autofs configuration
– bad NFS(?) configuration
– presence of a trojan/hacker/etc.
– /dev/"swap" linked to /dev/zero
– kernel out of memory(??)
– kernel bug

In other words, I am trying to imagine what kind of event can crash the kernel instead of the application that generates the event.

What hangs have YOU experienced before? Tell me about them!

TIA

Best Answer

First of all, while RHEL 4 is pretty old, it is still maintained, and you can try updating to the latest patches (see the Wiki information).

A kernel panic or hang can have many causes. The ones I have experienced were mainly due to:

  1. Memory problem: burn (for instance) an Ubuntu CD, boot from it, and just run memtest86+. It actively tests the memory (it may take some time to reveal a problem).

  2. Hardware problem: faulty hardware can raise unexpected interrupts that put the system in an unrecoverable state, send kernel execution off into "space", corrupt the stack, and so on.

  3. Module problem: an inappropriate module (for instance one that doesn't exactly match the hardware, or a buggy module) runs with privileged access and may hang the system. Older kernels are particularly at risk (newer versions recover better from a defective module).
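
To get a first hint about which of these categories applies, it can help to look at what the kernel managed to log before the hang. Below is a minimal sketch of that idea, not a definitive tool: it assumes the syslog is readable at /var/log/messages after a reboot, and the `SIGNATURES` patterns and the `classify_kernel_log` helper are illustrative names and guesses of mine. The exact message wording differs between kernel versions, and a hard hang may leave nothing in the log at all.

```python
import re
import sys
from collections import Counter

# Illustrative patterns only (an assumption): adjust them to the messages
# your kernel version actually prints.
SIGNATURES = {
    "oom": re.compile(r"out of memory|oom-killer", re.IGNORECASE),
    "panic_or_oops": re.compile(r"kernel panic|\boops\b|kernel BUG at", re.IGNORECASE),
    "hardware": re.compile(r"machine check|hardware error|\bEDAC\b", re.IGNORECASE),
    "nfs": re.compile(r"nfs: server .* not responding", re.IGNORECASE),
}


def classify_kernel_log(lines):
    """Return (counts per category, list of (category, line)) for matching lines."""
    counts = Counter()
    hits = []
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
                hits.append((label, line.rstrip()))
                break  # count each line once, under the first matching category
    return counts, hits


if __name__ == "__main__":
    # Default path is an assumption; pass another log file as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    with open(path, errors="replace") as fh:
        counts, hits = classify_kernel_log(fh)
    for label, line in hits:
        print(f"[{label}] {line}")
    print(dict(counts))
```

If nothing matches on any of the machines, that in itself points more toward a hardware or module problem, since the kernel had no chance to write anything to disk.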

I have also seen mysterious hangs on (old) systems that were due to:

  1. A dead motherboard CMOS battery (change it, it's cheap).

  2. A bad network cable.

Maybe this is the right time to upgrade to a newer system (nowadays, there is nothing wrong with running a server on Ubuntu 10.04.1 LTS, for instance).