Linux – How to troubleshoot a hanging Linux server

diagnostic linux troubleshooting Ubuntu

I have several Ubuntu Server 8.04 machines at a remote location. Every couple of months or so, one of them stops responding and needs to be power cycled. From my log files, it looks like all my processes are running fine until, at some point, everything just stops.

I suspect it's a hardware problem, but I don't even know how to begin pinpointing the issue. Are there any diagnostic tools or techniques designed to track down this sort of problem?

I know this is a fairly general question, but I'm hoping for a general answer.

Best Answer

Hook up another machine over a serial cable and configure a serial console on the server, so that the kernel messages and any other console output are captured on the second machine. If it's a kernel panic or some other catastrophic problem, you'll see it there. Monitoring temperature and running a memtest are also recommended, especially if the console shows nothing abnormal before the wheels fall off.
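
For the temperature monitoring part, here is a minimal sketch of a logger that appends a timestamped reading to a file and forces it to disk, so the last line written before a hang survives the power cycle. It is only an illustration, not a tested tool: it assumes the kernel exposes sensors under /sys/class/thermal in millidegrees Celsius (older kernels or some hardware may need lm-sensors instead), and the log path, the sampling interval and all function names are made up for this example.

```python
import glob
import os
import time

# Assumption: sensors appear under /sys/class/thermal and report millidegrees
# Celsius. Older kernels or other hardware may need lm-sensors instead.
THERMAL_GLOB = "/sys/class/thermal/thermal_zone*/temp"
LOG_PATH = "/var/log/temp-heartbeat.log"  # hypothetical location
INTERVAL_SECONDS = 60  # arbitrary sampling interval


def parse_millidegrees(raw):
    """Convert a sysfs reading such as '45000' to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_temperatures(pattern=THERMAL_GLOB):
    """Return {path: degrees Celsius} for every readable sensor file."""
    readings = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                readings[path] = parse_millidegrees(handle.read())
        except (OSError, ValueError):
            continue  # unreadable sensor or unexpected content; skip it
    return readings


def format_line(timestamp, readings):
    """Build one log line: a UTC timestamp followed by each reading."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(timestamp))
    temps = " ".join(
        "%s=%.1fC" % (name, readings[name]) for name in sorted(readings)
    )
    return "%s UTC %s" % (stamp, temps or "no-sensors")


def main():
    """Append one line per interval until the process (or machine) stops."""
    with open(LOG_PATH, "a") as log:
        while True:
            log.write(format_line(time.time(), read_temperatures()) + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before a possible hang
            time.sleep(INTERVAL_SECONDS)
```

Calling main() as a user who can write to the log path starts the loop. After the next hang, the last timestamp in the file narrows down when the machine froze, and the readings leading up to it show whether temperature was climbing beforehand.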