Linux Troubleshooting – How to Determine if Linux Freezes are Caused by Hardware or Software

hardware, linux

A couple of weeks ago, my Linux server (Kubuntu 10.04) started giving me trouble.

It freezes after some uptime, anywhere from a couple of minutes to a few hours: the GUI is unresponsive, there is no reaction to mouse or keyboard (not even REISUB), top in an ssh session stops updating, and the session itself is dropped after a timeout:

Read from remote host 10.1.1.9: Operation timed out
Connection to 10.1.1.9 closed.

Back then I assumed a hardware issue, so I started replacing more and more hardware: graphics card, motherboard, CPU, RAM, hard drives, PSU. By now I have replaced the entire machine, and it still freezes.

I've checked /var/log/messages and some other logs; there is no clue in them at all. A hardware issue seems unlikely considering everything has been replaced, but it is still possible.

I've stripped the machine down to the bare minimum. I boot a Kubuntu live system from a USB stick, mount a couple of hard drives read-only, and start diffing folders on them. This seems to reproduce the freeze somewhat reliably. So far I haven't gotten beyond a few hours of uptime.

My server is down, and this has been going on for weeks now. I am at my wits' end and clutching at straws.

How can I reliably determine whether this is a hardware or a software issue?
How would you approach a problem like that?

Best Answer

Since you have replaced so much hardware, I presume you have already ruled out temperature problems.
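If temperature has not actually been ruled out, one way to check is to log readings until the next freeze and look at the last values afterwards. The following is only a rough sketch: it assumes Python 3 is available and that the kernel exposes thermal zones under /sys/class/thermal, which not every board or kernel does. The function names and the log path are made up for illustration.

```python
import glob
import os
import time


def read_temperatures(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone."""
    readings = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "temp")) as f:
                # The kernel reports the value in thousandths of a degree Celsius.
                millidegrees = int(f.read().strip())
        except (OSError, ValueError):
            continue  # zone has no readable temperature; skip it
        readings[os.path.basename(zone)] = millidegrees / 1000.0
    return readings


def log_temperatures(log_path, samples, interval=5.0, base="/sys/class/thermal"):
    """Append one timestamped line per sample and force each line to disk."""
    with open(log_path, "a") as log:
        for i in range(samples):
            temps = read_temperatures(base)
            line = " ".join(f"{name}={value:.1f}" for name, value in temps.items())
            log.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {line}\n")
            log.flush()
            os.fsync(log.fileno())  # so the last line survives a hard freeze
            if i < samples - 1:
                time.sleep(interval)
```

Something like log_temperatures("/mnt/usb/temps.log", samples=100000) would keep sampling every five seconds. The log has to live on persistent storage (a second USB stick, for instance), because the live system's own filesystem sits in RAM and is lost on reset.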

What if you try a completely different distro instead of Kubuntu 10.04? Download some other live distribution, for example openSUSE or even some BSD flavour, and see whether it reproduces the freeze as well. If the freeze goes away, a bug in Kubuntu 10.04 becomes the likely culprit; if it persists, you can rule that out.

How much data do you have under the directory trees you are diffing? And more importantly, is it only a couple of large files or a huge number of small files?
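If you don't know the numbers offhand, a small script can collect them. This is a minimal sketch, assuming Python 3; summarize_tree and the 100 MiB cutoff for a "large" file are arbitrary choices for illustration, not anything specific to your setup.

```python
import os


def summarize_tree(root, large_threshold=100 * 1024 * 1024):
    """Walk root and return the file count, total bytes, and number of large files."""
    count = total = large = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size  # lstat: do not follow symlinks
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            count += 1
            total += size
            if size >= large_threshold:
                large += 1
    return {"files": count, "bytes": total, "large_files": large}
```

Calling summarize_tree on each mounted tree gives a quick picture: millions of files with a small total size means the diff mostly stresses metadata lookups, while a handful of huge files means long sequential reads.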

When you replaced the hard drives, how did you copy the data from the old drives to the new ones? dd_rescue or some imaging program? Just plain old cp? If you used an imaging program or dd_rescue and the original filesystem contained some strange corruption, that corruption was copied along with it, and perhaps the diff hits the corrupted area and triggers the crash. Rare and unlikely, but certainly possible, just as it's possible to be struck by lightning.
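One way to test that theory is to read every file in turn and record each path before touching it, so that after a freeze the last logged line points at the file being read. If the freeze always lands on the same file, corruption in that area becomes much more plausible; if it lands somewhere different every time, it probably isn't the data. A rough sketch, assuming Python 3 (read_all_files and the paths are hypothetical):

```python
import os


def read_all_files(root, log_path, chunk_size=1024 * 1024):
    """Read every regular file under root, logging each path before reading it.

    Returns a list of (path, error message) for files that raised an I/O error.
    """
    errors = []
    # surrogateescape lets file names that are not valid UTF-8 be logged too.
    with open(log_path, "a", errors="surrogateescape") as log:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue  # skip symlinks, sockets, device nodes, etc.
                log.write(path + "\n")
                log.flush()
                os.fsync(log.fileno())  # make sure the path is on disk first
                try:
                    with open(path, "rb") as f:
                        while f.read(chunk_size):
                            pass  # discard the data; only the read matters
                except OSError as exc:
                    errors.append((path, str(exc)))
    return errors
```

The log file should sit on a different, persistent device from the drive under test, for example read_all_files("/mnt/disk1", "/mnt/usb/read.log"). Files that return I/O errors without freezing the machine end up in the returned list.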