Here is the deal:
I came to work only to find that one server wasn't responding at all. The machine was turned on, but the screen showed nothing at all and it didn't respond to keyboard input (I don't have SysRq keys enabled).
The server needs to be up and running as fast as possible, so I did a hard reset of the server and it's all working fine now.
Now my boss wants to know what happened and why.
So how do I start debugging what went wrong before the reboot? Which logs should I pay special attention to, and are there any neat tricks that you might know of for debugging a random server freeze? (It doesn't happen often; this is the first time that I've seen it.)
Thanks for any useful guidelines and suggestions.
PS: I'm running Ubuntu Server 12.04.
Best Answer
Since it's probably a hardware fault, I'd look at some hardware diagnostics.
If you have a hardware RAID controller, I'd find out if you can read its log (if 3Ware, use tw_cli). And, whether you have hardware or software RAID, you can look at the SMART parameters of the disks (if the disks are connected to a RAID controller, you may need special commands to access them; see the smartctl manpage).
If you do:
I always primarily look at:
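As a rough illustration of checking SMART attributes, the sketch below parses text in the format printed by `smartctl -A` and reports a few attributes whose raw value is non-zero. The attribute names in `WATCHED`, the function name `suspicious_attributes`, and the sample output are assumptions chosen for illustration; they are not necessarily the attributes this answer had in mind.

```python
# Assumption: a few attributes commonly associated with failing disks.
# This is an illustrative selection, not an authoritative list.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}


def suspicious_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes with a non-zero raw value."""
    flagged = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row and blank lines are skipped by this check.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


# Hypothetical sample in the column layout of `smartctl -A`.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       9876
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3
"""

if __name__ == "__main__":
    print(suspicious_attributes(SAMPLE))
```

On a real machine the input would come from running `smartctl -A` against a disk (for example via `subprocess.run`), with the device path depending on your setup. A non-zero raw value is a reason to look closer, not proof of failure.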
Also, keep an eye on dmesg and syslog to see if you get errors over time. For example, disk errors often show up as ata exceptions long before they become a fatal problem. We have a central logging server (using rsyslog) that notifies me about ata exceptions. A quick example on how to set that up:
/etc/rsyslog.d/60-smtp.conf:
/etc/rsyslog.d/70-mail-ata-errors:
See here for the ata-to-devicenames script.
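The kind of check behind such an alert can be sketched in a few lines. This is only a minimal illustration and not the rsyslog setup itself: it counts kernel log lines that look like ata exceptions, grouped by ata port. The regular expression, the function name `count_ata_exceptions`, and the sample log lines are assumptions; real kernel messages may vary between kernel versions.

```python
import re
from collections import Counter

# Assumption: kernel messages of the form "ata3.00: exception Emask 0x0 ...".
ATA_EXCEPTION = re.compile(r"\b(ata\d+(?:\.\d+)?): exception Emask")


def count_ata_exceptions(lines):
    """Count ata exception messages per ata port in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = ATA_EXCEPTION.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical syslog excerpt used only to demonstrate the function.
SAMPLE_LOG = [
    "Mar  3 02:14:07 srv1 kernel: [12345.678901] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
    "Mar  3 02:14:07 srv1 kernel: [12345.678950] ata3.00: failed command: READ DMA EXT",
    "Mar  3 02:20:11 srv1 kernel: [12709.000001] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
    "Mar  3 02:21:00 srv1 CRON[999]: (root) CMD (run-parts /etc/cron.hourly)",
]

if __name__ == "__main__":
    for port, count in count_ata_exceptions(SAMPLE_LOG).items():
        print(f"{port}: {count} exception(s)")
```

To run it against a real log, pass an open file object such as `open("/var/log/syslog")` instead of `SAMPLE_LOG`. A rising count for one port over days or weeks is the pattern worth acting on.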
Another thing you can do is a memtest. Ubuntu installation DVDs/CDs have one in the boot menu, and I believe any Ubuntu server has one in its regular boot menu as well. Let it make at least one pass, more if possible.
Do you have ECC RAM, BTW? ECC RAM is important for long-term stability and data integrity.