Linux – What steps should I take to determine the root cause of a Linux server failure

linux troubleshooting

I am sorry if this question has been addressed before. I assume it has, but after half an hour of searching I couldn't find anything.

Anyway, to the question:

I am a Windows guy and a self-taught programmer, so I am very new to Linux, but I am liking it more than Windows. We have a small WordPress installation that fails seemingly at random. When it does, I cannot SSH in, and my only real option is to do a hard reboot from the Rackspace Cloud admin. That has always fixed the problem.

I want to know what I should be doing to determine what actually caused the problem, though. This is a trivial example, but we are planning on putting more applications on Linux in the next year or so, and I want to get to the point where I am comfortable dealing with problems in a more scientific way than "unplug it and plug it back in."

Where should I get started? I am open to books, blog posts, Server Fault questions, videos, seminars, college classes, anything.

Thanks!

Best Answer

This is a general recipe; it applies beyond Linux as well.

Identifying problems, in order:

  1. remote login problems:
    1. network problems
    2. remote login daemon problems (sometimes it can take minutes to log in with ssh)
  2. load problems (uptime; df -h; free -m)
  3. read the logs (they are in /var/log/. System-wide logs are /var/log/messages or /var/log/syslog, depending on the distribution. In your case, you may also be interested in the Apache logs under /var/log/apache; on many distributions that directory is named /var/log/apache2 or /var/log/httpd)
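
Since the server cannot be reached over SSH once it has failed, the load checks from step 2 are most useful if they are already being recorded before the failure. The following is only a minimal sketch of that idea, not a complete monitoring setup: the function name `snapshot` and the output file mentioned afterwards are made up for illustration, and the script assumes a POSIX shell.

```sh
#!/bin/sh
# Print a timestamped snapshot of the three load checks named in step 2.
# "snapshot" is a hypothetical helper name, not a standard tool.
snapshot() {
    echo "=== snapshot at $(date '+%Y-%m-%d %H:%M:%S') ==="
    for cmd in "uptime" "df -h" "free -m"; do
        echo "--- $cmd ---"
        # ${cmd%% *} strips the arguments, leaving only the command name.
        if command -v "${cmd%% *}" >/dev/null 2>&1; then
            # $cmd is deliberately unquoted so "df -h" splits into "df" and "-h".
            $cmd 2>&1 || echo "(command failed)"
        else
            echo "(not available on this system)"
        fi
    done
}
```

One possible way to use it is to call `snapshot >> /var/tmp/health.log` (an example path, not a convention) at a regular interval, for instance from cron, so that the last entries written before a hang show whether load, disk space, or memory were running out.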

If you hard reboot your server, be sure to write down the time you did it, so that you can check the logs from just before that time.
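
As a rough illustration of checking the logs just before the reboot time, the sketch below filters a log file by the start of its timestamp. It assumes the traditional syslog line format, where each line begins with something like `Mar  3 14:21:07`; the helper name `log_window` and the date used in the example are hypothetical, and logs in other formats would need a different pattern.

```sh
#!/bin/sh
# Print the lines of a log file whose timestamp starts with the given prefix.
# "log_window" is a hypothetical helper name.
#   $1: timestamp prefix, e.g. "Mar  3 14:2" for 14:20-14:29 on March 3
#   $2: log file to search
log_window() {
    prefix=$1
    file=$2
    # "^" anchors the prefix to the start of the line, where syslog puts the time.
    grep -- "^$prefix" "$file"
}

# Example (assumed reboot time: March 3, about 14:30):
#   log_window "Mar  3 14:2" /var/log/syslog
# If the time was not written down, "last reboot" lists the boot times
# that the system recorded.
```

Traditional syslog pads single-digit days with a space, so the prefix needs two spaces between the month and the day in that case.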
