Troubleshooting mysterious server freezes on Amazon EC2

amazon-ec2 lamp

I have an Amazon EC2 instance running LAMP on Ubuntu Natty/11.04. On three separate occasions within the last few months, two of them in the last two weeks, the server has just… stopped. It stops responding to connection attempts (SSH or otherwise), but the EC2 control panel still reports it as running. Each time I had to reboot the instance through the console, with ensuing data loss.

So, now I'm trying to diagnose the issue, but I'm coming up blank and I need advice on what else to check for. Syslog contains nothing suspicious — on each occasion, the last thing that happened was munin running its regular five-minute cronjob, although since I don't know exactly when the machine stopped working, I can't say how close the cron log is to the point of freezing. After that, it's as if the machine was simply not running until the point where it was restarted, after which point syslog contains what looks to me like normal dmesg output.

There seems to be no correlation between traffic volume and the time of these freezes. Each occasion has been far removed from peak traffic times.

What else can I look at to attempt to figure out what has been causing these issues? What might the issue be?

ADDENDUM: The server was not under heavy load on any of the occasions when it went down. CPU and memory use were both well within limits. There was plenty of free disk space (tens of gigabytes). There is nothing strange in the Apache or MySQL logs either; they just stop at that time. This is a medium/high-CPU instance.

Best Answer

The first thing you should do is set up some monitoring to let you know when the server becomes unresponsive. You can do this by using Pingdom and/or CloudWatch to check service uptime and system stats like CPU and RAM. Both have free plans for small accounts. This will give you an idea of when the server goes down, which should make it easier to search the logs for what was going on at that moment. Usually something like this is caused by the system not having enough resources. You don't mention the size of your instance, but a micro instance, for example, can be pegged at 100% CPU by a simple cron job, at which point the server just locks up.
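If you want a rough timestamp before a hosted service is in place, a small external probe can do the same job. The following is a minimal sketch, not a replacement for Pingdom or CloudWatch: it assumes you run it from a different machine, that a TCP connect to some port (SSH or HTTP, say) is a good enough health signal, and the function names and the `uptime.log` file name are made up for illustration.

```python
import socket
import time
from datetime import datetime, timezone


def is_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def describe_change(was_up, is_up, when):
    """Return a log line when the state flips, or None if nothing changed."""
    if was_up == is_up:
        return None
    state = "UP" if is_up else "DOWN"
    return f"{when.isoformat()} {state}"


def watch(host, port, interval=60, logfile="uptime.log"):
    """Poll forever, appending one line per state change to logfile."""
    was_up = None  # unknown at start, so the first result is always logged
    while True:
        is_up = is_port_open(host, port)
        line = describe_change(was_up, is_up, datetime.now(timezone.utc))
        if line:
            with open(logfile, "a") as fh:
                fh.write(line + "\n")
        was_up = is_up
        time.sleep(interval)
```

Calling something like `watch("your-server.example.com", 22)` (a placeholder hostname) records the time of each transition to within one polling interval, which narrows the window of log entries worth reading.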

Also check the other logs besides syslog. Go through all of your application logs to see whether any of them report an error before the system goes down.
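Going through several logs by hand gets tedious, so a small filter can help. This is only a sketch under assumptions: the keyword list is a guess at what a struggling system might log, and the helper names are hypothetical. It looks at the tail of a file, so after a reboot you would point it at a copy truncated at the restart, or at the rotated log that covers the freeze.

```python
import re
from collections import deque

# Assumed keywords; adjust to whatever your applications actually log.
PATTERN = re.compile(
    r"\b(?:error|fatal|panic|oom|segfault)|out of memory",
    re.IGNORECASE,
)


def tail(path, count=200):
    """Return the last `count` lines of a text file."""
    with open(path, errors="replace") as fh:
        # deque with maxlen keeps only the most recent lines in memory.
        return list(deque(fh, maxlen=count))


def suspicious_lines(path, count=200):
    """Return lines from the tail of `path` that match PATTERN."""
    return [
        line.rstrip("\n")
        for line in tail(path, count)
        if PATTERN.search(line)
    ]
```

On a typical Ubuntu LAMP box you might try it on paths such as `/var/log/apache2/error.log` or `/var/log/mysql/error.log`; those are common defaults, not something confirmed for this server.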