Linux – How to find out why the EC2 instance was down for a while

I have an instance running at Amazon EC2. I just checked the monitoring, and I saw that the server was down for a while.

To be more precise, there is no data at all between 15:16:00 and 15:46:00 in the monitoring graphs of the EC2 console. Uptimerobot also confirms that my server was down.

Apparently my server was down for exactly 30 minutes. I have gone through the nginx logs and the system log, but I could not find anything out of the ordinary. Everything works just fine now.

Can I somehow find out what happened? It is really strange.

This is what happened to php-fpm.

[29-Dec-2011 23:27:34] NOTICE: fpm is running, pid 1131
[29-Dec-2011 23:27:34] NOTICE: ready to handle connections
[04-Jan-2012 15:48:07] NOTICE: fpm is running, pid 1169
[04-Jan-2012 15:48:07] NOTICE: ready to handle connections
[04-Jan-2012 15:51:22] NOTICE: fpm is running, pid 1167
[04-Jan-2012 15:51:22] NOTICE: ready to handle connections

Nginx log. There was no real activity during that period. The server is only used for a small website for now.

220.181.108.175 - - [04/Jan/2012:14:30:50 +0000] "GET / HTTP/1.1" 404 22 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.105 - - [04/Jan/2012:14:32:14 +0000] "GET / HTTP/1.1" 404 22 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
74.86.158.106 - - [04/Jan/2012:15:48:41 +0000] "GET / HTTP/1.1" 200 9208 "-" "Mozilla/5.0+(compatible; UptimeRobot/1.0; http://www.uptimerobot.com/)"
124.115.0.157 - - [04/Jan/2012:15:56:15 +0000] "GET / HTTP/1.1" 301 178 "-" "Sosospider+(+http://help.soso.com/webspider.htm)"
74.86.158.106 - - [04/Jan/2012:15:56:43 +0000] "GET / HTTP/1.1" 200 9208 "-" "Mozilla/5.0+(compatible; UptimeRobot/1.0; http://www.uptimerobot.com/)"
74.86.158.106 - - [04/Jan/2012:15:57:50 +0000] "GET / HTTP/1.1" 200 3836 "-" "Mozilla/5.0+(compatible; UptimeRobot/1.0; http://www.uptimerobot.com/)"
77.21.146.23 - - [04/Jan/2012:16:06:52 +0000] "GET /robots.txt HTTP/1.1" 301 178 "-" "findlinks/2.0.2 (+http://wortschatz.uni-leipzig.de/findlinks/)"
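One way to see when the access log falls silent is to list the gaps between consecutive entries. This is only a minimal Python sketch, assuming the nginx combined log format shown above; the function name `find_gaps` and the 30-minute threshold are hypothetical choices for illustration. On a low-traffic site a gap only bounds the outage and does not date it exactly.

```python
import re
from datetime import datetime, timedelta

# Matches the bracketed timestamp of an access log line,
# e.g. [04/Jan/2012:14:30:50 +0000]
TIMESTAMP = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def find_gaps(lines, min_gap=timedelta(minutes=30)):
    """Return (previous, current, delta) for consecutive entries
    whose timestamps are at least min_gap apart."""
    gaps = []
    previous = None
    for line in lines:
        match = TIMESTAMP.search(line)
        if not match:
            continue  # skip lines without a recognizable timestamp
        current = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        if previous is not None and current - previous >= min_gap:
            gaps.append((previous, current, current - previous))
        previous = current
    return gaps
```

Calling `find_gaps(open(path))` on the access log (the path depends on the nginx configuration) lists every silent stretch of 30 minutes or more.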

Further, the only ports that are open on the system are 80, 443, 12345 (ssh). I do not know where to find the actual ssh log, but I did a logwatch dump, and SSH showed nothing.

These are the monitoring graphs: (image not included)

@James Little

I have checked /var/log/btmp; the file was last changed on 1-1-2012 and is 0 bytes.

ifconfig shows 0 for everything, so I assume there are no errors and everything is OK. I don't really have the knowledge to work with ifconfig and ethtool as you suggested. I tried some Google searches but failed to find any solid methods that would give me useful information.

I think I will send an email to Amazon now, maybe they have some answers.

Best Answer

You don't say whether it actually rebooted. In case you did not check: use uptime to see how long it has been running since the last boot, or go through syslog or dmesg (from your php-fpm log I guess it did reboot). Since it was unavailable for some 30 minutes, it doesn't look like a planned upgrade (unless they decided to "update" all the datacenter instances at once ;) ).
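To turn the uptime into an actual boot timestamp that can be compared with the outage window, something like the following works. It is a minimal sketch, assuming a Linux system that exposes /proc/uptime; the helper names `read_uptime_seconds` and `boot_time` are made up for illustration.

```python
from datetime import datetime, timedelta


def read_uptime_seconds(path="/proc/uptime"):
    """Return the seconds since boot, read from the Linux /proc/uptime file."""
    with open(path) as f:
        # The first field is the uptime in seconds; the second is idle time.
        return float(f.read().split()[0])


def boot_time(uptime_seconds, now):
    """Return the moment the system booted, given its uptime and the current time."""
    return now - timedelta(seconds=uptime_seconds)
```

On the instance, `print(boot_time(read_uptime_seconds(), datetime.now()))` prints the boot time in the server's local time. If that falls around 15:46 to 15:48 on 4 January, the instance did reboot.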

If it was a reboot, it's either some failure inside your instance or a failure at Amazon; again, look at syslog/dmesg.

If it wasn't a reboot, it could also be some issue that affected just the monitoring.

Amazon has a status page for their datacenter issues, with history (somewhere on your EC2 dashboard). For planned reboots, EC2 keeps a history too (in the EC2 console it's just above "Instances", if I remember well).

Single-instance unavailability is a normal (I did not say common) issue, though. It's not feasible to prevent it entirely.
