Recovering a crashed EC2 instance that is EBS-backed

amazon-ec2 amazon-ebs

I had an EC2 instance that was EBS-backed (i.e. it boots off an EBS volume). The underlying hardware appears to have crashed, and I'm having trouble getting the instance back, which is frustrating: the whole point of an EBS-backed instance is that the disk image should be robust against the host hardware crashing.

First I tried to make a new AMI from that machine, but the new AMI was stuck in the pending state. Diving in with the command-line tools, I saw that the machine couldn't stop properly, so I ran

ec2-stop-instances --force

and then

ec2-detach-volume --force

But then I couldn't make an AMI from a detached volume. I tried launching a new instance, detaching the volume it came with, attaching the old EBS volume in its place, and booting it, but it failed to boot with

"State Transition Reason: Server.InternalError: Internal error on launch"

I'm assuming there has to be a way to get the drive back and running again — that's the point of EBS, right? But how?

Best Answer

I have had an instance crash on me a few times, most notably during AWS's 'little' EBS failure. Like you, I was unable to terminate the instances or detach the volume. I ended up creating a snapshot of the EBS volume (yes, it let me create a snapshot without detaching it), creating a volume from that snapshot, and attaching it as the root device on an instance.
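
For reference, that flow with the legacy ec2-api-tools looks roughly like the sketch below. The IDs, availability zone and device name are placeholders, and the option spellings are from memory of those tools, so double-check them against each command's --help before relying on them:

# snapshot the wedged volume (this works even while it is still attached)
ec2-create-snapshot vol-aaaaaaaa -d "rescue of crashed root volume"

# once the snapshot completes, create a fresh volume from it,
# in the same availability zone as the instance you will attach it to
ec2-create-volume --snapshot snap-bbbbbbbb -z us-east-1a

# attach it as the root device of a stopped instance whose own root
# volume has been detached (/dev/sda1 is the usual root device name)
ec2-attach-volume vol-cccccccc -i i-dddddddd -d /dev/sda1

You can watch each step complete with ec2-describe-snapshots and ec2-describe-volumes before moving on to the next.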

Keep in mind that while the physical drive may not have been damaged, a crash can still damage the file system or the data.

I have also had success attaching the volume as an ordinary non-boot volume, running a file system check (e.g. e2fsck), and using rsync, in a procedure akin to what you would use to migrate from an ephemeral/instance-store root to EBS (there is a rough sketch of the commands after the list):

  1. Copy the root directory (/) to the EBS device (rsync -aXHv).
  2. Optionally, rsync /dev as well, although I don't think it is needed.
  3. Flush writes and unmount.
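
As a rough sketch of those steps, assuming the damaged volume appears on the helper instance as /dev/xvdf (device names vary; older kernels use /dev/sdf) and the file system is ext3/ext4; the mount point and rsync excludes are my own additions:

# check/repair the file system on the attached (non-boot) volume
e2fsck -f /dev/xvdf

# mount it and copy the running root file system onto it,
# skipping pseudo-filesystems and the mount point itself
mkdir -p /mnt/ebs
mount /dev/xvdf /mnt/ebs
rsync -aXHv --exclude=/proc --exclude=/sys --exclude=/mnt / /mnt/ebs/

# optionally copy /dev as well
rsync -aXHv /dev/ /mnt/ebs/dev/

# flush writes and unmount
sync
umount /mnt/ebs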

The lesson I ended up taking home was to keep current backups even of EBS volumes - so I now run ec2-consistent-snapshot frequently on data volumes and (less frequently) on my root volume, and rotate old snapshots with ec2-prune-snapshots.
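
If it helps, the schedule I mean is something like the cron sketch below. The volume IDs are placeholders, the credential/region options both tools need are left out, and the exact flags differ between versions, so treat it as illustrative only:

# /etc/cron.d/ebs-snapshots (illustrative only)
0 * * * *   root  ec2-consistent-snapshot --description "data hourly"  vol-aaaaaaaa
15 3 * * *  root  ec2-consistent-snapshot --description "root nightly" vol-bbbbbbbb
# run ec2-prune-snapshots on a similar schedule to expire old snapshots;
# its retention options vary by version, so check its --help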

Hopefully some combination of the above (snapshot, check disk, rsync) can help you out.

(As an aside, the few other times I have seen this happen, a runaway process had consumed all the memory and the particular AMI I was using didn't have any swap space set up; the console log, available from the AWS console, is good for identifying that kind of problem.)
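
If you prefer the command line, the same log can be pulled with something like this (the instance ID is a placeholder):

# fetch the console log and look for OOM-killer messages
ec2-get-console-output i-dddddddd | grep -i -E "out of memory|oom"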