I have had an instance crash a few times on me, most notably when AWS had their 'little' EBS failure. Like you, I was unable to terminate the instances or detach the volume. I ended up creating a snapshot of the EBS volume (yes, it let me create a snapshot without detaching), creating a volume from that snapshot and attaching it as the root device on an instance.
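For reference, the same snapshot/restore dance can also be scripted. Here is a rough sketch using the modern AWS CLI rather than the console; the volume, snapshot and instance IDs, the availability zone and the device name are all placeholders:

```
# Snapshot the stuck volume (this works even while it is still attached)
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "rescue snapshot"

# Once the snapshot completes, create a new volume from it in the same AZ as the target instance
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone us-east-1a

# Attach the new volume as the root device of a stopped instance
aws ec2 attach-volume --volume-id vol-0fedcba9876543210 \
    --instance-id i-0123456789abcdef0 --device /dev/sda1
```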
Keep in mind that while the physical drive may not have been damaged, a crash can still damage the file system or the data.
I have also had success attaching the volume as an ordinary non-boot volume, running a file system check (e.g. e2fsck), and using rsync, in a procedure akin to the one you would use to migrate from ephemeral/instance-store to EBS (sketched after the list below):
- Copy the root (/) directory to the EBS device (rsync -aXHv)
- Optionally, rsync the devices (/dev) as well, although I don't think it is needed
- Flush writes and unmount
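A minimal sketch of that procedure, assuming the target EBS volume shows up as /dev/xvdf and is mounted at /mnt/ebsroot (both names are illustrative):

```
# Prepare and mount the target EBS volume (device and mount point are illustrative)
sudo mkfs -t ext4 /dev/xvdf
sudo mkdir -p /mnt/ebsroot
sudo mount /dev/xvdf /mnt/ebsroot

# Copy the root filesystem, preserving extended attributes and hard links,
# while skipping pseudo-filesystems and the target itself
sudo rsync -aXHv --exclude=/proc --exclude=/sys --exclude=/dev \
    --exclude=/mnt / /mnt/ebsroot

# Flush writes and unmount
sudo sync
sudo umount /mnt/ebsroot
```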
The lesson I took home was to keep current backups even of EBS volumes, so I now run ec2-consistent-snapshot frequently on data volumes and (less frequently) on my root volume, and rotate old snapshots with ec2-prune-snapshots.
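As an illustration only, the scheduling can be as simple as a couple of crontab entries; the paths, volume IDs and frequencies below are made up for the example, and pruning can be scheduled the same way (check the tool's own help for its retention options):

```
# Snapshot the data volume every four hours
0 */4 * * * /usr/local/bin/ec2-consistent-snapshot --description "data volume" vol-11111111

# Snapshot the root volume once a day
30 2 * * * /usr/local/bin/ec2-consistent-snapshot --description "root volume" vol-22222222
```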
Hopefully some combination of the above (snapshot, check disk, rsync) can help you out.
(As an aside, the few other times I have seen this happen, some process had consumed all the memory, and the particular AMI I was using didn't have any swap space set up; the console log, available from the AWS console, is good for identifying that kind of problem.)
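If your AMI also lacks swap, a small swap file is a cheap safety net; this is a generic sketch, with the size and path chosen arbitrarily:

```
# Create and enable a 1 GiB swap file
sudo dd if=/dev/zero of=/var/swapfile bs=1M count=1024
sudo chmod 600 /var/swapfile
sudo mkswap /var/swapfile
sudo swapon /var/swapfile

# To make it permanent, add this line to /etc/fstab:
# /var/swapfile  none  swap  sw  0  0
```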
Ephemeral drives should be considered extremely volatile, and absolutely nothing that you need to retain should be kept on them. They're a good place for unimportant logs and things like tempdb. If one of the ephemeral volumes fails, however, there will be no way to restore it.
Ephemeral volumes are hardware volumes on the instance's physical host. If the drive dies your best hope is that someone at Amazon is able to replace the physical disk without the host going down.
As for EBS volumes: they can be attached to the instance easily, but you would need to do this yourself or automate the process; there is nothing built into the system that handles it automatically.
You could set up a cron job on the instance that checks the RAID health and, in the event of a failure, talks to the EC2 API to add another volume, roughly along the lines of the sketch below.
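This is only a sketch of the idea, using the modern AWS CLI rather than whatever tooling you happen to have installed; the array name, volume size, availability zone, instance ID and device name are all placeholders:

```
#!/bin/bash
# If the md array looks degraded (an underscore in the [UU...] status),
# create and attach a replacement EBS volume.
if grep -q '\[.*_.*\]' /proc/mdstat; then
    NEW_VOL=$(aws ec2 create-volume --size 100 --availability-zone us-east-1a \
              --query 'VolumeId' --output text)
    aws ec2 wait volume-available --volume-ids "$NEW_VOL"
    aws ec2 attach-volume --volume-id "$NEW_VOL" \
        --instance-id i-0123456789abcdef0 --device /dev/sdf
    # The new device still has to be added back into the array, either by hand
    # or in a follow-up step, e.g.: mdadm /dev/md0 --add /dev/xvdf
fi
```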
Best Answer
Unsatisfied that my question had been understood properly, I ran the experiment myself. The outcome is that...
Yes, on stop/start everything under /mnt is lost, and you can't mount the drive without recreating the mount point. That's what I expected, but...
If you add an entry to /etc/fstab, it doesn't matter that the mount point doesn't exist; it will be created and the drive mounted.
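For the record, this is the kind of entry I mean; the device name and options are illustrative and depend on what your ephemeral drive shows up as:

```
# /etc/fstab
/dev/xvdb   /mnt/test   auto   defaults,nofail   0   2
```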
sudo mount /dev/xvdf /mnt/test
- Fine.

sudo mount /dev/xvdf /mnt/test
- Fine.

sudo mount /dev/xvdf /mnt/test
- Error: mount point /mnt/test does not exist

I haven't tested how deep this autocreation goes. If I mount at /mnt/a/b/c, would it still work?