Is it safe to snapshot an mdadm RAID with only xfs_freeze?

Tags: amazon-ec2, amazon-ebs, mdadm, mongodb, raid

Is mdadm guaranteed (and trusted via experience) to be safe for taking snapshots with only an xfs_freeze? I have encountered vague warnings about mdadm still working in the background, thus making snapshots unsafe without disassembling the RAID first, but I'd rather avoid having to go through the disassembly/reassembly if possible.

The snapshots are block-level via EBS, but I am less concerned about the snapshotting process itself than about the data on disk being consistent at the moment the snapshot is taken.

Options besides mdadm are welcome if they make the process more reliable — we used to use LVM striping, but switched to mdadm specifically because of reports of cross-disk snapshots not being reliable. We are also looking into some combination, such as using mdadm for striping and an LVM layer for snapshotting.

The ideal solution would be one that avoids having to stop the services running off of the RAID (in this case, Mongo) and would be in the original data format so that a new server could attach an array of the restored snapshots and not require additional steps to massage the data into place. (We already have code that can reassemble snapshots into a new server — we just need to reliably create those snapshots.)

Best Answer

Even though this question is rather old, I want to give a short answer on whether snapshotting an EBS RAID is safe. We run a RAID 0 on PIOPS EBS volumes and take regular backups of it with the following procedure:

  1. Stop the service (a database in our case)
  2. Freeze the filesystem with fsfreeze on the RAID's mount point (we use ext4, but this should work with any filesystem that supports freezing)
  3. Call the EC2 API to snapshot every device that is part of the RAID
  4. Wait for the API call to return for each snapshot (you don't need to wait until the snapshot has completed; the response from the API is enough)
  5. Unfreeze the filesystem
  6. Start the service

The whole procedure takes around 1-2 minutes in our case.
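
The steps above can be sketched roughly as follows. This is only an illustration, not our actual script: it assumes the service is managed by systemd and that the `aws` CLI is installed and configured, and the mount point, service name, and volume IDs are hypothetical placeholders. The nested `try`/`finally` blocks are there so that the filesystem is unfrozen and the service restarted even if a snapshot call fails.

```python
import subprocess
from typing import Callable, List, Sequence


def run(cmd: Sequence[str]) -> str:
    """Run a command, raise on a non-zero exit code, and return its stdout."""
    result = subprocess.run(list(cmd), check=True, capture_output=True, text=True)
    return result.stdout


def snapshot_raid(
    mountpoint: str,
    volume_ids: Sequence[str],
    service: str,
    runner: Callable[[Sequence[str]], str] = run,
) -> List[str]:
    """Stop the service, freeze the filesystem, snapshot each member volume,
    then unfreeze and restart. Returns the raw output of each snapshot call."""
    outputs: List[str] = []
    # Step 1: stop the service (assumes a systemd unit).
    runner(["systemctl", "stop", service])
    try:
        # Step 2: freeze the filesystem mounted on top of the RAID.
        runner(["fsfreeze", "--freeze", mountpoint])
        try:
            # Steps 3 and 4: start one snapshot per member volume and only
            # wait for the API call to return, not for snapshot completion.
            for volume_id in volume_ids:
                outputs.append(
                    runner([
                        "aws", "ec2", "create-snapshot",
                        "--volume-id", volume_id,
                        "--description", f"RAID member {volume_id}",
                    ])
                )
        finally:
            # Step 5: always unfreeze, even if a snapshot call failed.
            runner(["fsfreeze", "--unfreeze", mountpoint])
    finally:
        # Step 6: always bring the service back up.
        runner(["systemctl", "start", service])
    return outputs
```

A call might look like `snapshot_raid("/data", ["vol-aaaa", "vol-bbbb"], "mongod")`, where all three arguments are made-up examples. Passing a different `runner` makes it possible to do a dry run that only records the commands instead of executing them.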

We have changed our instances and systems very often recently, and we always used these snapshots to copy the data to the new instances (rebuilding the RAID there) in order to reduce the sync time between the replicas. We have never had any issues with corrupt data; the snapshotting just works fine!

Hope this helps someone who is searching for an answer.