Linux – Full data backup to Amazon S3

backup, elasticsearch, linux, mongodb

I have an Ubuntu server hosted on DigitalOcean that has outgrown its existing backup solution.

The relevant parts of the stack I use are Node.js, MongoDB, and Elasticsearch.

So far backups have been done by dumping the entire MongoDB database, saving the Elasticsearch configuration, and copying all other files (logs, the code itself, etc.) in the application directory. If it's the first of the month all user files are copied as well; otherwise only files changed since the first of the month are added. All of this is then zipped into one file and uploaded to Amazon S3.
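
In outline, the current process looks roughly like this (this is a simplified sketch for illustration, not the real script; the paths, the Elasticsearch config location, and the aws CLI upload are placeholders):

# Sketch of the current monthly-full / incremental-since-the-1st flow.
# All paths and the bucket name are placeholders.
STAMP=$(date +%Y-%m-%d)
WORK="/tmp/backup-${STAMP}"
mkdir -p "${WORK}"

# Dump the whole MongoDB database and save the Elasticsearch configuration.
mongodump --out "${WORK}/mongo"
cp -r /etc/elasticsearch "${WORK}/es-config"

# Copy logs, code, etc. from the application directory (user files handled below).
rsync -a --exclude 'userfiles' /var/appdir/ "${WORK}/appdir/"

# User files: full copy on the 1st, otherwise only files changed since the 1st.
if [ "$(date +%d)" = "01" ]; then
    cp -a /var/appdir/userfiles "${WORK}/userfiles"
else
    mkdir -p "${WORK}/userfiles"
    find /var/appdir/userfiles -type f -newermt "$(date +%Y-%m-01)" \
        -exec cp --parents {} "${WORK}/userfiles/" \;
fi

# Everything goes into a single archive, uploaded to S3 in one shot.
zip -qr "/tmp/backup-${STAMP}.zip" "${WORK}"
aws s3 cp "/tmp/backup-${STAMP}.zip" "s3://bucket/backups/"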

The data size has reached the point where this process takes too much disk space and the file can't be uploaded to S3 in one shot.

What is the next level for an application of this size (8 GB of user files, 125,000 users, 3,000 other docs, all searchable in ES)?

I understand opinion-based questions are not OK on Server Fault. I'm not asking for opinions, just what the normal, cost-effective solution is for an application of this size.

UPDATE: These are the relevant parts of the script and configuration I'm attempting to use with Duplicity. I'm using Node to manage the backup because it fits with my existing logging solution, is already scheduled to run alongside everything else during a low-activity window, and is portable between OSes.

The Node script (logging, of course, still needs improvement):

// exec() from child_process runs mongodump and duply
var exec = require("child_process").exec;

// Walks a directory recursively and returns a flat list of files (body omitted)
function walkDir() {}

// Node-based rm -rf (body omitted)
function rmrf() {}

// log() comes from the existing logging setup (omitted here)

exec("mongodump --out dump", { cwd: process.cwd() }, function(err) {
    if (err) return log("Error backing up: couldn't dump MongoDB!");

    exec("sudo duply ats backup", function(err) {
        if (err) log("Error running Duplicity");
        else rmrf("dump");

        log("Exiting.");

        process.exit();
    });
});

Duply profile config:

GPG_PW='GPG password'

TARGET='s3://s3-us-east-1.amazonaws.com/bucket'

TARGET_USER='Known working AWS credentials'
TARGET_PASS='AWS secret key'

SOURCE='/var/appdir'

MAX_AGE=6M

DUPL_PARAMS="$DUPL_PARAMS --exclude '/var/appdir/elasticsearch/data/**' "

I've tried --s3-use-new-style, using s3+http://, and setting S3_USE_SIGV4 but have had no luck.
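
For reference, this is roughly how I understand those options would be wired into the duply conf, since the conf file is just sourced by bash (a sketch rather than my exact file; the bucket name is a placeholder):

# Hypothetical variant of ~/.duply/ats/conf using the options mentioned above.
# The conf is sourced, so exported variables reach duplicity/boto.
export S3_USE_SIGV4="True"            # request AWS Signature Version 4

# Bucket-in-path addressing with the s3+http:// scheme, letting boto pick the endpoint.
TARGET='s3+http://bucket'
TARGET_USER='AWS access key ID'
TARGET_PASS='AWS secret access key'

DUPL_PARAMS="$DUPL_PARAMS --s3-use-new-style "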

This is the log I'm getting from Duplicity:

Start duply v1.5.10, time is 2015-07-05 09:30:13.
Using profile '/root/.duply/ats'.
Using installed duplicity version 0.6.23, python 2.7.6, gpg 1.4.16 (Home: ~/.gnupg), awk 'GNU Awk 4.0.1', bash '4.3.11(1)-release (x86_64-pc-linux-gnu)'.
Signing disabled. Not GPG_KEY entries in config.
Test - Encryption with passphrase (OK)
Test - Decryption with passphrase (OK)
Test - Compare (OK)
Cleanup - Delete '/tmp/duply.25562.1436103014_*'(OK)

--- Start running command PRE at 09:30:14.155 ---
Skipping n/a script '/root/.duply/ats/pre'.
--- Finished state OK at 09:30:14.183 - Runtime 00:00:00.027 ---

--- Start running command BKP at 09:30:14.208 ---
Reading globbing filelist /root/.duply/ats/exclude
BackendException: No connection to backend
09:31:27.427 Task 'BKP' failed with exit code '23'.
--- Finished state FAILED 'code 23' at 09:31:27.427 - Runtime 00:01:13.218 ---

--- Start running command POST at 09:31:27.465 ---
Skipping n/a script '/root/.duply/ats/post'.
--- Finished state OK at 09:31:27.491 - Runtime 00:00:00.026 ---

Best Answer

I have good experience backing up with duplicity. If you are able to take a snapshot and mount it read-only, it's a very good way to get consistent incremental backups.

The usual problem with backing up databases (MongoDB, Elasticsearch, MySQL, you name it) is consistency. The same applies to backing up ordinary files, but with databases the risk of data corruption is probably the highest.

You have a few options (hopefully others will add more):

  1. Dump the database and back up the dump. This is the simplest, safest, and most straightforward approach.

  2. Stop the database (or use another method to make the on-disk data consistent) and do the backup. (This causes long downtime and isn't always possible.)

  3. Stop the database (as in #2), take a snapshot (of the volume or filesystem; make sure the filesystem is consistent at that point), start the database, mount the snapshot read-only and back that up. Not all setups are suitable for this; there is a sketch of this approach right after this list.

  4. Stop the database (as in #2), take a snapshot (this time it only works for volumes; make sure the filesystem is consistent at that point), start the database, and back up the snapshot as a block device. This can increase the size of the backup, and again it may not be possible on all configurations.

  5. Back up the live database files and hope it works when you restore. (You're playing with fire here.) If at all possible, stay away from this.

  6. If your technology has a special way of backing up, use that (for example, Elasticsearch can snapshot directly to S3; see the example further below).
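
For options 3 and 4, here is a minimal sketch of the snapshot route, assuming the data sits on an LVM logical volume (the volume group, snapshot size, mount point and bucket below are made-up placeholders, not taken from the setup in the question):

#!/bin/bash
# Option 3 sketch: brief stop, LVM snapshot, restart, back up the read-only mount.
# /dev/vg0/mongo_data, the snapshot size and the bucket are placeholders.

service mongod stop                      # or quiesce writes with db.fsyncLock()

# Copy-on-write snapshot of the data volume; the database is only down for this.
lvcreate --size 5G --snapshot --name mongo_snap /dev/vg0/mongo_data

service mongod start

# Mount the snapshot read-only and let duplicity take a consistent incremental.
mkdir -p /mnt/mongo_snap
mount -o ro /dev/vg0/mongo_snap /mnt/mongo_snap
duplicity --full-if-older-than 14D /mnt/mongo_snap "s3+http://mybackups/mongo-snap"

# Clean up the snapshot.
umount /mnt/mongo_snap
lvremove -f /dev/vg0/mongo_snap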

Whichever way you choose, keep in mind that you definitely should test that you are able to restore from the backup, several times and from several different backups.
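
Since Elasticsearch is part of this stack, option 6 applies directly: ES can snapshot straight to an S3 repository (this assumes the AWS cloud plugin for your ES version is installed; the repository name, bucket and base_path are placeholders):

# Register an S3 snapshot repository (needs the AWS cloud plugin).
curl -XPUT 'http://localhost:9200/_snapshot/s3_backup' -d '{
  "type": "s3",
  "settings": {
    "bucket": "mybackups",
    "region": "us-east-1",
    "base_path": "elasticsearch"
  }
}'

# Take a named snapshot; only segments changed since the last snapshot are uploaded.
curl -XPUT 'http://localhost:9200/_snapshot/s3_backup/snapshot_1?wait_for_completion=true'

Finally, the script below is what option 1 looks like in practice: it dumps MongoDB with --oplog, has duplicity push incrementals to S3 (forcing a new full backup every 14 days), and prunes old backup sets afterwards.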

#!/bin/bash
BACKUP_BASE="/data/backups/"
DIRNAME="mongo"
BUCKET="mybackups"
ARCHIVE_DIR="/data/backups_duplicity_archives/${DIRNAME}"
VERBOSE="-v 4"
S3_PARAMS="--s3-use-new-style" # --s3-use-multiprocessing" # --s3-use-rrs"
export PASSPHRASE="something"
export AWS_ACCESS_KEY_ID="AN_ID"
export AWS_SECRET_ACCESS_KEY="A_KEY"

# Throw away the previous dump and take a fresh one; --oplog keeps the dump
# consistent even while the database is being written to.
cd ${BACKUP_BASE}
rm -rf ${BACKUP_BASE}/${DIRNAME}
/usr/bin/mongodump -h 10.0.0.1 -o ${BACKUP_BASE}/${DIRNAME}/databasename --oplog

# Incremental backup of the dump directory to S3, forcing a full backup every 14 days.
/usr/bin/duplicity $S3_PARAMS --asynchronous-upload ${VERBOSE} --archive-dir=${ARCHIVE_DIR} incr --full-if-older-than 14D ${BACKUP_BASE}/${DIRNAME} "s3+http://${BUCKET}/${DIRNAME}"

# Prune old backup sets, but only if the backup itself succeeded.
if [ $? -eq 0 ]; then
        /usr/bin/duplicity $S3_PARAMS ${VERBOSE} --archive-dir=${ARCHIVE_DIR} remove-all-but-n-full 12 --force "s3+http://${BUCKET}/${DIRNAME}"
        /usr/bin/duplicity $S3_PARAMS ${VERBOSE} --archive-dir=${ARCHIVE_DIR} remove-all-inc-of-but-n-full 4 --force "s3+http://${BUCKET}/${DIRNAME}"
fi
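
And since a backup you have never restored is not really a backup, here is a minimal sketch of a restore drill to pair with the script above (the bucket, paths and scratch mongod address are placeholders; match them to your backup script):

#!/bin/bash
# Restore drill: fetch the latest backup set and load it into a scratch MongoDB.
export PASSPHRASE="something"
export AWS_ACCESS_KEY_ID="AN_ID"
export AWS_SECRET_ACCESS_KEY="A_KEY"

RESTORE_DIR="/data/restore_test/mongo"
rm -rf "${RESTORE_DIR}"

# Pull the most recent backup from S3 into an empty local directory.
/usr/bin/duplicity restore "s3+http://mybackups/mongo" "${RESTORE_DIR}"

# Load the dump into a throwaway mongod and replay the oplog captured by --oplog.
/usr/bin/mongorestore -h 127.0.0.1:27018 --oplogReplay "${RESTORE_DIR}/databasename"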