Quickest/best way to copy a portion of a large mongo database to another server

mongodb

I have a dataset of 100m tweets stored in Mongo, unoptimized and unindexed.

I need to copy all tweets from the last month onto another server. What is the best way to do this?

My idea was to use a Ruby script to extract and copy the relevant tweets to a new database on the server, then run the mongo copyDatabase command to copy it over. It's taking horrendously long though - is there any other way to do it?

require 'mongo_mapper'
require 'active_support/time' # for 1.month.ago

MongoMapper.database = 'twitter'
require './models'

# pull all tweets from the last month - about 15 million documents
tweets = TwitterTweet.where(:created_at => {"$gt" => 1.month.ago}).all

# point MongoMapper at the new database and copy the tweets over
MongoMapper.database = 'monthly'
tweets.each do |tweet|
  tweet.save!
end

Best Answer

First, you mention that it is unindexed/unoptimized - is there at least an index on created_at? If not, then you are doing a massively inefficient query (a table scan) and that is not going to be terribly fast.
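
If you do add one, a minimal sketch from the mongo shell (the collection name tweets is an assumption - use whatever your collection is actually called):

use twitter
db.tweets.ensureIndex({created_at: 1})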

In general, probably the easiest way to do this is to have the existing server be the primary of a replica set and then add the new server as a secondary (see Replica Set Fundamental Concepts). When you add a new secondary to a replica set, it syncs from the primary, cloning all existing data and then applying any subsequent changes via the oplog. Once you are happy that you have all the data you need (and assuming you don't want to keep the replica set), just restart the mongods without the --replSet argument (usually on a different port as well) and you will have a complete copy on the new host.
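
As a rough sketch of that approach (the replica set name copyset and the host/port newhost:27017 are assumptions):

# restart the existing mongod with a replica set name
mongod --dbpath /data/db --replSet copyset

# start a mongod on the new server with the same --replSet copyset,
# then, in the mongo shell connected to the existing server:
rs.initiate()
rs.add("newhost:27017")

The new secondary will then perform its initial sync, cloning the data from the primary.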

A more manual approach would be to shut down the current mongod (or run db.fsyncLock() to guarantee no writes) and then copy the database files over to the new host manually (see the sketch after the listing below) - they will be in your dbpath and will look like this:

<databasename>.ns
<databasename>.0
<databasename>.1
<databasename>.2
etc.
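
For example, a rough outline of the lock-and-copy variant - the dbpath (/data/db, the default) and the hostname newhost are assumptions:

// in the mongo shell on the source server: flush to disk and block writes
db.fsyncLock()

# from a normal shell: copy the files for the database to the new host
scp /data/db/twitter.* newhost:/data/db/

// back in the mongo shell: unblock writes
db.fsyncUnlock()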

They contain all the information another mongod will need, so once you copy them to the new host and start the MongoDB instance there, you should be able to simply use <databasename> and be good to go.
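
On the new host, that would look something like this (paths and names are again assumptions):

mongod --dbpath /data/db

// then, in the mongo shell on the new host:
use twitter
db.tweets.count()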

In each case, for any unused/unwanted portions, just drop them once you are up and running on the new host and then run a repair if you want to reclaim space on disk.
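
That clean-up might look like this in the mongo shell (the collection name is just an example):

use twitter
// drop any collections you no longer need
db.old_tweets.drop()
// then reclaim the freed space on disk
db.repairDatabase()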

Finally, if you really do want to take just a portion of the records, then you could mongodump the relevant collection with a query filter and then use mongorestore to import it into the new host. I don't think this will be massively faster than the Ruby you propose above though, especially if you are lacking that created_at index.
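
A sketch of that approach - the collection name, target host, and cutoff timestamp (epoch milliseconds in extended JSON) are all assumptions:

# dump only the matching documents from the source
mongodump --db twitter --collection tweets \
  --query '{"created_at": {"$gt": {"$date": 1356998400000}}}' \
  --out /backup

# restore them into a new database on the new host
mongorestore --host newhost --db monthly /backup/twitter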
