Replica Set Internals Part V: Initial Sync

I’ve been doing replica set “bootcamps” for new hires. They’re mainly focused on applying this material to debug replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:

  1. Elections
  2. Creating a set
  3. Reconfiguring
  4. Syncing
  5. Initial Sync
  6. Rollback
  7. Authentication
  8. Debugging

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

The Initial Sync Process

When you add a brand new member to the set, it needs to copy over all of the data before it’s ready to be a “real” member. This process is called initial syncing and there are, essentially, 7 steps:

  1. Check the oplog. If it is not empty, this node does not initial sync; it just starts syncing normally. If the oplog is empty, initial sync is necessary; continue to step #2.
  2. Get the latest oplog time from the source member: call this time start.
  3. Clone all of the data from the source member to the destination member.
  4. Build indexes on destination.
  5. Get the latest oplog time from the sync target: call this time minValid.
  6. Apply the sync target’s oplog from start to minValid.
  7. Become a “normal” member (transition into secondary state).
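The steps above can be sketched as a toy JavaScript function. Every name here is invented for illustration; this is not MongoDB’s actual internals, and oplog “times” are faked with array lengths:

```javascript
// Toy model of initial sync: "source" and "dest" are plain objects
// standing in for servers, and an oplog is an array of functions.
function initialSync(source, dest) {
  // Step 1: a non-empty oplog means this node can sync normally.
  if (dest.oplog.length > 0) {
    return "normal sync";
  }
  // Step 2: note where the source's oplog ends right now ("start").
  var start = source.oplog.length;
  // Step 3: clone all of the data.
  dest.docs = source.docs.slice();
  // Step 4: build indexes on the destination (not modeled here).
  // Step 5: on a live system the source kept writing during the
  // clone, so its oplog has grown; the new end is "minValid".
  var minValid = source.oplog.length;
  // Step 6: replay the ops that happened during the clone.
  source.oplog.slice(start, minValid).forEach(function (op) {
    op(dest);
  });
  // Step 7: become a normal secondary.
  dest.state = "SECONDARY";
  return "initial sync done";
}
```

In the real process, of course, these steps run against live servers and minValid is an oplog timestamp, not an array length.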

Note that this process only checks the oplog. You could have a petabyte of data on the server, but if there’s no oplog, the new member will delete all of it and initial sync. Members depend on the oplog to know where to sync from.

So, suppose we’re starting up a new member with no data on it. MongoDB checks the oplog, sees that it doesn’t even exist, and begins the initial sync process.

Copying Data

The code for this couldn’t be much simpler; in pseudocode, it is basically:

for each db on sourceServer:
    for each collection in db:
        for each doc in db.collection.find():
            destinationServer.getDB(db).getCollection(collection).insert(doc)

One of the issues with syncing is that it has to touch all of the source’s data, so if you’ve been carefully cultivating a working set on sourceServer, it’ll pretty much be destroyed.

There are benefits to initial syncing, though: it effectively compacts your data on the new secondary. Since all it’s doing is inserts, it’ll use pretty much the minimum amount of space. Some users actually use rotating resyncs to keep their data compact.

On the downside, initial sync doesn’t consider padding factor, so if that’s important to your application, the new server will have to build up the right padding factor over time.

Syncing on a Live System

The tricky part of initial syncing is that we’re trying to copy an (often) massive amount of data off of a live system. New data will be written while this copy is taking place, so it’s a bit like trying to copy a tree over six months.

By the time you’re done, the data you copied first might have changed significantly on the source. The copy on the destination might not be… exactly what you’d expect.

That is what the oplog replay step is for: to get your data to a consistent state. Oplog ops are idempotent: they can be applied multiple times and yield the same result. Thus, as long as we apply all of the writes at least once (remember, they may or may not have been applied on the source before the copy), we’ll end up with a consistent picture when we’re done.


minValid (as mentioned in the list above) is the first timestamp where our new DB is in a consistent state: it may be behind the other members, but its data matches exactly how the other servers looked at some point in time.

Some examples of idempotency, as most people haven’t seen it since college:

// three idempotent functions:
function idemp1(doc) {
   doc.x = doc.x + 0;
}
 
function idemp2(doc) {
   doc.x = doc.x * 1;
}
 
function idemp3(doc) {
   // this is what replication does: it turns stuff like "$inc 4 by 1" into "$set to 5"
   doc.x = 5;
}
 
// two non-idempotent functions
function nonIdemp1(doc) {
   doc.x = doc.x + 1;
}
 
function nonIdemp2(doc) {
   doc.x = Math.random();
}

No matter how many times you call the idempotent functions, the value of doc.x will be the same (as long as you call them at least once).
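To tie this back to replication: here is a toy oplog applier (the op format below is invented, not MongoDB’s real oplog format) showing that a stream of $set-style ops gives the same document whether you apply it once or twice:

```javascript
// Each "op" is a plain {field: value} object, applied like a $set:
// overwrite the field with the value. That makes every op idempotent.
function applyOps(doc, ops) {
  ops.forEach(function (op) {
    for (var field in op) {
      doc[field] = op[field];
    }
  });
  return doc;
}

// "$inc x by 4" and friends have already been rewritten as $sets:
var ops = [{ x: 5 }, { y: 7 }, { x: 9 }];

var once = applyOps({ x: 1 }, ops);
var twice = applyOps(applyOps({ x: 1 }, ops), ops);
// Both end up as {x: 9, y: 7}.
```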

Building Indexes

In 2.0, indexes were created on the secondary as part of the cloning step, but in 2.2, we moved index creation to after the oplog application. This is because of an interesting edge case. Let’s say we have a collection representing the tree above and we have a unique index on leaf height: no two leaves are at exactly the same height. So, pretend we have a document that looks like this:

{
    "_id" : 123,
    ...,
    "height" : 76.3
}

The cloner copies this doc from the source server to the destination server and moves on. On the source, we remove this leaf from the tree because of, I dunno, high winds.

> db.tree.remove({_id:123})

However, the cloner has already copied the leaf, so it doesn’t notice this change. Now another leaf might grow at this height: say leaf #10012 grows to this height on the source.

> db.tree.update({_id:10012}, {$set : {height : 76.3}})

When the cloner gets to document #10012, it’ll copy it to the destination server. At that point there are two documents with the same height field in the destination collection, so when it tries to create a unique index on “height”, the index creation will fail!

So, we moved the index creation to after the oplog application. That way, we’re always building the index on a consistent data set, so it should always succeed.
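Here’s that edge case as a toy simulation, with plain arrays standing in for collections (all names invented):

```javascript
// The clone copies docs one at a time while the source takes writes.
var source = [
  { _id: 123, height: 76.3 },
  { _id: 10012, height: 12.1 }
];
var dest = [];

// Cloner copies doc #123 and moves on...
dest.push(Object.assign({}, source[0]));

// ...meanwhile, on the source: remove #123, then #10012 grows to 76.3.
source.splice(0, 1);
source[0].height = 76.3;

// Cloner reaches #10012 and copies it.
dest.push(Object.assign({}, source[0]));

// dest now holds two docs with height 76.3, so building a unique
// index on "height" at this point would fail. Replaying the oplog
// first (the remove and the $set) removes the duplicate before the
// index build ever runs.
```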

There are a couple of other edge cases like that which have been super-fun to track down, which you can look up in Jira if you’re interested.

Restoring from Backup

Often, initial sync is too slow for people. If you want to get a secondary up and running as fast as possible, the best way to do so is to skip initial sync altogether and restore from backup. To restore from backup:

  1. Find a secondary you like the looks of.
  2. Either shut it down or fsync+lock it. Basically, get its data files into a clean state, so nothing is writing to them while you’re copying them.
  3. Copy the data files to your destination server. Make sure you get all of your data if you’re using any symlinks or anything.
  4. Start back up or unlock the source server.
  5. Point the destination server at the data you copied and start it up.
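In 2.x mongo-shell terms, the fsync+lock in step 2 looks roughly like this (run it on the member you’re copying from; the exact helpers vary a bit by version, so double-check against your docs):

```javascript
// On the source member: flush to disk and block writes.
db.runCommand({ fsync : 1, lock : 1 })

// ...copy the data files at the filesystem level...

// Step 4: unlock the source so it can take writes again.
db.fsyncUnlock()
```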

As there is already an oplog, it will not need to initial sync. It will begin syncing from another member’s oplog immediately when it starts up and usually catch up quite quickly.

Note that mongodump/mongorestore actually does not work very well as a “restoring from backup” strategy because it doesn’t give you an oplog. You can create one on your own and prime it, but it’s more work and more fiddly than just copying files. There is a feature request for mongorestore to be able to prime the oplog automatically, but it won’t be in 2.2.

P.S. Trees were done with an awesome program I recently discovered called ArtRage, which I highly recommend to anyone who likes painting/drawing. It “feels” like real paint.

  • Aleksey Sivokon

    Thanks, very helpful! One question, though. Say we have a huge database, so it would take 8 hours just to transfer the data from one node to another. Next, due to heavy writes, the oplog length is only 2 hours. What happens when I add a clean secondary? By the time it finishes cloning the data, the oplog would already be outdated. Or does it apply the oplog as it goes, while cloning?

  • kristina1

    That means you need a bigger oplog. 2 hours is way too short; many people have a week or so, or at least a couple of days. I’d say 3*(initial sync time) is the bare minimum.

  • Dharshan

    When I copy over the database files from a backup, do I need to delete the local.* files? Doesn’t the local db store the replica set configuration – does this mess up the server (especially if the replica config has changed since the time the backup was taken)?

    If we use the backup to seed the server, index rebuilding should be much faster, right?

  • kristina1

    Don’t delete the local files. MongoDB will automatically update the config if it’s changed (so long as at least one member of the old config is still in the set).

    If you delete the local.* files, MongoDB will have to resync the member (defeating the point of taking a backup in the first place).

    The indexes will be there already. They will continue to be written as new writes are replicated.

  • Dharshan

    Thanks for the clarification. How would I do this if none of the old members are around anymore? I am restoring from the backup of an old replica set which doesn’t exist anymore. If I delete the local.* files on the new primary and do an rs.initiate with the new members, will that work?

  • kristina1

    Here’s the process:

    * Copy the data files onto one of your new servers.
    * Start mongod without the --replSet option.
    * Modify the config manually (it’s a document in local.system.replset). Do a findOne, update it to contain the new members, and save it back to the local.system.replset collection (make sure the collection has one doc, your desired config, when you’re done).
    * Shut down the mongod.
    * Copy this new set of database files to the other members of the set.
    * Start all of the members up normally (with --replSet).
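    In shell terms (with made-up hostnames), the config edit in the third step might look like:

```javascript
// mongod is running *without* --replSet here.
var conf = db.getSiblingDB("local").system.replset.findOne();
conf.members = [
    { _id : 0, host : "newhost1:27017" },
    { _id : 1, host : "newhost2:27017" }
];
db.getSiblingDB("local").system.replset.save(conf);
```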

  • Dharshan

    This is great Kristina! Thanks for the detailed steps. Here is another option I was considering
    1. Create the new replica set and let the initial sync start. This will update the local db on the secondaries with the correct data
    2. Stop secondaries and overwrite the data files on the secondaries with the seed data and restart them.

    Will this work? This way I don’t have to manually update the configuration on any of the secondaries.

  • kristina1

    You’re welcome! Your plan will work so long as there are no writes coming into the system. If you go with this plan, don’t copy the local.* files in step 2 (or it will overwrite the new config).

kristina chodorow's blog