Replica Set Internals Part V: Initial Sync

I’ve been doing replica set “bootcamps” for new hires. It’s mainly focused on applying this to debug replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:

  1. Elections
  2. Creating a set
  3. Reconfiguring
  4. Syncing
  5. Initial Sync
  6. Rollback
  7. Authentication
  8. Debugging

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

The Initial Sync Processes

When you add a brand new member to the set, it needs to copy over all of the data before it’s ready to be a “real” member. This process is called initial syncing and there are, essentially, 7 steps:

  1. Check the oplog. If it is not empty, this node does not initial sync, it just starts syncing normally. If the oplog is empty, then initial sync is necessary, continue to step #2:
  2. Get the latest oplog time from the source member: call this time start.
  3. Clone all of the data from the source member to the destination member.
  4. Build indexes on destination.
  5. Get the latest oplog time from the sync target, which is called minValid.
  6. Apply the sync target’s oplog from start to minValid.
  7. Become a “normal” member (transition into secondary state).

Note that this process only checks the oplog. You could have a petabyte of data on the server, but if there’s no oplog, the new member will delete all of it and initial sync. Members depend on the oplog to know where to sync from.

So, suppose we’re starting up a new member with no data on it. MongoDB checks the oplog, sees that it doesn’t even exist, and begins the initial sync process.

Copying Data

The code for this couldn’t be much simpler, in pseudo code it is basically:

for each db on sourceServer:
    for each collection in db:
        for each doc in db.collection.find():

One of the issues with syncing is that it has to touch all of the source’s data, so if you’ve been carefully cultivating a working set on sourceServer, it’ll pretty much be destroyed.

There are benefits to initial syncing, though: it effectively compacts your data on the new secondary. As it’s doing are inserts, it’ll use pretty much the minimum amount of space. Some users actually use rotating resyncs to keep their data compact.

On the downside, initial sync doesn’t consider padding factor, so if that’s important to your application, the new server will have to build up the right padding factor over time.

Syncing on a Live System

The tricky part of initial syncing is that we’re trying to copy an (often) massive amount of data off of a live system. New data will be written while this copy is taking place, so it’s a bit like trying copy a tree over six months.

By the time you’re done, the data you copied first might have changed significantly on the source. The copy on the destination might not be… exactly what you’d expect.

That is what the oplog replay step is for: to get your data to a consistent state. Oplog ops are idempotent, they can be applied multiple times and yield the same answer. Thus, so long as we apply all of the writes at least once (remember, they may or may not have been applied on the source before the copy), we’ll end up with a consistent picture when we’re done.

Like so.

minValid (as mentioned in the list above) is the first timestamp where our new DB is in a consistent state: it may be behind the other members, but its data matches exactly how the other servers looked at some point in time.

Some examples of idempotency, as most people haven’t seen it since college:

// three idempotent functions:
function idemp1(doc) {
   doc.x = doc.x + 0;
function idemp2(doc) {
   doc.x = doc.x * 1;
function idemp3(doc) {
   // this is what replication does: it turns stuff like "$inc 4 by 1" into "$set to 5"
   doc.x = 5;
// two non-idempotent functions
function nonIdemp1(doc) {
   doc.x = doc.x + 1;
function nonIdemp2(doc) {
   doc.x = Math.random();

No matter how many times you call the idempotent functions the value of doc.x will be the same (as long as you call them at least once) .

Building Indexes

In 2.0, indexes were created on the secondary as part of the cloning step, but in 2.2, we moved index creation to after the oplog application. This is because of an interesting edge case. Let’s say we have a collection representing the tree above and we have a unique index on leaf height: no two leaves are at exactly the same height. So, pretend we have a document that looks like this:

    "_id" : 123,
    "height" : 76.3

The cloner copies this doc from the source server to the destination server and moves on. On the source, we remove this leaf from the tree because of, I dunno, high winds.

> db.tree.remove({_id:123})

However, the cloner has already copied the leaf, so it doesn’t notice this change. Now another leaf might grow at this height. Let’s say leaf #10,012 grow to this height on the source.

> db.tree.update({_id:10012}, {$set : {height : 76.3}})

Now, when the cloner gets to document #10012, it’ll copy it to the destination server. Now there are two documents with the same height field in the destination collection, so when it tries to create a unique index on “height”, the index creation will fail!

So, we moved the index creation to after the oplog application. That way, we’re always building the index on a consistent data set, so it should always succeed.

There are a couple of other edge cases like that which have been super-fun to track down, which you can look up in Jira if you’re interested.

Restoring from Backup

Often, initial sync is too slow for people. If you want to get a secondary up and running as fast as possible, the best way to do so is to skip initial sync altogether and restore from backup. To restore from backup:

  1. Find a secondary you like the looks of.
  2. Either shut it down or fsync+lock it. Basically, get its data files into a clean state, so nothing is writing to them while you’re copying them.
  3. Copy the data files to your destination server. Make sure you get all of your data if you’re using any symlinks or anything.
  4. Start back up or unlock the source server.
  5. Point the destination server at the data you copied and start it up.

As there is already an oplog, it will not need to initial sync. It will begin syncing from another member’s oplog immediately when it starts up and usually catch up quite quickly.

Note that mongodump/mongorestore actually does not work very well as a “restoring from backup” strategy because it doesn’t give you an oplog. You can create one on your own and prime it, but it’s more work and more fiddly than just copying files. There is a feature request for mongorestore to be able to prime the oplog automatically, but it won’t be in 2.2.

P.S. Trees were done with an awesome program I recently discovered called ArtRage, which I highly recommend to anyone who likes painting/drawing. It “feels” like real paint.

kristina chodorow's blog