Replication Internals

Displacer beast... seemed related (it's sort of in two places at the same time).

This is the first in a three-part series on how replication works.

Replication gives you hot backups, read scaling, and all sorts of other goodness. If you know how it works you can get a lot more out of it, from how it should be configured to what you should monitor to using it directly in your applications. So, how does it work?

MongoDB’s replication is actually very simple: the master keeps a collection that describes writes and the slaves query that collection. This collection is called the oplog (short for “operation log”).

The oplog

Each write (insert, update, or delete) creates a document in the oplog collection, so long as replication is enabled (MongoDB won’t bother keeping an oplog if replication isn’t on). So, to see the oplog in action, start by running the database with the –replSet option:

$ ./mongod --replSet funWithOplogs

Now, when you do operations, you’ll be able to see them in the oplog. Let’s start out by initializing out replica set:

> rs.initiate()

Now if we query the oplog you’ll see this operation:

> use local
switched to db local
> db.oplog.rs.find()
{ 
    "ts" : { "t" : 1286821527000, "i" : 1 }, 
    "h" : NumberLong(0), 
    "op" : "n", 
    "ns" : "", 
    "o" : { "msg" : "initiating set" } 
}

This is just an informational message for the slave, it isn’t a “real” operation. Breaking this down, it contains the following fields:

  • ts: the time this operation occurred.
  • h: a unique ID for this operation. Each operation will have a different value in this field.
  • op: the write operation that should be applied to the slave. n indicates a no-op, this is just an informational message.
  • ns: the database and collection affected by this operation. Since this is a no-op, this field is left blank.
  • o: the actual document representing the op. Since this is a no-op, this field is pretty useless.

To see some real oplog messages, we’ll need to do some writes. Let’s do a few simple ones in the shell:

> use test
switched to db test
> db.foo.insert({x:1})
> db.foo.update({x:1}, {$set : {y:1}})
> db.foo.update({x:2}, {$set : {y:1}}, true)
> db.foo.remove({x:1})

Now look at the oplog:

> use local
switched to db local
> db.oplog.rs.find()
{ "ts" : { "t" : 1286821527000, "i" : 1 }, "h" : NumberLong(0), "op" : "n", "ns" : "", "o" : { "msg" : "initiating set" } }
{ "ts" : { "t" : 1286821977000, "i" : 1 }, "h" : NumberLong("1722870850266333201"), "op" : "i", "ns" : "test.foo", "o" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d"), "x" : 1 } }
{ "ts" : { "t" : 1286821984000, "i" : 1 }, "h" : NumberLong("1633487572904743924"), "op" : "u", "ns" : "test.foo", "o2" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d") }, "o" : { "$set" : { "y" : 1 } } }
{ "ts" : { "t" : 1286821993000, "i" : 1 }, "h" : NumberLong("5491114356580488109"), "op" : "i", "ns" : "test.foo", "o" : { "_id" : ObjectId("4cb3586928ce78a2245fbd57"), "x" : 2, "y" : 1 } }
{ "ts" : { "t" : 1286821996000, "i" : 1 }, "h" : NumberLong("243223472855067144"), "op" : "d", "ns" : "test.foo", "b" : true, "o" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d") } }

You can see that each operation has a ns now: “test.foo”. There are also three operations represented (the op field), corresponding to the three types of writes mentioned earlier: i for inserts, u for updates, and d for deletes.

The o field now contains the document to insert or the criteria to update and remove. Notice that, for the update, there are two o fields (o and o2). o2 give the update criteria and o gives the modifications (equivalent to update()‘s second argument).

Using this information

MongoDB doesn’t yet have triggers, but applications could hook into this collection if they’re interested in doing something every time a document is deleted (or updated, or inserted, etc.) Part three of this series will elaborate on this idea.

Next up: what the oplog is and how syncing works.

  • Pingback: Sysadmin Sunday #1 « Boxed Ice Blog

  • Pingback: Monitoring Mongodb collection updates using PyMongo – Introducing Loki « Wicked Wukong's Pensieve

  • Pingback: Comparing MongoDB and SQL Server Replication | Jeremiah Peschka

  • Joris Röling

    In the description above on o and o2 it seems that o is giving the update modifications, and o2 looks like de first argument to update().

    Or am I crazy?

  • Anonymous

    You are correct! Thanks, I’ve fixed the description.

  • Pingback: links for 2011-02-21 « Bloggitation

  • Pingback: ehcache.net

  • roborative

    Just curious if no-ops occur when an update is requested to set a value to the same value as it currently has.

  • kristina1

    No, at the moment it’ll actually log an update op.  This is sort of just an implementation detail, though, not by-design.  It actually may change to no being logged for 2.4, which is having a lot of the update code re-written.

  • roborative

    Thanks for the quick reply, Kristina! I was hoping to use this data for tracking changes to documents over a period and ignoring updates that were effectively no-ops (nothing modified). Perhaps I can work out a way to determine this using the current output.

  • kristina1

    No problem! I think you’d have to compare the op to the underlying document, at the moment.

  • roborative

    OK. I’m concerned that there’s some chance that the document may have changed in the time since the log entry was created and the process that is recording the event reads it. If so then the underlying document may no longer be current.

  • Talha khan

    I am using mongolab for mongodb hosting , so I can access to op-log when mongodb is in localhost but my db is hosted on web.

  • kristina1

    You cannot access the oplog on a hosted solution, as all the ops from everyone else’s DBs would be visible, too.

  • Brandon Mason

    I set up a test to listen to the oplog for changes to a certain collection.  It’s working fine, but I notice there is a delay of about 1 second maybe… 80% of the time.  The other times it executes quickly, completing in under 100ms.  Is there some delay on the oplog generation?  Is it polling to generate the entries?  Is there some configuration I could use to decrease the lag, and would that have other penalties such as over taxing the CPU?

    I am using the node.js driver.  This is my implementation, though I don’t imagine it will work outside of my application environment:

    https://gist.github.com/bitmage/5077908

    I’ll release this as a simple watcher library once I get it cleaned up.

  • Brandon Mason

    I can see myself getting into this same situation.  Is there a work around or alternate method for creating triggers if you do not have access to the oplog?

  • kristina1

    First, it might be the rate you’re writing to the oplog: oplog queries will hang around for a while waiting for results.  Second, you might want to use the “oplog replay” option (no idea what it’s called in Coffeescript) which makes querying the oplog more efficient.

  • Brandon Mason

    Kristina,

    Thanks for your reply!

    I changed the tailableRetryInterval in my code to 100ms, and that reduced the delay.  So it looks like the tailable cursor is actually polling for results.  Hmm… not ideal.  I’ll take this up on the node-mongodb-native google group though, as it seems it’s an implementation detail of the driver.

  • Brandon Mason

    Ok, so I actually found an answer.  If you initialize the cursor with ‘awaitdata: true’ then it will rely on Mongo server functionality to push out the new data.  Here’s an example from the tests in node-mongodb-native:

    https://github.com/mongodb/node-mongodb-native/blob/master/test/cursor_test.js#L2038

    Using this my test runs in ~100ms (20ms lag over just waiting for the update callback), or if the server’s been idle for a little while it’s more like 500ms.  That should be acceptable.

  • kristina1

    Cool, glad it worked out!  I do recommend still looking into the oplog replay flag.

  • Brandon Mason

    You’re referring to the OplogReplay bit mentioned here, I assume?

    http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/

    I think I’ll have to ask Christian more specifically if his driver handles that… I grepped the code and cannot find it, but maybe he is passing all the options through directly.  What specifically does this flag do?  I can’t seem to find much documentation on it, though I see it mentioned by Scott here:

    https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/b1qiuAIG75A

    Thanks again for your help.  This is an adventure into the lair of the beast, to be certain.  :-)

  • kristina1

    Yes, that’s the flag I’m talking about.  The oplog doesn’t have any indexes, so when you query mongod has to scan every document.  The oplog replay flag makes the query start at the latest document and jump back by 200MB at a time to try to find a timestamp earlier than the one you’re querying for.  Then, once it’s found the right oplog segment to search, it’ll move forward one document at a time.  It makes querying the oplog for a particular timestamp much faster.

  • Brandon Mason

    Alrighty, library is published.  Here it is for any interested:

    https://github.com/TorchlightSoftware/mongo-watch

  • kristina1

    Cool, thanks for sharing!

kristina chodorow's blog