Getting to Know Your Oplog

Keeping with the theme: a blink dog.

This is the second in a series of three posts on replication internals. We’ve already covered what’s stored in the oplog; today we’ll take a closer look at what the oplog is and how it affects your application.

Our application could do billions of writes and the oplog has to record them all, but we don’t want our entire disk consumed by the oplog. To prevent this, MongoDB makes the oplog a fixed-size, or capped, collection (the oplog is actually the reason capped collections were invented).

When you start up the database for the first time, you’ll see a line that looks like:

Mon Oct 11 14:25:21 [initandlisten] creating replication oplog of size: 47MB... (use --oplogSize to change)

Your oplog is automatically allocated to be a fraction of your disk space. As the message suggests, you may want to customize it as you get to know your application.

Protip: you should make sure you start up arbiter processes with --oplogSize 1, so that the arbiter doesn’t preallocate an oplog. There’s no harm in it doing so, but it’s a waste of space as the arbiter will never use it.

Implications of using a capped collection

The oplog is a fixed size so it will eventually fill up. At this point, it’ll start overwriting the oldest entries, like a circular queue.
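The overwrite behavior is easy to picture with a toy ring buffer. A minimal Python sketch (note: MongoDB actually caps the oplog by total size in bytes, not by entry count, but the overwrite semantics are the same):

```python
from collections import deque

# A toy "oplog" capped at 5 entries. Once full, appending a new
# entry silently drops the oldest one, just like a capped collection.
oplog = deque(maxlen=5)

for op_id in range(8):
    oplog.append({"ts": op_id, "op": "i"})  # record ops 0..7

# Ops 0-2 have been overwritten; only the newest 5 remain:
print([entry["ts"] for entry in oplog])  # [3, 4, 5, 6, 7]
```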

It’s usually fine to overwrite the oldest operations because the slaves have already copied and applied them. Once everyone has an operation there’s no need to keep it around. However, sometimes a slave will fall very far behind and “fall off” the end of the oplog: the latest operation it knows about is before the earliest operation in the master’s oplog.

oplog time ->
<..............................>
   ^         ^    ^        ^
   |_________|    |________|
      slave         master

If this occurs, the slave will start giving error messages about needing to be resynced. It can’t catch up to the master from the oplog anymore: it might miss operations between the last oplog entry it has and the master’s oldest oplog entry. It needs a full resync at this point.
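The “fallen off” condition boils down to a single timestamp comparison. Here’s a hypothetical sketch (the function and variable names are made up for illustration, not MongoDB’s internal code):

```python
def needs_full_resync(slave_last_applied_ts, master_oldest_oplog_ts):
    """A slave can only catch up from the oplog if the last operation
    it applied is still covered by the master's oplog. If the master's
    oldest remaining entry is newer, operations were lost to overwrite
    and the slave must do a full resync. (Illustrative helper only.)"""
    return slave_last_applied_ts < master_oldest_oplog_ts

# Say the master's oplog now covers timestamps 100..500. A slave
# stuck at 80 has fallen off the end; one at 250 can still catch up:
print(needs_full_resync(80, 100))   # True
print(needs_full_resync(250, 100))  # False
```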

Resyncing

On a resync or an initial sync, the slave will make a note of the master’s current oplog time and call the copyDatabase command on all of the master’s databases. Once all of the master’s databases have been copied over, the slave makes a note of the time. Then it applies all of the oplog operations from the time the copy started up until the end of the copy.
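The resync steps above can be sketched as a toy in-memory model. Everything here (the `Master` class, its methods, the single-write timeline) is invented for illustration; it is not MongoDB’s implementation:

```python
# Toy model: note the oplog time, copy the databases while writes
# keep arriving, note the time again, then replay the copy window.

class Master:
    def __init__(self):
        self.dbs = {"app": {"x": 1}}
        self.oplog = []   # entries are (ts, db, key, value)
        self.ts = 0

    def write(self, db, key, value):
        self.ts += 1
        self.dbs[db][key] = value
        self.oplog.append((self.ts, db, key, value))

def full_resync(master):
    # 1. Note the master's current oplog time before the copy starts.
    copy_start_ts = master.ts

    # 2. Copy all of the master's databases. Writes can land mid-copy.
    slave_dbs = {name: dict(data) for name, data in master.dbs.items()}
    master.write("app", "y", 2)  # a write that happens during the copy

    # 3. Note the time once the copy is done, then apply every oplog
    #    entry from the copy window so the slave catches up.
    copy_end_ts = master.ts
    for ts, db, key, value in master.oplog:
        if copy_start_ts < ts <= copy_end_ts:
            slave_dbs[db][key] = value
    return slave_dbs

m = Master()
m.write("app", "x", 42)
print(full_resync(m))  # {'app': {'x': 42, 'y': 2}}
```

The key point the sketch shows: without step 4 (replaying the copy window), the slave would miss the write to `y` that happened while the databases were being copied.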

Once it has completed the copy and run through the operations that happened during the copy, it is considered resynced. It can now begin replicating normally again. If so many writes occur during the resync that the slave’s oplog cannot hold them all, you’ll end up in the “need to resync” state again. If this occurs, you need to allocate a larger oplog and try again (or try at a time when the system has less traffic).

Next up: using the oplog in your application.

  • Good Stuff. Thanx

  • You say that we shouldn’t let the arbiter preallocate the oplog, but why wouldn’t we want to do that? Isn’t growing a file nearly as expensive as writing to the file? I would assume that the oplog has to 0 out the file as file growth occurs, otherwise random disk garbage might look like valid oplog. By not preallocating oplog space, aren’t you moving the startup preallocation hit to oplog write time while the app is running in production?

  • Anonymous

    The arbiter never uses the file, it doesn’t keep a copy of the oplog. The file is allocated and then just sits around doing nothing and taking up space. In 1.7.2, arbiters won’t even create an oplog, making the protip unnecessary.

  • Cunning and sensible. Thanks for the explanation, especially since I missed the last phrase of your protip.


kristina chodorow's blog