How MongoDB’s Journaling Works

I was working on a section on the gooey innards of journaling for The Definitive Guide, but then I realized it’s an implementation detail that most people won’t care about. However, I had all of these nice diagrams just lying around.

Good idea, Patrick!

So, how does journaling work? Your disk has your data files and your journal files, which we’ll represent like this:

When you start up mongod, it maps your data files to a shared view. Basically, the operating system says: “Okay, your data file is 2,000 bytes on disk. I’ll map that to memory address 1,000,000-1,002,000. So, if you read the memory at memory address 1,000,042, you’ll be getting the 42nd byte of the file.” (Also, the data won’t necessarily be loaded until you actually access that memory.)
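To make the mapping idea concrete, here is a minimal sketch using Python's standard `mmap` module. It is only an analogy for what the operating system does for mongod; the helper names `make_demo_file` and `read_byte_via_mmap` are made up for illustration, and the file is a throwaway temp file, not a real MongoDB data file.

```python
import mmap
import os
import tempfile


def make_demo_file(size: int = 2000) -> str:
    """Create a temp file of `size` bytes where byte i holds the value i % 256."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(i % 256 for i in range(size)))
    return path


def read_byte_via_mmap(path: str, offset: int) -> int:
    """Map the whole file into memory and read one byte through the mapping."""
    with open(path, "rb") as f:
        # Length 0 means "map the entire file".
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            # Indexing the mapping reads memory; the OS pulls the page
            # in from the file only when it is actually touched.
            return view[offset]


demo_path = make_demo_file()
byte_42 = read_byte_via_mmap(demo_path, 42)
os.remove(demo_path)
```

Reading offset 42 of the mapping returns whatever is stored at offset 42 of the file, with no explicit `read()` call.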

This memory is still backed by the file: if you make changes in memory, the operating system will flush these changes to the underlying file. This is basically how mongod works without journaling: it asks the operating system to flush in-memory changes every 60 seconds.

However, with journaling, mongod makes a second mapping, this one to a private view. Incidentally, this is why enabling journaling doubles the amount of virtual memory mongod uses.

Note that the private view is not connected to the data file, so the operating system cannot flush any changes from the private view to disk.
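The difference between the two kinds of mapping can be sketched with Python's `mmap` module, where `ACCESS_WRITE` gives a write-through (shared) mapping and `ACCESS_COPY` gives a copy-on-write (private) one. This is an analogy under that assumption, not mongod's own code; `demo_views` is a hypothetical helper.

```python
import mmap
import os
import tempfile


def demo_views():
    """Write through a shared and a private mapping; report what reaches disk."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\x00" * 4096)

    with open(path, "r+b") as f:
        shared = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE)
        private = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)

        private[0:5] = b"hello"    # changes memory only, never the file
        shared[10:15] = b"world"   # backed by the file
        shared.flush()             # ask the OS to write the changes out

        private_bytes = bytes(private[0:5])
        shared.close()
        private.close()

    with open(path, "rb") as f:
        on_disk = f.read()
    os.remove(path)
    return private_bytes, on_disk[0:5], on_disk[10:15]


private_bytes, disk_start, disk_later = demo_views()
```

The private mapping sees its own write, but the file on disk only ever receives the bytes written through the shared mapping.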

Now, when you do a write, mongod writes this to the private view.

mongod will then write this change to the journal file, creating a little description of which bytes in which file changed.

The journal appends each change description it gets.
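As a rough sketch of what such a change description might look like: the record layout and the names `encode_entry`, `append_entry`, and `decode_entries` are invented for illustration, and this is not mongod's actual on-disk journal format. An in-memory `BytesIO` stands in for the journal file.

```python
import io
import struct

# Hypothetical record header: file number, byte offset, payload length.
# Little-endian, no padding: 4 + 8 + 4 = 16 bytes.
HEADER = struct.Struct("<IQI")


def encode_entry(file_no: int, offset: int, data: bytes) -> bytes:
    """Describe 'these bytes changed at this offset in this file'."""
    return HEADER.pack(file_no, offset, len(data)) + data


def append_entry(journal, file_no: int, offset: int, data: bytes) -> int:
    """Append one change description to the end of the journal."""
    record = encode_entry(file_no, offset, data)
    journal.seek(0, io.SEEK_END)   # append-only: always write at the end
    journal.write(record)
    return len(record)


def decode_entries(raw: bytes):
    """Walk the journal front to back, yielding (file_no, offset, data)."""
    pos = 0
    while pos < len(raw):
        file_no, offset, length = HEADER.unpack_from(raw, pos)
        pos += HEADER.size
        yield file_no, offset, raw[pos:pos + length]
        pos += length


journal = io.BytesIO()
append_entry(journal, 0, 42, b"abc")
append_entry(journal, 0, 100, b"xy")
entries = list(decode_entries(journal.getvalue()))
```

Because entries are only ever added at the end, reading the journal back yields the changes in exactly the order they were made.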

At this point, the write is safe. If mongod crashes, the journal can replay the change, even though it hasn’t made it to the data file yet.

The journal will then replay this change on the shared view.
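Replaying can be pictured as applying each change description, in order, to the right region of memory. The sketch below uses a plain `bytearray` as a stand-in for the shared view; the `Change` class, the `replay` function, and the file name `test.0` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One journaled change: which file, where, and the new bytes."""
    file_name: str
    offset: int
    data: bytes


def replay(journal, views):
    """Apply journaled changes, oldest first, to the matching view."""
    for change in journal:
        view = views[change.file_name]
        view[change.offset:change.offset + len(change.data)] = change.data


shared_view = {"test.0": bytearray(16)}
journal = [Change("test.0", 2, b"hi"), Change("test.0", 3, b"!!")]

replay(journal, shared_view)
replay(journal, shared_view)  # a second replay leaves the same bytes
result = bytes(shared_view["test.0"][:6])
```

Since each entry says "put these exact bytes at this exact offset", replaying the same journal more than once is harmless, which is what makes replay after a crash safe in this model.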

Then mongod remaps the shared view to the private view. This prevents the private view from getting too “dirty” (having too many changes from the shared view it was mapped from).

Finally, at a glacial speed compared to everything else, the shared view will be flushed to disk. By default, mongod requests that the OS do this every 60 seconds.
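Putting the steps together, here is a toy model of the whole sequence. It is a simplification under stated assumptions: `ToyMongod` and its method names are hypothetical, `bytearray`s stand in for the memory-mapped views and the data file, and a Python list stands in for the journal file.

```python
class ToyMongod:
    """Toy model of the write path; nothing here touches a real disk."""

    def __init__(self, size: int):
        self.data_file = bytearray(size)                 # "on disk"
        self.journal = []                                # "on disk", append-only
        self.shared_view = bytearray(self.data_file)     # backed by data_file
        self.private_view = bytearray(self.shared_view)  # not backed by anything
        self.pending = []                                # not yet journaled

    def write(self, offset: int, data: bytes) -> None:
        # The write lands in the private view only.
        self.private_view[offset:offset + len(data)] = data
        self.pending.append((offset, data))

    def commit_journal(self) -> None:
        # Append change descriptions to the journal: the writes are now safe.
        self.journal.extend(self.pending)
        # Replay the same changes on the shared view.
        for offset, data in self.pending:
            self.shared_view[offset:offset + len(data)] = data
        self.pending.clear()
        # Remap the private view from the shared view.
        self.private_view = bytearray(self.shared_view)

    def flush_data_file(self) -> None:
        # The slow background flush of the shared view to the data file.
        self.data_file[:] = self.shared_view
        self.journal.clear()  # flushed entries are no longer needed

    @staticmethod
    def recover(data_file, journal) -> bytearray:
        """After a crash, rebuild state from what was on disk."""
        recovered = bytearray(data_file)
        for offset, data in journal:
            recovered[offset:offset + len(data)] = data
        return recovered


db = ToyMongod(8)
db.write(0, b"ab")
db.commit_journal()
db.write(4, b"zz")  # still only in the private view when we "crash"
recovered = ToyMongod.recover(db.data_file, db.journal)
```

In this model the data file is still all zeros at the moment of the crash, yet recovery brings back the journaled write. The write that never reached the journal is lost, which matches the guarantee described above.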

And that’s how journaling works. Thanks to Richard, who gave the best explanation of this I’ve heard (Richard is going to be teaching an online course on MongoDB this fall, if you’re interested in more wisdom from the source).

  • woaksie

    How does the oplog fit into all this?

  • http://www.justaprogrammer.net Justin Dearing

    Awesome explanation. Do you know of any articles that compare and contrast mongo’s journaling to a journal used by a file system or a database? Obviously the basics are the same (write change descriptions sequentially to disk before actually writing the data, so they can be replayed later). However, it would be nice to see how different datastores with different considerations solved a similar problem.

  • kristina1

    Good question.  The oplog is a normal collection.  It is journaled in the same way that every other collection is journaled.  If mongod is running without journaling and crashes, the oplog may be corrupt like any other collection.

    MongoDB could have been designed to use the journal instead of the oplog for replication.  However, replication was written before journaling was implemented.  This might be an option in the future, but there are some benefits to having a “human-readable” replication log.

    Does that make sense?

  • kristina1

    Thank you!  http://www.ibm.com/developerworks/linux/library/l-journaling-filesystems/ looks pretty interesting for filesystems, does anyone know any good descriptions for relational DBs out there?

  • http://blog.serverdensity.com/ David Mytton

    What characteristics of the journal mean that it is durable, in the sense that it doesn’t get affected by a crash? The append only nature?

  • kristina1

    Once the data is written to the journal, it never changes (so once it’s written it’s safe).  The interesting thing is that the machine could go down in the middle of writing a ledger (entry) to the journal, in which case some of the ledger may be written, some may not be.  Thus, each ledger has a header and footer with a checksum so that, before replaying it, mongod knows that the whole thing was written correctly to disk.  If the checksum doesn’t match the data or the footer is missing (or whatever), the ledger is discarded and that write is lost (and due to the append-only nature of the file, that can only happen to the final ledger). 

  • http://blog.serverdensity.com/ David Mytton

    Looks like my comment above was flagged for review and isn’t showing…

  • http://twitter.com/xXstandstillXx Machika Kara Kuro

    To me, the fact that it’s written forever does not sit well. If the mongod server restarts and the journal and data files are in sync, then the journal should be cleared; it seems like wasted disk space to have both versions there. I also think that as the journal gets longer, having a 20GB+ file to parse would be a pain when mongod reads it to check that the last writes before an unsafe shutdown also made it to the data file. Even if the file is read from end to start, there is a lot of overhead to handle.

  • kristina1

    I’m so confused… I couldn’t even find it in Disqus.  I tried “editing” it and resaving (without changing anything).

  • http://blog.serverdensity.com/ David Mytton

    Thanks, is showing now.

  • kristina1

    The journal files are cleared once they’ve been used.  MongoDB should only ever keep around a couple of journal files at a time (each journal file is 2GB, so you’ll never have a 20GB file).  You’ll generally have one or two “active” journal files and two preallocated journal files.

  • http://twitter.com/comerford Adam Comerford

    Hi Kristina – you mention that the shared view is flushed to disk (in the background) every 60 seconds.  The journal, by default, is flushed to disk every 100ms – is that the append in the diagram when the private view appends to the journal file?

  • kristina1

    Yes, exactly.  The 100ms flush is when it takes all changes to the private view and appends them to the journal.

  • http://blog.poundbang.in Harish Mallipeddi

    When you issue an msync() on the shared view, how do you guarantee the on-disk data file won’t be corrupted? If it crashes in the midst of an msync(), there’s no guarantee as to what order pages got written to disk and if there were partial page writes to disk? It seems to me like the journal is not going to help in these cases. In traditional databases like InnoDB, there’s the double-write buffer to guard against partial page writes.

  • kristina1

    See Eliot’s answer on the MongoDB blog: http://blog.mongodb.org/post/33700094220/how-mongodbs-journaling-works#comment-684898620.  To elaborate a bit, the shared view is only flushing changes that have already been written to the journal.  Therefore, if pages are partially flushed and then the machine crashes, it doesn’t matter: the journal has the full version of those partially flushed changes.  It can just rewrite those pages on start up.

  • http://blog.poundbang.in Harish Mallipeddi

    Thanks for the answer. So the journal does have full versions of pages (not just the diff) – that sounds similar to how postgres does things (there’s a full version of each page in the WAL after each checkpoint).

  • kristina1

    The journal doesn’t have to keep around full pages because it doesn’t really matter if the unchanged parts of the page were half-flushed: they were being re-written to the same values that they were before, so there’s no way to “corrupt” them.

  • kristina1

    One clarification: my coworker Scott mentioned that you might be talking about when the log sequence number is written, which does not get updated until after the shared view sync is complete.

  • http://twitter.com/tobiastrelle Tobias Trelle

    Excellent post, much better explained than in the official documentation.

    But still, I have some questions:

    1) I assume the journal file is not mmap-ped? Is that true?

    2) Why is writing to the journal file more secure than writing to the data file? Let’s assume the disk is full, so mongod cannot append journal entries any longer. In that situation writing to the data file may still be possible because the corresponding file was already preallocated.

    3) Is it true that journaling is only intended for single node durability? From what I know, in a replica set the oplog is used to recover out-of-date nodes.

    TIA
    Tobias

  • kristina1

    1) Correct.

    2) As you’ll see if you try to write _a lot_ of data, write requests will block if the journal is unable to flush.  So it should just block all writes if it ran out of disk, but there’s some special code that handles running out of space that I’m not familiar with, so it might do something smarter (e.g., error out the writes).  Also, just FYI, MongoDB preallocates journal files as well as data files, so you’d start seeing failures as soon as the preallocation failed.

    3) Yes.  The journal has instructions like “write byte X to offset Y in file Z.”  The oplog is more like “write document {…} to collection W.”  More human readable, but each member must be run with journaling to be crash safe.

  • Jose Sebastian Battig

    Great article!
    Now a question, is there a “private view” per connection ? Or is the “private view” shared between connections? I know making the private view “shared” doesn’t make any sense by the name of “private view” vs “shared view”, but it’s important to understand.
    A behavior we are seeing on automated tests is that on highly concurrent read/write scenarios, even after a flush to journal on a writer thread another reader thread that is trying to fetch the same object doesn’t seem to get a fresh version until a “period” of time (very short indeed).
    Is this because “private views” are private per connection so until data makes it to shared view they are not visible to the rest of the world?

  • kristina1

    Thanks!  The same private view is used by all connections, but that can’t cause the issue you’re seeing.  Any write is immediately viewable by readers as soon as it has been written (well before it has been flushed or remapped).  

    Generally, the issue in this type of test is that you need to set write concern to wait for a DB response before expecting a reader to find the write.  If write concern is not set properly, the client will continue “successfully” before the DB has actually performed the write.  If you’re still having problems, asking on the mailing list might be helpful (https://groups.google.com/forum/?fromgroups=#!forum/mongodb-user).

  • Jack Chan

    Why is the private view needed? If I just write to the shared view and msync it, what’s the difference between them?
    Thank you!

  • kristina1

    The OS can write data from the shared view to disk at any time without telling MongoDB. Thus, if we just used the shared view, data could end up in the data files before being written to the journal file. That would make the journal essentially useless.

  • Steven Niu

    I have a question on ‘remap shared view to private view to prevent private view from getting too dirty’

    According to what I understood, on a write request, the data update sequence is: private view -> journal file -> shared view -> data files. So the data in the private view should not be older than the shared view, why is remapping required? And does the remapping have risk of losing data?

  • kristina1

    > So the data in the private view should not be older than the shared view, why is remapping required?

    Suppose you’ve just started MongoDB. The private view takes up basically no memory. Now, suppose you write a KB of data to MongoDB. Now the private view takes up 1 KB of memory. Now you write 23 MB. Now the private view is taking up 23.001 MB of space (23 MB + 1 KB). This continues to grow, the private view using more and more memory as you write more data. When MongoDB remaps the private view, it takes up (approximately) 0 space again.

    > And does the remapping have risk of losing data?

    No. Once the data is in the journal it is safe.

  • ymhctt

    In my opinion, “remaps the shared view to the private view” means almost the same thing as a checkpoint in an RDBMS, doesn’t it?

  • kristina1

    No, the step where the journal appends the change description is the most similar to a checkpoint.

    The remapping is an optimization; it has nothing to do with durability.

  • Deepak Saxena

    Best One…!! :)

  • Viren

    Thanks for the great post, but I’m just trying to understand the concise advantages and disadvantages of MongoDB journaling.

    Advantages:

    – All writes are safe

    – Durability

    Disadvantages:

    I’m not sure about this: is there a performance tradeoff, especially for read operations, when using the journal?

    Can you shed light on this as well?

    Also, when should I think of using the journal? Only when I’m concerned about data consistency?

    Also if possible for you how would you like to answer this question

    http://dba.stackexchange.com/questions/49956/mongodb-advantages

  • kristina1

    The advantage is writes are durable, the disadvantage is writes are slower. Journaling shouldn’t affect read speed.

  • iamthevillageidiot

    Sorry, I know this is old but I’m still not clear on what that means… When a write request hits the primary, does that make two journal entries at the same time – one for the oplog collection and one for the intended collection? Or is the intended collection data change driven from the oplog collection? Or something else?

  • kristina1

    No problem, glad people are still finding it useful! Yes, two journal entries are flushed at the same time.

  • dfm

    Does MongoDB issue a remap (shared view to private view) only after writing all the changes in the journal to the shared view? Does it block write access to private view when remapping is in progress?

  • dfm

    So, is the remap done after all the updates from journal entries have made it to the Shared view?

  • kristina1

    Yes.

  • Ganesh Chandrasekaran

    Thanks for the detailed explanation, Kristina. The follow up comments are even more informative. I have attached an image from my understanding of all this. Can you let me know if this is correct?

  • kristina1

    You’re welcome! The diagram is almost correct. The oplog is not a separate component: the writes to it are journaled/written to the data files at the same time the “normal” writes are. So, you can get rid of that box/arrow altogether. Also, the secondary nodes don’t get the data from the data files on disk, but from the private view.

kristina chodorow's blog