Replica Set Internals Bootcamp: Part II – Creating a Set

I’ve been doing replica set “bootcamps” for new hires. It’s mainly focused on applying this to debug replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:

  1. Elections
  2. Creating a set
  3. Reconfiguring
  4. Syncing
  5. Initial Sync
  6. Rollback
  7. Authentication
  8. Debugging

I’m going to do one subject per post, we’ll see how many I can get through.

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

Initializing a Set

Suppose you are creating a new set. You start up some mongods with no data and the replSet option. When a server starts up with the replSet option, the first thing it does is check its local.system.replset collection for a replica set config it can load, but it won’t find one (because there’s no data). Since it cannot find a config, it goes into the EMPTYCONFIG state. If you look in the logs, you’ll see messages about EMPTYCONFIG.

Tue Mar  6 12:24:35 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar  6 12:24:35 [rsStart] replSet info you may need to run replSetInitiate -- rs.initiate() in the shell -- if that is not already done
Tue Mar  6 12:24:45 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar  6 12:24:55 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar  6 12:25:05 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar  6 12:25:15 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar  6 12:25:25 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)

Now suppose you send the replSetInitiate command to one of the members. replSetInitiate can either take a custom configuration or generate a config automatically. If you do not pass in a config, the server will try to figure out what its hostname is and generate a config using that.

Note: on EC2, the server always chooses the wrong hostname, so you have to pass in the config you want.

Once the server has a config document, either passed in or generated, it’ll make sure it can reach every node specified (all members must be up and reachable to initiate), then store the config in the local.system.replset collection. Finally, it begins sending heartbeat requests to the other members of the set.

The Log: Step-by-Step

Being able to interpret log files is critical to knowing what’s going on, so let’s walk through a sample log from running rs.initiate(). Here’s the whole log, feel free to skip over this for now. We’ll take it piece-by-piece below.

Tue Mar  6 12:45:00 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar  6 12:45:02 [initandlisten] connection accepted from 127.0.0.1:38839 #2 (2 connections now open)
Tue Mar  6 12:45:08 [conn2] replSet replSetInitiate admin command received from client
Tue Mar  6 12:45:08 [conn2] replSet info initiate : no configuration specified.  Using a default configuration for the set
Tue Mar  6 12:45:08 [conn2] replSet created this configuration for initiation : { _id: "foo", members: [ { _id: 0, host: "ubuntu:30000" } ] }
Tue Mar  6 12:45:08 [conn2] replSet replSetInitiate config object parses ok, 1 members specified
Tue Mar  6 12:45:08 [conn2] replSet replSetInitiate all members seem up
Tue Mar  6 12:45:08 [conn2] ******
Tue Mar  6 12:45:08 [conn2] creating replication oplog of size: 49085MB...
Tue Mar  6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.ns, filling with zeroes...
Tue Mar  6 12:45:08 [FileAllocator] creating directory /datadir/_tmp
Tue Mar  6 12:45:08 [FileAllocator] done allocating datafile /datadir/local.ns, size: 16MB,  took 0.016 secs
Tue Mar  6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.0, filling with zeroes...
Tue Mar  6 12:45:08 [FileAllocator] done allocating datafile /datadir/local.0, size: 2047MB,  took 0.016 secs
Tue Mar  6 12:45:08 [conn2] datafileheader::init initializing /datadir/local.0 n:0
Tue Mar  6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.1, filling with zeroes...
...
Tue Mar  6 12:45:09 [conn2] datafileheader::init initializing /datadir/local.22 n:22
Tue Mar  6 12:45:09 [FileAllocator] allocating new datafile /datadir/local.23, filling with zeroes...
Tue Mar  6 12:45:09 [FileAllocator] done allocating datafile /datadir/local.23, size: 2047MB,  took 0.042 secs
Tue Mar  6 12:45:09 [conn2] datafileheader::init initializing /datadir/local.23 n:23
Tue Mar  6 12:45:10 [conn2] ******
Tue Mar  6 12:45:10 [conn2] replSet info saving a newer config version to local.system.replset
Tue Mar  6 12:45:10 [conn2] replSet saveConfigLocally done
Tue Mar  6 12:45:10 [conn2] replSet replSetInitiate config now saved locally.  Should come online in about a minute.
Tue Mar  6 12:45:10 [conn2] command admin.$cmd command: { replSetInitiate: undefined } ntoreturn:1 keyUpdates:0 reslen:196 2802ms
Tue Mar  6 12:45:10 [rsStart] replSet load config ok from self
Tue Mar  6 12:45:10 [rsStart] replSet I am ubuntu:30000
Tue Mar  6 12:45:10 [rsStart] replSet STARTUP2
Tue Mar  6 12:45:10 [rsSync] replSet SECONDARY
Tue Mar  6 12:45:25 [rsSync] waiting for 2 pings from other members before syncing
Tue Mar  6 12:45:26 [rsMgr] replSet info electSelf 0
Tue Mar  6 12:45:26 [rsMgr] replSet PRIMARY

Now, to understand this line-by-line. First, we start out in the EMPTYCONFIG state, as described above.

Tue Mar  6 12:45:00 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)

Now the shell connects to the server and runs rs.initiate() (replSetInitiate), which apparently takes me 6 seconds to type (take that, Mavis Beacon).

Tue Mar  6 12:45:02 [initandlisten] connection accepted from 127.0.0.1:38839 #2 (2 connections now open)
Tue Mar  6 12:45:08 [conn2] replSet replSetInitiate admin command received from client

Because we didn’t give a config, the server tries to generate one:

Tue Mar  6 12:45:08 [conn2] replSet info initiate : no configuration specified.  Using a default configuration for the set
Tue Mar  6 12:45:08 [conn2] replSet created this configuration for initiation : { _id: "foo", members: [ { _id: 0, host: "ubuntu:30000" } ] }

Now it does some sanity checks on this config (e.g., making sure all members are up and reachable). This matters more if we gave it a config, but sometimes things are so screwed up you can’t reach yourself.

Tue Mar  6 12:45:08 [conn2] replSet replSetInitiate config object parses ok, 1 members specified
Tue Mar  6 12:45:08 [conn2] replSet replSetInitiate all members seem up

Now it calculates the size of the oplog and allocates it between two lines of ******.

Tue Mar  6 12:45:08 [conn2] ******
Tue Mar  6 12:45:08 [conn2] creating replication oplog of size: 49085MB...
Tue Mar  6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.ns, filling with zeroes...
Tue Mar  6 12:45:08 [FileAllocator] creating directory /datadir/_tmp
Tue Mar  6 12:45:08 [FileAllocator] done allocating datafile /datadir/local.ns, size: 16MB,  took 0.016 secs
Tue Mar  6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.0, filling with zeroes...
Tue Mar  6 12:45:08 [FileAllocator] done allocating datafile /datadir/local.0, size: 2047MB,  took 0.016 secs
Tue Mar  6 12:45:08 [conn2] datafileheader::init initializing /datadir/local.0 n:0
Tue Mar  6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.1, filling with zeroes...
...
Tue Mar  6 12:45:09 [conn2] datafileheader::init initializing /datadir/local.22 n:22
Tue Mar  6 12:45:09 [FileAllocator] allocating new datafile /datadir/local.23, filling with zeroes...
Tue Mar  6 12:45:09 [FileAllocator] done allocating datafile /datadir/local.23, size: 2047MB,  took 0.042 secs
Tue Mar  6 12:45:09 [conn2] datafileheader::init initializing /datadir/local.23 n:23
Tue Mar  6 12:45:10 [conn2] ******

It doesn’t actually print the “…” part, I just omitted 21 files being preallocated.

Now it stores the configuration in the local.system.replset collection.

Tue Mar  6 12:45:10 [conn2] replSet info saving a newer config version to local.system.replset
Tue Mar  6 12:45:10 [conn2] replSet saveConfigLocally done
Tue Mar  6 12:45:10 [conn2] replSet replSetInitiate config now saved locally.  Should come online in about a minute.

It seldom takes a minute for it to come online, but this way you’ll be pleasantly surprised.

replSetInitiate is now done running. It will often be logged, because slow operations are logged and allocating the oplog usually takes a while.

Tue Mar  6 12:45:10 [conn2] command admin.$cmd command: { replSetInitiate: undefined } ntoreturn:1 keyUpdates:0 reslen:196 2802ms

Now it actually loads this configuration as the replica set config we want to use:

Tue Mar  6 12:45:10 [rsStart] replSet load config ok from self

You won’t see this next line if you’re running 2.0, I added it for 2.2 because I was sick of having to decipher which server a log was from:

Tue Mar  6 12:45:10 [rsStart] replSet I am ubuntu:30000

Yay, now config has been loaded, making our state STARTUP2 (the “config has been loaded, starting up RS threads” state):

Tue Mar  6 12:45:10 [rsStart] replSet STARTUP2

At this point, it can go straight into PRIMARY state, but it doesn’t. We could fix this, but it hasn’t been a priority, so it goes into SECONDARY state briefly before becoming PRIMARY:

Tue Mar  6 12:45:10 [rsSync] replSet SECONDARY

A “real” secondary (we’re about to become primary, so we don’t count) will gather some stats about who’s best to sync from before it chooses a target:

Tue Mar  6 12:45:25 [rsSync] waiting for 2 pings from other members before syncing

Good luck with that, buddy, you’re the only one in the config.

Aha, we’ve noticed that we can become primary:

Tue Mar  6 12:45:26 [rsMgr] replSet info electSelf 0
Tue Mar  6 12:45:26 [rsMgr] replSet PRIMARY

And now we’re all set.

Propegating the Config

If we add members or provide an initial config with multiple members, the initial member’s heartbeat requests are sent to a bunch of servers in EMPTYCONFIG state. When these other servers receive the heartbeat request, they notice “Hey, I don’t have any config and this guy seems to think I’m part of his set, so I should check in with him.”

These EMPTYCONFIG members will then request a config from the member that requested the heartbeat. Seeing themselves in the config, they’ll save it to their own local.system.replset collection, then start sending out their own heartbeats. At that point, they’ll move into RECOVERING state and, once they’re synced to the original member, they’ll turn into secondaries (unless they’re arbiters, of course, but same general idea).

One quirk of initiating a set is that every member of the set must not have any data, other than the one you’re sending replSetInitiate to. After the set is initialized you can add members containing data, but at most one member can have data when you initialize the set.

Confused?

Feel free to ask questions in the comments below. This is a loving, caring bootcamp (as bootcamps go).

  • Yay, part 2! Thanks Kristina.

  • kristina1

     Thanks for letting me know!  It’s motivating to hear that people other than 10gen employees are interested 🙂

  • Gepo

    Looking forward to next part

  • Tom

    Noticed that you only have 3 comments, so I’d like to inform you that this information is extremely well received. I need to learn more about mongodb internals, especially about replicasets and this seems to be pretty much all that is available on the web.

  • kristina1

    Thanks, glad it helps!

  • Jane

    You have explained the log very nicely. It is easy to understand and implement. Thank you very much.

  • taf2

    I have a shared cluster of sorts… with 1 config server and 2 shard servers. The two shards start up with –shardsvr. It’s unclear from here whether we need to add the replica set information to the config server or to the shard servers? Trying to add –replSet rs0 to one of the shards yields the error: replSet can’t get local.system.replset config from self or any seed (EMPTYCONFIG)

    The message EMPTYCONFIG makes me wonder does this mean we need to start the config server with replSet rs0? Generally, I’m this confused but definitely set this cluster up incorrectly to start should have had some replica sets already in place before sharding…

  • kristina1

    Each shard can be a replica set (i.e., two separate replica sets, one per shard). Think of it this way: a replica set is in charge of making many copies of the same data. Shards are for splitting up data. So, if you have two shards, half of your data will be on one and half on the other. Then there will be N copies of that data, where N is the number of replica set members (in your case, N=1). Note that the config server _must not be_ a replica set.

    Once you have started a server with –replSet, you have to create a config for it (that’s why it’s complaining about EMPTYCONFIG).

    Please check out the documentation at mongodb.org on setting up sharding and replica sets, there are some examples that was through setting this up.

  • taf2

    Thanks – I figured this out as well and was able to pretty seamlessly add additional replicas. I believe it’s that there is no documentation clearly explaining what to do in the case of a sharded cluster without replicas all the documents assume sharded clusters come after setting up replicas

    It’s working great now – this blog post was really helpful in helping me figure out that each shard is what needs to be replicated… it makes perfect sense just in production i like to get confirmation before taking the leap 😀

  • kristina1

    Well, I think my book covers it pretty clearly, but then I would 🙂

    Glad it worked out!

kristina chodorow's blog