Return of the Mongo Mailbag

On the mongodb-user mailing list last week, someone asked (basically):

I have 4 servers and I want two shards. How do I set it up?

A lot of people have been asking questions about configuring replica sets and sharding, so here’s how to do it in nitty-gritty detail.

The Architecture

Prerequisites: if you aren’t too familiar with replica sets, see my blog post on them. The rest of this post won’t make much sense unless you know what an arbiter is. Also, you should know the basics of sharding.

Each shard should be a replica set, so we’ll need two replica sets (we’ll call them foo and bar). We want our cluster to be okay if one machine goes down or gets separated from the herd (network partition), so we’ll spread out each set among the available machines. Replica sets are color-coded and machines are imaginatively named server1-4.

Each replica set has two hosts and an arbiter. This way, if a server goes down, no functionality is lost (and there won’t be two masters on a single server).

To set this up, run:

server1

$ mkdir -p ~/dbs/foo ~/dbs/bar
$ ./mongod --dbpath ~/dbs/foo --replSet foo
$ ./mongod --dbpath ~/dbs/bar --port 27019 --replSet bar --oplogSize 1

server2

$ mkdir -p ~/dbs/foo
$ ./mongod --dbpath ~/dbs/foo --replSet foo

server3

$ mkdir -p ~/dbs/foo ~/dbs/bar
$ ./mongod --dbpath ~/dbs/foo --port 27019 --replSet foo --oplogSize 1
$ ./mongod --dbpath ~/dbs/bar --replSet bar

server4

$ mkdir -p ~/dbs/bar
$ ./mongod --dbpath ~/dbs/bar --replSet bar

Note that arbiters have an oplog size of 1. By default, oplog size is ~5% of your hard disk, but arbiters don’t need to hold any data so that’s a huge waste of space.

Putting together the replica sets

Now, we’ll start up our two replica sets. Start the mongo shell and type:

> db = connect("server1:27017/admin")
connecting to: server1:27017
admin
> rs.initiate({"_id" : "foo", "members" : [
... {"_id" : 0, "host" : "server1:27017"},
... {"_id" : 1, "host" : "server2:27017"},
... {"_id" : 2, "host" : "server3:27019", arbiterOnly : true}]})
{
        "info" : "Config now saved locally.  Should come online in about a minute.",
        "ok" : 1
}
> db = connect("server3:27017/admin")
connecting to: server3:27017
admin
> rs.initiate({"_id" : "bar", "members" : [
... {"_id" : 0, "host" : "server3:27017"},
... {"_id" : 1, "host" : "server4:27017"},
... {"_id" : 2, "host" : "server1:27019", arbiterOnly : true}]})
{
        "info" : "Config now saved locally.  Should come online in about a minute.",
        "ok" : 1
}

Okay, now we have two replica set running. Let’s create a cluster.

Setting up Sharding

Since we’re trying to set up a system with no single points of failure, we’ll use three configuration servers. We can have as many mongos processes as we want (one on each appserver is recommended), but we’ll start with one.

server2

$ mkdir ~/dbs/config
$ ./mongod --dbpath ~/dbs/config --port 20000

server3

$ mkdir ~/dbs/config
$ ./mongod --dbpath ~/dbs/config --port 20000

server4

$ mkdir ~/dbs/config
$ ./mongod --dbpath ~/dbs/config --port 20000
$ ./mongos --configdb server2:20000,server3:20000,server4:20000 --port 30000

Now we’ll add our replica sets to the cluster. Connect to the mongos and and run the addshard command:

> mongos = connect("server4:30000/admin")
connecting to: server4:30000
admin
> mongos.runCommand({addshard : "foo/server1:27017,server2:27017"})
{ "shardAdded" : "foo", "ok" : 1 }
> mongos.runCommand({addshard : "bar/server3:27017,server4:27017"})
{ "shardAdded" : "bar", "ok" : 1 }

Edit: you must list all of the non-arbiter hosts in the set for now. This is very lame, because given one host, mongos really should be able to figure out everyone in the set, but for now you have to list them.

Tada! As you can see, you end up with one “foo” shard and one “bar” shard. (I actually added that functionality on Friday, so you’ll have to download a nightly to get the nice names. If you’re using an older version, your shards will have the thrilling names “shard0000” and “shard0001”.)

Now you can connect to “server4:30000” in your application and use it just like a “normal” mongod. If you want to add more mongos processes, just start them up with the same configdb parameter used above.

  • Eyajudah

    Am amazed that with install base of windows, you guys dont deem it fit to get this tutorials in windows…..Mongodb is blazing the trail I will be suprised if they do not learn from the mistake of MySQL who are just waking up to pursue windows seriously now….You feel everything is Linux…please have a change of mind…

  • Anonymous

    This is my personal blog and I hate using Windows, so I don’t generally do examples with it. However, it’s not exactly hard to translate: use c:dbs… instead of ~/dbs/… and use mongod instead of ./mongod. That’s it, now it works on Windows!

  • Pingback: Things I Read This Week – 2010.08.03 | Jeremiah Peschka()

  • Anonymous

    I was just wondering how do config servers discover each other, so that they can keep in sync?

    Does mongos tell them all, when it is passed a list of config servers? What happens if I launch another mongos with only one config server specified? And what if I launch yet another mongos and add a freshly launched config server to the list?

    Documentation on this part is pretty scarce, not to say nonexistent.

  • Anonymous

    I was just wondering how do config servers discover each other, so that they can keep in sync?

    Does mongos tell them all, when it is passed a list of config servers? What happens if I launch another mongos with only one config server specified? And what if I launch yet another mongos and add a freshly launched config server to the list?

    Documentation on this part is pretty scarce, not to say nonexistent.

  • Anonymous

    Once they know about each other (i.e., you start one mongos listing all three) they can pass messages among themselves. If you start up a mongos with one config specified, it will contact that one and find out about the other two.

    At this point, you can’t really add a “freshly launched config server” to the mix. You’re stuck with the 3 servers you originally specified, so use dynamic hostnames! (This will be configurable in later versions.)

  • Jon

    I was under the impression that when adding a replica set as a shard, you had to specify the whole string of servers in the replica set, ie
    mongos.runCommand({addshard : “foo/server1:27017,foo/server2:27017”})
    has the syntax changed? and if it is the full list, do you need to add the arbiter in there as well?

  • Anonymous

    See the edit section I added earlier today. For now, you have to specify all of the non-arbiter servers (anyone who could become master).

    The syntax for specifying multiple members of a replica set is replSetName/host1,host2,…,hostN. Do not specify the replica set name before each host.

  • mark

    When you’re running a replica set as a shard, do you still need to specify –shardsvr when starting them up? Or rather is there any downside to not specifying shardsvr? http://www.mongodb.org/display/DOCS/Configuring+Sharding#ConfiguringSharding-ShardServers and http://blog.boxedice.com/2010/08/03/automating-partitioning-sharding-and-failover-with-mongodb/ use it, but your example doesn’t…

  • Anonymous

    –shardsvr is not necessary and drives me bananas. It does absolutely nothing other than change the port to 27018. I generally don’t use it (and tell people not to use it) because it’s hard enough for people to wrap their heads around sharding without making them remember extra options that don’t really do anything.

  • Mark

    In my setup, when I talk to the mongos server, it seems like it always routes the request to the master. How can I make use of the slave to scale read performance of the shard? db.getMongo().setSlaveOk() on the mongos process doesn’t seem to have an affect on where the request is routed. Connecting directly to the slave in the shard works, but requires knowing which shard and seems to defeat the purpose of using mongos.

  • Anonymous

    Mongos automatically routes reads to slaves as of version 1.7.4 (which was released this week, so it’s unlikely you’re running yet). See http://groups.google.com/group/mongodb-user/browse_thread/thread/f4e9b756ec3ecbd7.

  • Mark

    Ah, how timely! If only I’d asked a few days ago, I would have looked prophetic instead of uninformed.. 🙂

    Btw, I love how your blog ranks better than the official MongoDB documentation for searches like “mongo replica shard”.

  • Anonymous

    Haha, seriously? Awesome! I’ll have to cover some more topics, see if I can beat out the rest of the wiki.

  • JohnC

    I’m having a hard time understanding the real purpose of listing ALL hostnames (for a given replica set) in the shard list, ie replSetName/host1,host2,…,hostN. Isn’t it enough to just specify one of the hostname in the replica set, since the data is replicated across all hosts in the replica set?

  • Anonymous

    That was true six months ago, but you can now list just one server in the seed list: MongoDB is smart enough to figure out the other members now.

  • JohnC

    Thanks for the quick reply! Just to be absolutely clear, your answer applies to the most recent release of MongoDB v1.6.5, right? We’re still using v1.6… that means we need to specify the entire replica set in the seed list, right? Thanks again.

  • Anonymous

    No, I think Kristina was talking about 1.8. It’s almost released 🙂

  • JohnC

    Got it, thanks! One more basic question about v1.6. Conversely, if the primary host in the replica set is missing from the shard seed list, does that mean mongos will never give write operations to that shard? For example, replica setA consists of A1, A2, A3, where A1 is currently the primary. However, the shard seed only specified 2 of the 3, as in: setA/A2,A3. Please let me know if this is all documented somewhere. I just couldn’t find it… Thanks in advance.

  • Anonymous

    I think that in 1.6 it just won’t work. I’m not sure if 1.6.5 can even distribute reads to slaves. I’d highly suggest giving all members unless you’re using 1.7.5+.

  • Hi,
    I am getting a “ERROR: config servers 10.179.77.89:2000 and 10.179.79.62:2000 differ” … “configServer startup check failed” error when starting a mongos server. There is no mention of a problem with the third config server 10.179.77.195:2000. I have tried to stop 10.179.77.89:2000 and 10.179.79.62:2000, repair and restart but I still get the problem. I find that if I stop 10.179.77.89:2000 and try to start my mongos server everything works fine but I am without my config server. Any idea of how to solve this problem?

    I have seen the suggestion (http://groups.google.com/group/mongodb-user/browse_thread/thread/167e4de3e69f5c8c/223eb94fa66b5185) of stopping all the config servers, deleting the config files, and restarting should fix the problem. Will this cause any problems though?

  • Anonymous

    Don’t delete all your config files! It sounds like one of the config servers somehow ended up with different data than the other two. (Was anything connecting directly to that config server? Could there have been anything writing to it?) Shut down one of your other config servers, make a copy of its data directory, and put that on your messed up config server.

  • Thanks. That worked! Not sure went wrong but will keep an eye out for it happening again

  • Excellent work! But I failed. I have done on your instructions. All right, all data is distributed, I worked with them. But I can not make a database dump. I get the following error:

    [moskrc@server9 test]$ /opt/mongodb/bin/mongodump -d mycms-prod-3
    connected to: 127.0.0.1
    DATABASE: mycms-prod-3 to dump/mycms-prod-3
    mycms-prod-3.cms_comment to dump/mycms-prod-3/cms_comment.bson
    16 objects
    mycms-prod-3.system.indexes to dump/mycms-prod-3/system.indexes.bson
    64 objects
    mycms-prod-3.cms_pdfcontent to dump/mycms-prod-3/cms_pdfcontent.bson
    18 objects
    mycms-prod-3.djangoratings_vote to dump/mycms-prod-3/djangoratings_vote.bson
    25 objects
    mycms-prod-3.auth_permission to dump/mycms-prod-3/auth_permission.bson
    192 objects
    mycms-prod-3.tracking_pagevisit to dump/mycms-prod-3/tracking_pagevisit.bson
    assertion: 11010 count fails:{ assertion: “setShardVersion failed host[109.231.122.41:28000] { errmsg: “not maste…”, assertionCode: 10429, errmsg: “db assertion failure”, ok: 0 }
    [moskrc@server9 tes]$

    This mongos.log

    Wed Apr 13 04:03:09 [conn1] setShardVersion failed host[109.231.122.41:28000] { errmsg: “not master”, ok: 0.0 }
    Wed Apr 13 04:03:09 [conn1] Assertion: 10429:setShardVersion failed host[109.231.122.41:28000] { errmsg: “not master”, ok: 0.0 }
    0x51f4a9 0x69b163 0x69acf2 0x69acf2 0x69acf2 0x576ba6 0x5774b6 0x575630 0x575b31 0x65f661 0x57bdcc 0x631062 0x66432c 0x6761c7 0x57ea3c 0x69ec30 0x3a9be0673d 0x3a9b6d40cd
    /opt/mongodb/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x129) [0x51f4a9]
    /opt/mongodb/bin/mongos [0x69b163]
    /opt/mongodb/bin/mongos [0x69acf2]
    /opt/mongodb/bin/mongos [0x69acf2]
    /opt/mongodb/bin/mongos [0x69acf2]
    /opt/mongodb/bin/mongos(_ZN5boost6detail8function17function_invoker4IPFbRN5mongo12DBClientBaseERKSsbiEbS5_S7_biE6invokeERNS1_15function_bufferES5_S7_bi+0x16) [0x576ba6]
    /opt/mongodb/bin/mongos(_ZN5mongo17ClientConnections13checkVersionsERKSs+0x1c6) [0x5774b6]
    /opt/mongodb/bin/mongos(_ZN5mongo15ShardConnection5_initEv+0x2d0) [0x575630]
    /opt/mongodb/bin/mongos(_ZN5mongo15ShardConnectionC1ERKNS_5ShardERKSs+0xa1) [0x575b31]
    /opt/mongodb/bin/mongos(_ZN5mongo15dbgrid_pub_cmds8CountCmd3runERKSsRNS_7BSONObjERSsRNS_14BSONObjBuilderEb+0x9e1) [0x65f661]
    /opt/mongodb/bin/mongos(_ZN5mongo7Command20runAgainstRegisteredEPKcRNS_7BSONObjERNS_14BSONObjBuilderE+0x67c) [0x57bdcc]
    /opt/mongodb/bin/mongos(_ZN5mongo14SingleStrategy7queryOpERNS_7RequestE+0x262) [0x631062]
    /opt/mongodb/bin/mongos(_ZN5mongo7Request7processEi+0x29c) [0x66432c]
    /opt/mongodb/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x77) [0x6761c7]
    /opt/mongodb/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x34c) [0x57ea3c]
    /opt/mongodb/bin/mongos(thread_proxy+0x80) [0x69ec30]
    /lib64/libpthread.so.0 [0x3a9be0673d]
    /lib64/libc.so.6(clone+0x6d) [0x3a9b6d40cd]

    Help me please.

    My system:
    OS Linux, CentOS 5.5, kernel: Linux 2.6.18-194.el5xen #1 SMP Fri Apr 2 15:34:40 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
    MongoDB – 1.8.1

    Thanks

  • Anonymous

    Hmm, I thought this was a bug that was fixed for 1.8.1… are you sure you’re running 1.8.1? The mailing list is generally a better place for debugging a problem like this.

kristina chodorow's blog