Here are some exercises to battle-test your MongoDB instance before going into production. You’ll need a Database Master (aka DM) to make bad things happen to your MongoDB install and one or more players to try to figure out what’s going wrong and fix it.
This was going to go into MongoDB: The Definitive Guide, but it didn’t quite fit with the other material, so I decided to put it here instead. Enjoy!
Tomb of Horrors
Try killing off different components of your system: mongos processes, config servers, primaries, secondaries, and arbiters. Try killing them in every way you can think of, too. Here are some ideas to get you started:
- Clean shutdown: shut down from the MongoDB shell (db.shutdownServer()) or send SIGINT.
- Hard shutdown: kill -9.
- Shut down the underlying machine.
- If you’re running on a virtual machine, stop the virtual machine.
- If you’re running on physical hardware, unplug the machine.
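The first few of these can be sketched as shell one-liners. This assumes a single mongod running locally with default settings; adjust hosts, ports, and privileges for your setup.

```shell
# Clean shutdown from the MongoDB shell:
mongo admin --eval "db.shutdownServer()"

# Clean shutdown via signal (mongod handles SIGINT gracefully):
kill -2 "$(pidof mongod)"

# Hard shutdown: no flush, no cleanup:
kill -9 "$(pidof mongod)"

# Take down the whole machine:
sudo shutdown -h now
```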
A slightly more difficult twist is to make these servers unrecoverable: decommission the virtual machine, firewall a box off from the network, pick up a physical machine and hide it in a closet.
@markofu’s suggestion: make netcat bind to 27017 so mongod can’t start back up again:
$ while [ 1 ]; do echo -e "MongoDB shell version: 2.4.0\nconnecting to: test\n>"; nc -l 27017; done
DM’s guide: make sure no data is lost.
The Adventure of the Disappearing Data Center
Similar to above, but more organized. You can either have a data center go down (shut down all the servers there) or you can just configure your network not to let any connections in or out, which is a more evil way of doing it. If you do this via networking, once your players have dealt with the data center going down, you can bring it back and make them deal with that, too.
Note that any replica set with a majority in the “down” data center will still have a primary when it comes back online. If your players have reconfigured the remainder of the set in another data center to be primary, these members will be kicked out of the set.
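A sketch of the forced reconfig the players might attempt from the surviving data center. The hostname and member indexes here are assumptions; the point is the `{force: true}` option, which lets a minority reconfigure itself.

```shell
# Run against a surviving member; keep only the members in the
# surviving data center (indexes 0 and 1 are assumptions).
mongo --host survivor.example.com --eval '
  var c = rs.conf();
  c.members = [c.members[0], c.members[1]];
  c.version++;
  rs.reconfig(c, {force: true});
'
```

This is exactly the reconfiguration that gets those members kicked out when the original majority comes back with the old config.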
Find the Rogue Query
There are several types of queries that you can run that will pound on your system. If you’d like to teach operators how to track these types of queries down and kill them, this is a good game to play.
To test a query that stresses disk IO, run a query on a large collection that probably isn’t all in memory, such as the oplog. If you have a large, application-specific collection, that’s even better, as it will raise fewer red flags with the players as to why it’s running. Make sure it has to return hundreds of gigabytes of data.
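One way to sketch such a query: a full scan of the oplog, which on a busy system is far bigger than RAM. Run this from a machine near the primary.

```shell
# {$natural: 1} forces a front-to-back collection scan; itcount()
# iterates every document, dragging the whole oplog through memory.
mongo local --eval '
  db.oplog.rs.find().sort({$natural: 1}).itcount();
'
```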
Kicking off a complex MapReduce can pin a single core. Similarly, if you can do complex aggregations on non-indexed keys, you can probably get multiple cores.
Stressing memory and CPU can be done by building background indexes on numerous databases at the same time.
To be really tricky, you could find a frequently-used query that uses an index and drop the index.
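For the players’ side of the game, a sketch of hunting the rogue operation down with the shell helpers (the 10-second threshold is an arbitrary assumption):

```shell
mongo --eval '
  // list operations that have been running for more than 10 seconds
  db.currentOp().inprog.forEach(function(op) {
    if (op.secs_running > 10) {
      printjson({opid: op.opid, ns: op.ns, secs: op.secs_running});
      // db.killOp(op.opid);  // uncomment to actually kill it
    }
  });
'
```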
DM’s guide: players should re-warm the cache to get the application back to normal speed.
THAC0, aka Bad System Settings
Try setting readahead to 65,000 and watch MongoDB’s RAM utilization go down and the disk IO go through the roof.
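A sketch of the readahead change, assuming the data volume is /dev/sdb (use whatever device actually holds your dbpath):

```shell
# Crank readahead way up (value is in 512-byte sectors):
sudo blockdev --setra 65000 /dev/sdb
sudo blockdev --getra /dev/sdb    # verify the new setting

# Put it back to something sane afterwards:
sudo blockdev --setra 256 /dev/sdb
```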
Set slaveDelay=30 on most of your secondaries and watch all of your application’s w: majority writes take 30 seconds.
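Delaying one secondary looks roughly like this; member index 1 is an assumption, and note that a delayed member must have priority 0:

```shell
mongo --eval '
  var c = rs.conf();
  c.members[1].priority = 0;    // delayed members cannot be electable
  c.members[1].slaveDelay = 30; // lag 30 seconds behind the primary
  c.version++;
  rs.reconfig(c);
'
```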
Use rs.syncFrom() to create a replication chain where every server only has one server syncing from it (the longest possible replication chain). Then see how long it takes for w: majority writes to happen. How about if everyone is syncing directly from the primary?
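Building the longest possible chain can be sketched like this, for a four-member set (hostnames are assumptions), giving primary ← b ← c ← d:

```shell
mongo --host b.example.com --eval 'rs.syncFrom("primary.example.com:27017")'
mongo --host c.example.com --eval 'rs.syncFrom("b.example.com:27017")'
mongo --host d.example.com --eval 'rs.syncFrom("c.example.com:27017")'
```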
What happens if your MongoDB instance gets more than it can handle? This is especially useful if you’re on a multi-tenant virtual machine: what’s going to happen to your application when one of your neighbors is behaving badly? However, it’s also good to test what might happen if you get a lot more traffic than you expect. You can use the Linux dd tool to write tons of garbage to the data volume (not the data directory!) and see what happens to your application.
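The dd trick might look like the following. The mount point is an assumption: the target file must live on the same volume as your dbpath but must not be one of MongoDB’s own files.

```shell
# Write ~10 GB of garbage to the data volume to starve mongod of IO.
dd if=/dev/zero of=/data/garbage.bin bs=1M count=10000 oflag=direct

# Clean up when the exercise is over:
rm /data/garbage.bin
```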
Try using a script to randomly turn the network on and off using iptables. For increased realism, remember that you’re more likely to lose connectivity between data centers than within one, so be sure to test that case.
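A minimal flapping script, assuming PEER is the other data center’s subnet. Run as root, and make sure the rules don’t cut off your own way back in:

```shell
PEER=10.1.2.0/24
while true; do
  # cut off traffic to and from the peer data center
  iptables -A INPUT  -s "$PEER" -j DROP
  iptables -A OUTPUT -d "$PEER" -j DROP
  sleep $((RANDOM % 60))
  # restore connectivity
  iptables -D INPUT  -s "$PEER" -j DROP
  iptables -D OUTPUT -d "$PEER" -j DROP
  sleep $((RANDOM % 300))
done
```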
Network issues will generally cause failovers and application errors. It can be very difficult to figure out what’s going on without good monitoring or looking at logs.