Oh, the Mistakes I’ve Seen

A slow database is easily fixed
If you make good choices of fields indexed.
Sometimes the answer is simpler still,
A quick code change may fit the bill.

I’ll be giving an O’Reilly webcast, Scaling with MongoDB, on Friday (9/17). Please sign up if you’re interested in learning some more advanced optimization than what this post gets into. This webcast is, in part, to pimp MongoDB: The Definitive Guide, which will be coming out next week!

These are a few basic tips on making your application better/faster/stronger without knowing anything about indexes or sharding.

Connecting

Connecting to the database is a (relatively) expensive operation. Try to minimize the number of times you connect and disconnect: use persistent connections or connection pooling (depending on your language).

To not waste connections, you have to know what your driver is doing. I see a lot of code like this in PHP:

$connection = new Mongo();
$connection->connect();

What this does is:

  1. The constructor connects to the database.
  2. connect() sees that you’re already connected, assumes you want to reset the connection.
  3. Disconnects from the database.
  4. Connects again.

Gah! You just doubled your execution time.

ObjectIds

ObjectIds seem to make people vaguely uncomfortable, so they convert their ObjectIds into strings (the macaroni and cheese of data types). The problem is, an ObjectId takes up 12 bytes but its string representation takes up 29 bytes (almost two and a half times bigger). The lesson: suck it up and eat your spinachy ObjectIds. You’ll learn to like ’em.

Also, an ObjectId won’t sneakily convert itself into a string on the fly. I see a lot of code like:

id = new ObjectId();
db.foo.insert({"_id" : new ObjectId(id)});
// or, even sillier
db.foo.insert({"_id" : new ObjectId(id.toString())});

If you created an ObjectId and haven’t messed with it, it’s still an ObjectId.

Numbers vs. Strings

MongoDB is type-sensitive and it’s important to use the correct type: numbers for numeric values and strings for strings.

If you have large numbers and you save them as strings (“1234567890” instead of 1234567890), MongoDB may slow down as it strcmps the entire length of the number instead of doing a quicker numeric comparison. Also, “12” is going to be sorted as less than “9”, because MongoDB will use string, not numeric, comparison on the values. This can lead to some surprising results.

Driver-specific

Find out if you’re driver is particularly weaknesses (or strengths). For instance, the Perl driver is one of the fastest drivers, but it sucks at decoding Date types (Perl’s DateTime objects take a long time to create). So, if you want fast Perl programs, avoid dates like the plague or you’ll be puttering along with the Ruby programmers. (Just kidding, Rubyists! Sort of.)

The most important thing is to get to know your language’s documentation and ask if you have any questions.

  • The problem with ObjectIds is that they are MongoDB-specific and not URL-friendly. If I access an object by URL (e.g. in a web admin app: “/admin/edit/{id}/”), I will obviously convert ObjectId to a UUID-like string before inserting it to URL. Okay, that works, but what if I have another database (DBD, Tokyo Cabinet, whatever) with, say, UUIDs as primary keys and import data from that database to a MongoDB collection? Well, I can reset the keys, but then all references within the database are lost. So I keep the keys. And then at some point I get an ObjectId(“4c506c1ca3baa4169c000006”) and a “4c506c1ca3baa4169c000006” in the same collection. And they are considered different primary keys until I convert them to strings, when — whoa! — they clash and we have all the pain in the assets. So there are cases when it is better to keep all primary keys as strings (unless I’ve missed something). But, as I understand, there’s no way to ensure this on the DB level. So we have to tell the ObjectId-derived strings from “plain” strings. I’ve chosen to prefix Mongo’s ids with “x-objectid-” and leave “plain” ids alone. The prefixed ones are cropped and converted to ObjectIds on find/save, and non-prefixed are interpreted as plain strings. Ugly but more or less safe. 🙁

  • Anonymous

    I was mostly trying to argue that people should get used to ObjectIds and respect them as their own type. If you know what you’re doing, I think it’s fine to save them as strings or use another type altogether. It sounds like you have a special case (mixing data from a few different db sources).

    As far as URLs go, ObjectIds are a bit unwieldy. A popular method is to base 64 encode the string value (doesn’t help with clashes, but it looks nicer).

  • Sam Sen

    I recall leading zeros in a field is a no-no with mongo. With zipcodes, I had to encapsulate the zipcodes within quotes and turn them into a string. For example, 19406 works but 08056 turns into 8056 when I insert.

  • Anonymous

    Yeah, you can’t store leading 0s with an integer because 01 is the same as 1, numerically. Same reason you can’t store leading zeros in an int in C or something.

  • I would like to use real MongoIDs (so no strings) as keys for arrays. Is that possible? And if it is possible in native MongoDB, is there some way to get this working in the PHP driver? As far as I know PHP isn’t able to handle objects as array keys.

  • Anonymous

    This is a limitation of both PHP (array keys must be strings or ints) and MongoDB (array keys must be strings or ints), so you’re pretty much screwed. You could use the string form or save it as a non-key.

  • Okay, thank you! I expected that but wanted to be sure. Anyway, keep on your good work!

  • Mark

    PHP doesn’t natively support long’s, so using large numbers in their numeric form in your application (eg. Facebook IDs), can wreak havoc when you suddenly have a float instead of the whole number that you were expecting. Therefore forcing the ID to be a string (as Facebook now has in their Graph API) can save a lot of headaches (unfortunately at the cost of space and the loss of being able to do numeric operators).

    Also, if you know you’re creating a unique index, I think you can save index space by just using the keys of the unique index as the ‘_id’ value. Otherwise you will have your unique index and the (potentially unneeded) default MongoId index.

  • Anonymous

    You can also use MongoInt64 to store 8-byte integers.

  • Mark

    Yeah, Mongo can handle the large numbers, but somewhere down the road, you’ll likely shoot yourself in the foot by taking that large number from Mongo, and then creating a URL of the form, http://url.com/id/1.23456e+15.. 🙂

    Of course, you’d have the same problem using MySQL or something else. I wish PHP handled seamlessly handled longs like Perl..

  • jpmckinney

    Um, Ruby is faster than Perl, actually http://benchmarksgame.alioth.debian.org/

  • kristina1

    At the time this was written (over 2 years ago) the MongoDB Perl driver was much faster than the Ruby driver.  

  • Michael Powell

    The beef I have with the MongoDB ObjectId is simply this: why not use UUID? Why the necessity to invent a brand new, basically-the-same byte-array-based-id, when the UUID is ubiquitous in just about any platform. Even some flavors of Linux have a uuidgen, for crying out loud. Anyhow, good article, thank you for the insights.

  • kristina1

    Thanks! I agree that it’s reinventing the wheel, but UUIDs would actually incur a performance penalty. ObjectIds include a timestamp so they “increase” over time which makes indexing them easier. UUIDs could not be indexed as quickly.

  • Michael Powell

    No problem. Sure there’s the 16-byte versus 12-byte cost. The benefit of UUID are the flexibility of which strategy to use, including date/time stamps. Depending on the ‘indexing’ used, can also be very fast in a database engine.

  • kristina1

    Ah, cool, I didn’t know about timestamp-based UUIDs. Yeah, that would probably have been a better choice.

  • Michael Powell

    It’s strategy-dependent, but usually involves some sort of sequential nature. Check out SQL Server NEWSEQUENTIALID, for instance, which has the express purpose of avoiding fragmented indexes.

kristina chodorow's blog