––thursday #5: diagnosing high readahead

Having readahead set too high can slow your database to a crawl. This post discusses why that is and how you can diagnose it.

The #1 sign that readahead is too high is that MongoDB isn’t using as much RAM as it should be. If you’re running Mongo Monitoring Service (MMS), take a look at the “resident” size on the “memory” chart. Resident memory can be thought of as “the amount of space MongoDB ‘owns’ in RAM.” Therefore, if MongoDB is the only thing running on a machine, we want resident size to be as high as possible. On the chart below, resident is ~3GB:

Is 3GB good or bad? Well, it depends on the machine. If the machine only has 3.5GB of RAM, I’d be pretty happy with 3GB resident. However, if the machine has, say, 15GB of RAM, then we’d like close to 15GB of the data to be in there (the “mapped” field is (sort of) data size, so I’m assuming we have 60GB of data).
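The comparison above can be written down as a rough check. This is only a sketch: the function name is made up for illustration, and it assumes MongoDB is the only thing on the box and that the app touches most of its data.

```python
def resident_shortfall_gb(resident_gb, ram_gb, mapped_gb):
    """How far resident memory falls short of what we'd hope for.

    Assumption: the best case is min(RAM, data size), i.e. MongoDB is
    alone on the machine and eventually touches that much data.
    """
    hoped_for = min(ram_gb, mapped_gb)
    return hoped_for - resident_gb

# The two machines from the text, both with 3GB resident and 60GB mapped.
print(resident_shortfall_gb(3, 3.5, 60))   # small machine
print(resident_shortfall_gb(3, 15, 60))    # big machine
```

A small shortfall (the 3.5GB machine) is nothing to worry about; a large one (the 15GB machine) is the symptom this post is about.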

Assuming we’re accessing a lot of this data, we’d expect MongoDB’s resident set size to be 15GB, but it’s only 3GB. Suppose we try turning down readahead, and the resident size jumps to 15GB and our app starts going faster. Why is this?

Let’s take an example: suppose all of our docs are 512 bytes in size (readahead is set in 512-byte increments, called sectors, so 1 doc = 1 sector makes the math easier). If we have 60GB of data then we have ~120 million documents (60GB of data/(512 bytes/doc)). The 15GB of RAM on this machine should be able to hold ~30 million documents.
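Here is the same arithmetic as a few lines of Python, in case you want to plug in your own numbers. It assumes “GB” means 2**30 bytes, which is my reading and not something the numbers above depend on.

```python
SECTOR_BYTES = 512   # readahead unit, and the document size in this example
GIB = 2**30          # assumption: "GB" read as 2**30 bytes

def docs_that_fit(capacity_bytes, doc_bytes=SECTOR_BYTES):
    """Number of whole documents that fit in the given number of bytes."""
    return capacity_bytes // doc_bytes

total_docs = docs_that_fit(60 * GIB)    # documents on disk
docs_in_ram = docs_that_fit(15 * GIB)   # documents RAM could hold

print(f"{total_docs:,} docs on disk")
print(f"{docs_in_ram:,} docs fit in RAM")
print(f"{docs_in_ram / total_docs:.0%} of the data set fits in RAM")
```

The exact counts come out a bit above the rounded ~120 million and ~30 million, but the ratio is what matters: a quarter of the data fits in RAM.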

Our application accesses documents randomly across our data set, so we’d expect MongoDB to eventually “own” (have resident) all 15GB of RAM, as 1) it’s the only thing running and 2) it’ll eventually fetch at least 15GB of the data.

Now, let’s set our readahead to 100 (100 512-byte sectors, aka 100 documents) with blockdev --setra 100 <device>, where <device> is the block device holding the data. What happens when we run our application?
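Before changing anything, it can help to see what readahead is currently set to. A minimal sketch, assuming a Linux box that exposes the setting in kilobytes under /sys/block/<name>/queue/read_ahead_kb; the helper names are made up, and the device name you pass in is whatever your data actually lives on.

```python
from pathlib import Path

SECTOR_BYTES = 512  # blockdev counts readahead in 512-byte sectors

def sectors_to_kib(sectors):
    """Convert a blockdev-style readahead (sectors) to KiB."""
    return sectors * SECTOR_BYTES / 1024

def kib_to_sectors(kib):
    """Convert a sysfs-style readahead (KiB) to 512-byte sectors."""
    return kib * 1024 // SECTOR_BYTES

def read_ahead_kib(device):
    """Current readahead in KiB for a device name, or None if unreadable."""
    path = Path("/sys/block") / device / "queue" / "read_ahead_kb"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None

print(sectors_to_kib(100))   # the readahead used in this example, in KiB
print(kib_to_sectors(128))   # a sysfs value of 128 KiB, in sectors
```

Note the two units: sysfs reports kilobytes while blockdev uses sectors, so a sysfs value of 128 is the same setting as a blockdev readahead of 256.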

Picture our disk as looking like this, where each o is a document:

...
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
... // keep going for millions more o's

Let’s say our app requests a document. We’ll mark it with “x” to show that the OS has pulled it into memory:

...
ooooooooooooooooooooooooo
ooooxoooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
...

See it on the third line there? But that’s not the only doc that’s pulled into memory: readahead is set to 100 so the next 99 documents are pulled into memory, too:

...
ooooooooooooooooooooooooo
ooooxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxx
xxxxooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
...

Is your OS returning this with every document?

Now we have 100 docs in memory, but remember that our application is accessing documents randomly: the likelihood that the next document we access is in that block of 100 docs is almost nil. At this point, there’s 50KB of data in RAM (512 bytes * 100 docs = 51,200 bytes) and MongoDB’s resident size has only increased by 512 bytes (1 doc).
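To put numbers on “almost nil,” here is a quick sketch using the rounded figures from the example (~120 million docs, readahead of 100). It treats accesses as uniformly random, which is the assumption the whole example rests on.

```python
SECTOR_BYTES = 512
readahead_sectors = 100
total_docs = 120_000_000   # rounded figure from the example

bytes_pulled = readahead_sectors * SECTOR_BYTES   # what the OS read from disk
bytes_wanted = SECTOR_BYTES                       # what MongoDB asked for
amplification = bytes_pulled // bytes_wanted

# Chance that one uniformly random request lands in this block of 100 docs.
p_next_in_block = readahead_sectors / total_docs

print(f"pulled {bytes_pulled:,} bytes to serve {bytes_wanted} bytes "
      f"({amplification}x)")
print(f"chance the next request hits this block: {p_next_in_block:.7%}")
```

So the OS does 100 times the work, and the odds that any of the extra 99 docs gets used next are less than one in a million.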

Our app will keep bouncing around the disk, reading docs from here and there and filling up memory with docs MongoDB never asked for until RAM is completely full of junk that’s never been used. Then, it’ll start evicting things to make room for new junk as our app continues to make requests.
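You can watch this happen in a toy simulation. This is a deliberately crude model, not how the kernel actually works: it assumes a plain LRU cache, one doc per sector, uniformly random requests, and a data set scaled down to 20,000 docs with room for 5,000 in “RAM.” The function name is made up. It reports what fraction of the cache holds docs the app actually asked for, which is roughly what resident size measures.

```python
import random
from collections import OrderedDict

def resident_fraction(total_docs, ram_docs, readahead, requests, seed=0):
    """Fraction of cache capacity holding docs the app really requested."""
    rng = random.Random(seed)
    cache = OrderedDict()  # doc id -> True if the app itself asked for it
    for _ in range(requests):
        doc = rng.randrange(total_docs)
        if doc in cache:
            # Already in memory: no disk read, just mark it as used.
            cache[doc] = True
            cache.move_to_end(doc)
            continue
        # Miss: the OS reads the doc plus the docs that follow it on disk.
        for d in range(doc, min(doc + readahead, total_docs)):
            asked_for = cache.pop(d, False) or d == doc
            cache[d] = asked_for
        # Evict the least recently used docs until we fit in RAM again.
        while len(cache) > ram_docs:
            cache.popitem(last=False)
    return sum(cache.values()) / ram_docs

small = resident_fraction(20_000, 5_000, readahead=1, requests=20_000)
large = resident_fraction(20_000, 5_000, readahead=100, requests=20_000)
print(f"readahead   1: {small:.1%} of RAM holds requested docs")
print(f"readahead 100: {large:.1%} of RAM holds requested docs")
```

With readahead of 1, everything in the cache is something the app wanted. With readahead of 100, nearly all of it is junk that came along for the ride.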

Working this out: 15GB of RAM holds a quarter of our 60GB of data, so there’s a 25% chance of our app requesting a doc that’s already in memory, and 75% of the requests are going to go to disk. Say we’re doing 2 requests a sec. Then 1 hour of requests is 2 requests/sec * 3600 seconds/hour = 7200 requests, 5400 of which are going to disk (.75 * 7200). If each request pulls back 50KB, that’s 270MB read from disk/hour. If we set readahead to 0, we’ll have under 3MB read from disk/hour (5400 * 512 bytes).
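Here is the hourly arithmetic spelled out, taking 75% of 7,200 requests (5,400) as the number of disk reads. Totals are in KB of 1,024 bytes, so the 51,200-byte readahead block comes out as an even 50KB. The function name is made up, and “no readahead” is modeled as reading exactly one 512-byte doc per disk read, which is a simplification.

```python
SECTOR_BYTES = 512
requests_per_sec = 2
miss_rate = 0.75   # share of requests that have to go to disk

requests_per_hour = requests_per_sec * 3600                   # 7,200
disk_reads_per_hour = round(requests_per_hour * miss_rate)    # 5,400

def kb_read_per_hour(docs_per_read):
    """KB (1,024 bytes) read from disk per hour at this docs-per-read."""
    return disk_reads_per_hour * docs_per_read * SECTOR_BYTES // 1024

print(f"{disk_reads_per_hour:,} disk reads/hour")
print(f"readahead 100: {kb_read_per_hour(100):,} KB/hour")
print(f"one doc/read:  {kb_read_per_hour(1):,} KB/hour")
```

That is roughly 270MB an hour against under 3MB an hour, for exactly the same set of useful documents.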

Which brings us to the next symptom of a too-high readahead: unexpectedly high disk IO. Because most of the data we want isn’t in memory, we keep having to go to disk, dragging shopping-carts full of junk into RAM, perpetuating the high disk IO/low resident memory cycle.

The general takeaway is that a DB is not a “normal” workload for an OS. The default settings may screw you over.

  • Jeremy Wilson

    How can you adjust readahead at the OS level?

  • kristina1

    Yes, and technically you’re setting it at the block device level. See the previous post, http://www.snailinaturtleneck.com/blog/2012/04/05/thursday-4-blockdev/, for more info on setting readahead.  

  • Guest

    Very informative and useful post!

    Is the resident size reported in MMS the same as the resident memory usage reported by the top command? (I don’t use MMS.) Running the top command on my mongodb server indicates that 26 GB of resident memory out of 32 GB is being used by the mongod process. Virtual memory usage is around 175GB. Readahead is set to 256 on the mongodb server, which seems high to me given that the average document size is around 512 bytes and the reads and writes are totally random (several hundred per second). iostat indicates that the disk utilization is almost always 100%.

    It would be great to hear your thoughts on whether reducing the read ahead would be beneficial in this scenario and anything else that would help in utilizing the RAM better. Are there any drawbacks of setting the read ahead to a low value like 10 or disabling read aheads?

  • kristina1

    Thank you! I think top’s resident size is the same as what MMS uses. 256 seems high for small docs if you have random-access reads/writes. I’d recommend lowering readahead gradually (try 128, 64, etc.) and seeing if things improve. There can be drawbacks: if you’re doing sequential disk accesses, lowering readahead will hurt performance.

  • Guest

    Great! Thank you for your reply. I’ll lower the readahead and monitor the page faults & disk utilization. Hopefully things will improve. We have a lot of writes – new documents are inserted, existing documents are updated that cause the object to grow over time and old documents are periodically deleted / evicted as a result of which the memory and disk would be pretty fragmented. Periodic compaction is needed but it means taking the node offline which isn’t ideal.
    As the data set exceeds the RAM and the access pattern is totally random, optimizing RAM utilization would be very helpful and  hopefully options like tweaking the readahead settings and any other such options would be beneficial.

    Your blog is very informative and well written. Keep up the good work!

  • kristina1

    Thank you!

kristina chodorow's blog