--thursday #4: blockdev

Disk IO is slow. You just won’t believe how vastly, hugely, mind-bogglingly slow it is. I mean, you may think your network is slow, but that’s just peanuts to disk IO.

The image below helps visualize how slow (post continues below).

(Originally found on Hacker News and inspired by Gustavo Duarte’s blog.)

The kernel knows how slow the disk is and tries to be smart about accessing it. It not only reads the data you requested, it also reads a bit more of what comes after it (this is called readahead). This way, if you’re reading through a file or watching a movie (sequential access), your system doesn’t have to go to disk as frequently because you’re pulling more data back than you strictly requested each time.

You can see how far the kernel reads ahead using the blockdev tool:

$ sudo blockdev --report
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0     80026361856   /dev/sda
rw   256   512  4096       2048     80025223168   /dev/sda1
rw   256   512  4096          0   2000398934016   /dev/sdb
rw   256   512  1024       2048        98566144   /dev/sdb1
rw   256   512  4096     194560      7999586304   /dev/sdb2
rw   256   512  4096   15818752     19999490048   /dev/sdb3
rw   256   512  4096   54880256   1972300152832   /dev/sdb4

Readahead is listed in the “RA” column. As you can see, I have two disks (sda and sdb) with readahead set to 256 on each. But what unit is that 256? Bytes? Kilobytes? Dolphins? If we look at the man page for blockdev, it says:

$ man blockdev
...
       --setra N
              Set readahead to N 512-byte sectors.
...

This means that my readahead is 256 × 512 bytes = 131,072 bytes, or 128KB. So whenever I read from disk, the disk is actually reading at least 128KB of data, even if I only requested a few bytes.
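If you'd rather not do that multiplication in your head for every device, here is a minimal Python sketch that does it for you. It assumes the report has the same column layout as the output above (other versions of blockdev may print different columns), and the function name `readahead_kib` and the trimmed-down `REPORT` sample are just for illustration.

```python
# blockdev counts readahead in 512-byte sectors (per the man page excerpt above).
SECTOR_BYTES = 512

# Two rows copied from the report above; in practice you would paste in
# the full output of `sudo blockdev --report`.
REPORT = """\
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0     80026361856   /dev/sda
rw   256   512  1024       2048        98566144   /dev/sdb1
"""


def readahead_kib(report):
    """Map each device in the report to its readahead in KiB."""
    lines = report.strip().splitlines()
    header = lines[0].split()
    # Look columns up by name so a reordered header doesn't silently break this.
    ra_col = header.index("RA")
    dev_col = header.index("Device")
    result = {}
    for line in lines[1:]:
        fields = line.split()
        result[fields[dev_col]] = int(fields[ra_col]) * SECTOR_BYTES // 1024
    return result


print(readahead_kib(REPORT))  # {'/dev/sda': 128, '/dev/sdb1': 128}
```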

So what value should you set your readahead to? Please don’t set it to a number you find online without understanding the consequences. If you Google for “blockdev setra”, the first result uses blockdev --setra 65536, which translates to 32MB of readahead. That means that, whenever you read from disk, the disk is actually doing 32MB worth of work. Please do not set your readahead this high if you’re doing a lot of random-access reads and writes, as all of the extra IO can slow things down a lot (and if you’re low on memory, you’ll be forcing the kernel to fill up your RAM with data you won’t need).
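To put a rough number on "a lot", here is a back-of-the-envelope sketch comparing the two settings for a small random read. It takes the simple model used in this post (every read pulls in at least the full readahead window) at face value, so treat the result as a worst case rather than a measurement. The 4KB request size and the function name `worst_case_amplification` are assumptions made up for the example.

```python
# blockdev counts readahead in 512-byte sectors.
SECTOR_BYTES = 512


def worst_case_amplification(ra_sectors, request_bytes):
    """Bytes read from disk per byte requested, assuming every read
    pulls in at least the full readahead window (a simplifying assumption)."""
    readahead_bytes = ra_sectors * SECTOR_BYTES
    return max(readahead_bytes, request_bytes) / request_bytes


REQUEST_BYTES = 4096  # hypothetical size of one small random read

for ra in (256, 65536):
    kib = ra * SECTOR_BYTES // 1024
    amp = worst_case_amplification(ra, REQUEST_BYTES)
    print(f"--setra {ra}: {kib} KiB readahead, up to {amp:.0f}x the data requested")
```

Under those assumptions, a 4KB read costs up to 32 times the requested data at a readahead of 256, and up to 8192 times at 65536.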

Getting a good readahead value can help disk IO issues to some extent, but if you are using MongoDB (in particular), please consider your typical document size and access patterns before changing your blockdev settings. I’m not recommending any particular value because what’s perfect for one application/machine can be death for another.

I’m really enjoying these --thursday posts because every week people have commented with different/better/interesting ways of doing what I talked about (or ways of telling the difference between stalagmites and stalactites), which is really cool. So I’m throwing this out there: how would you figure out what a good readahead setting is? Next week I’m planning to do iostat for --thursday which should cover this a bit, but please leave a comment if you have any ideas.

  • Mordy Ovits

    “You just won’t believe how vastly, hugely, mind-bogglingly slow it is. I
    mean, you may think your network is slow, but that’s just peanuts to
    disk IO.”

    Heh.  DNA FTW.

  • Is there any way to collect stats on the average size of the blocks the system reads from disk?

  • kristina1

     🙂

  • kristina1

     Yes, iostat can show you info about what the disk is doing.  However, I’m not sure what the best way of correlating that to what MongoDB is using is.  I heard one suggestion that you could check how disk IO compared to how much was going into resident memory, but it seems like that only would work until you’ve filled up resident memory.

  • Testq

    I read your book – very well written. I have a question. Is there a concept of a temp DB like in traditional DBs? Is there something along the lines of staging vs. production databases? I would like to set up an environment where production data is in one DB, separate from a DB where users can freely run ad-hoc queries for testing or learning purposes, and where they may copy large chunks of production data to their test DB. Given the memory-mapped nature of MongoDB, is it possible to make such temp copies of your big production collections?

  • kristina1

    Thank you! There is no built-in mechanism for doing this with MongoDB. You’d probably want to take snapshots of a secondary and then use those to re-create “clean” staging dbs for people to play with. You might want to ask on https://groups.google.com/forum/#!forum/mongodb-user about this for more ideas.

kristina chodorow's blog