On MapReduce in MongoDB

MapReduce can be confusing at first. It’s not that it’s terribly complex; rather, when you work with it in tools like MongoDB, you’re simply not exposed to all the pieces. I’m hoping this post helps put the puzzle together.

In MongoDB, a Map function runs once for each document in a collection and emits key/value pairs. The pairs are grouped by key: each key appears once, and its value is an array that gets appended to with each value the map function emits for that key. The Reduce function is then called with a key and its array of values, and returns a single value for that key.

Consider the common example of blog posts and their tags. A common task is to count how many times each tag appears across all posts. With SQL, a simple GROUP BY and COUNT will get that answer. In NoSQL databases like CouchDB and MongoDB that use MapReduce, a different approach is needed.

First, let’s create some data:

db.Posts.insert({ Name : "On Installing MongoDB as a Service On Windows",
                  Tags : ["mongodb"] });
db.Posts.insert({ Name : "On Running NerdDinner on MongoDB with NoRM",
                  Tags : ["mongodb", "norm", "mvc"] });
db.Posts.insert({ Name : "On A Simple IronPython Route Mapper for ASP.NET MVC",
                  Tags : ["mvc", "ironpython"] });

Now we have a simple collection of Posts, where each document has a Name field and an array field named Tags. Next comes the map function:

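var map = function() {
    // Skip documents that have no Tags field.
    if (!this.Tags) { return; }
    // Emit one (tag, 1) pair for each tag on this document.
    for (var index = 0; index < this.Tags.length; index++) {
        emit(this.Tags[index], 1);
    }
};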

MongoDB runs the Map function once for each document in the collection, with this bound to the current document. For each tag in a Post document’s Tags array, we emit the tag as the key and a 1 as the value, which adds a 1 to that tag’s value array. In other words, we’re conceptually creating a set that looks like:

{
   "ironpython" : [1],
   "mongodb" : [1, 1],
   "mvc" : [1, 1],
   "norm" : [1]
}
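To make the grouping concrete, here is a rough sketch in plain JavaScript with no MongoDB involved. The posts array, the pairs list and the emit stand-in are my own assumptions for illustration; the real emit is supplied by the server, and how it stores pairs internally is not something this sketch claims to reproduce.

```javascript
// In-memory copy of the three Posts documents (Name omitted for brevity).
var posts = [
    { Tags: ["mongodb"] },
    { Tags: ["mongodb", "norm", "mvc"] },
    { Tags: ["mvc", "ironpython"] }
];

// Every emitted [key, value] pair, in the order emit() was called.
var pairs = [];

// Hypothetical stand-in for the emit() that MongoDB provides inside map.
function emit(key, value) {
    pairs.push([key, value]);
}

// Same logic as the map function above.
var map = function () {
    if (!this.Tags) { return; }
    for (var i = 0; i < this.Tags.length; i++) {
        emit(this.Tags[i], 1);
    }
};

// Run map once per document, with `this` bound to that document.
posts.forEach(function (post) { map.call(post); });

// Group the emitted values by key.
var grouped = Object.create(null);
pairs.forEach(function (pair) {
    var key = pair[0];
    if (!grouped[key]) { grouped[key] = []; }
    grouped[key].push(pair[1]);
});

console.log(JSON.stringify(grouped));
// {"mongodb":[1,1],"norm":[1],"mvc":[1,1],"ironpython":[1]}
```

The keys come out in emit order here rather than alphabetically, but the contents match the conceptual set: six emitted pairs collapse into four keys.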

The calls to emit are what create that new key/value set. This set is then fed to reduce, one key at a time.

var reduce = function(key, vals) {
    // Sum the values emitted for this key.
    var count = 0;
    for (var index = 0; index < vals.length; index++) {
        count += vals[index];
    }
    return count;
};

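Reduce will take a tag and sum up its vals array, which in this case simply contains a 1 for each occurrence of the tag. MongoDB does not call reduce for a key that has only a single value, so "ironpython" and "norm" pass through with their emitted 1. It may also call reduce more than once for the same key, feeding earlier results back in as values, so the return value must have the same shape as the emitted values; a sum satisfies that. Conceptually, the input to reduce is:

reduce("mongodb", [1, 1]);
reduce("mvc", [1, 1]);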
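The summing can be tried out in plain JavaScript, again without MongoDB. This sketch also checks one property worth knowing: MongoDB may call reduce more than once for the same key and pass an earlier result back in as one of the values, so reducing in stages has to give the same answer as reducing all at once. The grouped object is copied from the conceptual set above, and the three-occurrence case at the end is a made-up input for illustration.

```javascript
// Same logic as the reduce function above.
var reduce = function (key, vals) {
    var count = 0;
    for (var i = 0; i < vals.length; i++) {
        count += vals[i];
    }
    return count;
};

// The grouped map output sketched earlier.
var grouped = {
    ironpython: [1],
    mongodb: [1, 1],
    mvc: [1, 1],
    norm: [1]
};

// Apply reduce to every key. MongoDB skips reduce for single-value keys;
// calling it anyway is harmless here because the sum of [1] is 1.
var counts = {};
Object.keys(grouped).forEach(function (tag) {
    counts[tag] = reduce(tag, grouped[tag]);
});
console.log(JSON.stringify(counts));
// {"ironpython":1,"mongodb":2,"mvc":2,"norm":1}

// Re-reduce check with a hypothetical tag that was emitted three times:
// reducing a partial result together with a new value must match
// reducing all three values in one call.
var direct = reduce("mvc", [1, 1, 1]);
var partial = reduce("mvc", [1, 1]);
var rereduced = reduce("mvc", [partial, 1]);
console.log(direct === rereduced); // true
```

A reduce that returned, say, an object while map emitted plain numbers would fail this check, which is why the counting example emits 1 and returns a number.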

Calling the mapreduce command produces a collection where each tag is paired with its count. The new collection has each tag as an _id field and each count as a value field.

var result = db.runCommand({
    mapreduce : "Posts",  // source collection
    map : map,
    reduce : reduce,
    out : "Tags"          // collection that receives the results
});
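In the shell, db.Tags.find() should list the resulting documents. To see the shape of that output without a server, here is an end-to-end sketch in plain JavaScript. simulateMapReduce, the emit stand-in and the in-memory posts array are hypothetical names of mine, not MongoDB APIs, and the sorting by key is only there to make the output predictable.

```javascript
// In-memory copy of the three Posts documents (Name omitted for brevity).
var posts = [
    { Tags: ["mongodb"] },
    { Tags: ["mongodb", "norm", "mvc"] },
    { Tags: ["mvc", "ironpython"] }
];

// key -> array of emitted values, filled in by emit().
var groups = Object.create(null);

// Hypothetical stand-in for the emit() that MongoDB provides inside map.
function emit(key, value) {
    if (!groups[key]) { groups[key] = []; }
    groups[key].push(value);
}

var map = function () {
    if (!this.Tags) { return; }
    for (var i = 0; i < this.Tags.length; i++) {
        emit(this.Tags[i], 1);
    }
};

var reduce = function (key, vals) {
    var count = 0;
    for (var i = 0; i < vals.length; i++) {
        count += vals[i];
    }
    return count;
};

// Hypothetical helper that mimics the command's output shape.
function simulateMapReduce(docs, mapFn, reduceFn) {
    groups = Object.create(null);
    docs.forEach(function (doc) { mapFn.call(doc); });
    return Object.keys(groups).sort().map(function (key) {
        var vals = groups[key];
        // Like MongoDB, skip reduce when a key has a single value.
        var value = vals.length === 1 ? vals[0] : reduceFn(key, vals);
        return { _id: key, value: value };
    });
}

var tags = simulateMapReduce(posts, map, reduce);
console.log(JSON.stringify(tags));
// [{"_id":"ironpython","value":1},{"_id":"mongodb","value":2},
//  {"_id":"mvc","value":2},{"_id":"norm","value":1}]
```

Each element has the tag as _id and the count as value, which is the document shape described above for the Tags collection.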

The first couple of times I looked at the MapReduce docs for MongoDB, it was unclear what was happening. Understanding the inputs and outputs of each function, as well as when each is invoked, is critical.


