MongoDB Map-Reduce

Map-reduce is a data processing paradigm for condensing large volumes of data into useful aggregated results.

For map-reduce operations, MongoDB provides the mapReduce database command.

The mapReduce command allows you to run map-reduce aggregation operations over a collection. The mapReduce command has the following prototype form:


db.runCommand(
     {
               mapReduce: <collection>,
               map: <function>,
               reduce: <function>,
               finalize: <function>,
               out: <output>,
               query: <document>,
               sort: <document>,
               limit: <number>,
               scope: <document>,
               verbose: <boolean>
     }
)

 

Pass the name of the source collection as the value of the mapReduce field (i.e. <collection>); its documents are the input to the map-reduce operation.

The command also accepts the following parameters:

  • mapReduce: The name of the collection on which you want to perform map-reduce. This collection will be filtered using query before being processed by the map function.
  • map: A JavaScript function that associates or "maps" a value with a key and emits the key and value pair.
  • reduce: A JavaScript function that "reduces" to a single object all the values associated with a particular key.
  • out: Specifies where to output the result of the map-reduce operation. You can either output to a collection or return the result inline.
  • query: Optional. Specifies the selection criteria using query operators for determining the documents input to the map function.
  • sort: Optional. Sorts the input documents. This option is useful for optimization. For example, specify the sort key to be the same as the emit key so that there are fewer reduce operations. The sort key must be in an existing index for this collection.
  • limit: Optional. Specifies a maximum number of documents for the input into the map function.
  • finalize: Optional. Runs after the reduce function and modifies the output.
  • scope: Optional. Specifies global variables that are accessible in the map, reduce and finalize functions.
  • verbose: Optional. Specifies whether to include the timing information in the result information. verbose defaults to true, which includes the timing information.

 

The following is a prototype usage of the mapReduce command:

var mapFunction = function() { ... };
var reduceFunction = function(key, values) { ... };
db.runCommand(
{
      mapReduce: <input-collection>,
      map: mapFunction,
      reduce: reduceFunction,
      out: { merge: <output-collection> },
      query: <query>
}
)
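As a concrete illustration of the prototype, the sketch below fills in the placeholders with hypothetical values: an orders collection whose documents have status, cust_id and amount fields, and an order_totals output collection. None of these names come from the text above; they are assumptions made for this example. The command is built as a plain object so it can be inspected before it is passed to db.runCommand in the mongo shell.

```javascript
// Hypothetical example: total the amount of active orders per customer.
// Collection and field names (orders, order_totals, cust_id, amount) are assumptions.
var mapFunction = function () {
    // `this` is the current input document.
    emit(this.cust_id, this.amount);
};

var reduceFunction = function (key, values) {
    // Sum all amounts emitted for one customer.
    var total = 0;
    values.forEach(function (v) { total += v; });
    return total;
};

var command = {
    mapReduce: "orders",
    map: mapFunction,
    reduce: reduceFunction,
    out: { merge: "order_totals" },
    query: { status: "A" }
};

// In the mongo shell, run it with: db.runCommand(command)
```

Because the reduce function is ordinary JavaScript, it can be tried on its own first, for example reduceFunction("c1", [10, 20, 5]).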

Requirements for the map Function
The map function is responsible for transforming each input document into zero or more documents. It can access the variables defined in the scope parameter, and has the following prototype:

function(){
     ...
     emit(key,value);
}

The map function has the following requirements:

  • In the map function, reference the current document as this within the function.
  • The map function should not access the database for any reason.
  • The map function should be pure, that is, it should have no side effects outside of the function.
  • A single emit can only hold half of MongoDB's maximum BSON document size.
  • The map function may optionally call emit(key,value) any number of times to create an output document associating key with value.

The following map function will call emit(key,value) either 0 or 1 times depending on the value of the input document’s status field:

function(){   
   if(this.status=='A')       
      emit(this.cust_id,1);
}

The following map function may call emit(key,value) multiple times depending on the number of elements in the input document’s items field:

function(){this.items.forEach(function(item){emit(item.sku,1);});}
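To see what these two map functions produce without a running server, the following sketch calls them against sample documents in plain JavaScript. The runMap helper, the global emit stand-in and the sample documents are hypothetical and written only for this illustration; inside MongoDB, emit is provided by the server.

```javascript
// Collected (key, value) pairs; a stand-in for the server's emit buffer.
var emitted = [];
function emit(key, value) { emitted.push({ key: key, value: value }); }

// Hypothetical helper: apply a map function to each document, binding `this`.
function runMap(mapFn, docs) {
    emitted = [];
    docs.forEach(function (doc) { mapFn.call(doc); });
    return emitted;
}

// Emits 0 or 1 times per document.
var mapByStatus = function () {
    if (this.status == 'A')
        emit(this.cust_id, 1);
};

// Emits once per element of the items array.
var mapBySku = function () {
    this.items.forEach(function (item) { emit(item.sku, 1); });
};

// Sample input documents (assumed shape).
var docs = [
    { cust_id: "c1", status: "A", items: [{ sku: "x" }, { sku: "y" }] },
    { cust_id: "c2", status: "B", items: [] }
];

var byStatus = runMap(mapByStatus, docs); // only the status 'A' document emits
var bySku = runMap(mapBySku, docs);       // one pair per item; the empty array emits nothing
```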

Requirements for the Reduce Function
The reduce function has the following prototype:


function(key,values){
     ...
     return result;
}

The reduce function exhibits the following behaviors:

  • The reduce function should not access the database, even to perform read operations.
  • The reduce function should not affect the outside system.
  • MongoDB will not call the reduce function for a key that has only a single value. The values argument is an array whose elements are the value objects that are "mapped" to the key.
  • MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.
  • The reduce function can access the variables defined in the scope parameter.
  • The inputs to reduce must not be larger than half of MongoDB's maximum BSON document size. This requirement may be violated when large documents are returned and then joined together in subsequent reduce steps.

Because it is possible to invoke the reduce function more than once for the same key, the following properties need to be true:

  • the type of the return object must be identical to the type of the value emitted by the map function.
  • the reduce function must be associative. The following statement must be true:
    reduce(key,[C,reduce(key,[A,B])])==reduce(key,[C,A,B])
  • the reduce function must be idempotent. Ensure that the following statement is true:
    reduce(key,[reduce(key,valuesArray)])==reduce(key,valuesArray)
  • the reduce function should be commutative: that is, the order of the elements in the valuesArray should not affect the output of the reduce function, so that the following statement is true:
    reduce(key,[A,B])==reduce(key,[B,A])
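These properties can be checked directly in plain JavaScript before a reduce function is handed to MongoDB. The sketch below does so for a simple sum reducer and then shows a counter-example; reduceSum, reduceCount and the sample values A, B and C are assumptions introduced only for this illustration.

```javascript
// A sum reducer: the usual companion to a map function that emits 1 per key.
var reduceSum = function (key, values) {
    var total = 0;
    values.forEach(function (v) { total += v; });
    return total;
};

var A = 2, B = 3, C = 5;

// The three statements from the list above, evaluated for reduceSum.
var associative = reduceSum("k", [C, reduceSum("k", [A, B])]) === reduceSum("k", [C, A, B]);
var idempotent = reduceSum("k", [reduceSum("k", [A, B, C])]) === reduceSum("k", [A, B, C]);
var commutative = reduceSum("k", [A, B]) === reduceSum("k", [B, A]);

// Counter-example: counting array elements breaks as soon as an earlier
// reduce output is fed back in, because that output counts as one element.
var reduceCount = function (key, values) { return values.length; };
var countIsAssociative =
    reduceCount("k", [C, reduceCount("k", [A, B])]) === reduceCount("k", [C, A, B]);
```

The usual way to count safely is to have the map function emit 1 for each document and sum those values in reduce, as reduceSum does.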

Requirements for the finalize Function

The finalize function has the following prototype:

function(key,reducedValue){
     ...
     return modifiedObject;
}

The finalize function receives as its arguments a key value and the reducedValue from the reduce function. Be aware that:

  • The finalize function should not access the database for any reason.
  • The finalize function should be pure, that is, it should have no side effects outside of the function.
  • The finalize function can access the variables defined in the scope parameter.
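A common use of finalize is to derive a value that cannot safely be computed inside reduce, such as an average. The sketch below assumes a map function that emits values of the form { sum: ..., count: ... }; the names reduceStats and finalizeAvg and the sample values are hypothetical.

```javascript
// Reduce keeps a running { sum, count } so that it stays safe to re-reduce.
var reduceStats = function (key, values) {
    var result = { sum: 0, count: 0 };
    values.forEach(function (v) {
        result.sum += v.sum;
        result.count += v.count;
    });
    return result;
};

// Finalize receives the reduced value for a key and adds the average to it.
var finalizeAvg = function (key, reducedValue) {
    reducedValue.avg = reducedValue.sum / reducedValue.count;
    return reducedValue;
};

var reduced = reduceStats("c1", [{ sum: 10, count: 1 }, { sum: 30, count: 3 }]);
var finalized = finalizeAvg("c1", reduced); // sum 40, count 4, avg 10
```

Computing the average in reduce instead would break the associativity requirement, because an average of averages is generally not the overall average.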

out Options

You can specify the following options for the out parameter:

Output to a Collection

This option outputs to a new collection, and is not available on secondary members of replica sets.

out:<collectionName>
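The example later in this post passes an object rather than a collection name to out. As a brief, hedged summary of the forms documented for the mapReduce command, the sketch below writes them as plain JavaScript objects; the collection name emp_dept_test is taken from that example, and the variable names are only for illustration.

```javascript
// Replace the contents of the output collection with the new result.
var outReplace = { replace: "emp_dept_test" };

// Merge the new result into the output collection; for a key that
// already exists, the new result overwrites the existing document.
var outMerge = { merge: "emp_dept_test" };

// Merge the new result into the output collection; for a key that
// already exists, run the reduce function on the old and new values.
var outReduce = { reduce: "emp_dept_test" };

// Return the result inline in the command response instead of storing it.
var outInline = { inline: 1 };
```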

Map-Reduce Examples:

Consider two collections (the MongoDB counterpart of tables) named:

  • Employee
  • Department

To create the collections in MongoDB, use the following commands:

db.createCollection("Employee")
db.createCollection("Department")

To insert data into the Employee collection:

db.Employee.insert({"name" : { "first" : "John", "last" : "Backus" }, "city" : "Hyd", "department" : 1})

db.Employee.insert({"name" : { "first" : "Merry", "last" : "Desuja" }, "city" : "Pune", "department" : 2})

To insert data into the Department collection:

db.Department.insert({"_id" : 1, "department" : "Manager"})

db.Department.insert({"_id" : 2, "department" : "Accountant"})

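The requirement is to display the first name, last name and department name of each employee, which means combining data from both collections.

This can be done with map-reduce:

Create one map function for each collection. Both emit the department id as the key, so that values from the two collections meet under the same key.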

// map function for Employee
var mapEmployee = function () {
    var output = { departmentid: this.department, firstname: this.name.first, lastname: this.name.last, department: null };
    emit(this.department, output);
};

// map function for Department
var mapDepartment = function () {
    var output = { departmentid: this._id, firstname: null, lastname: null, department: this.department };
    emit(this._id, output);
};

Write the reduce logic to combine the required fields:

var reduceF = function (key, values) {
    var outs = { firstname: null, lastname: null, department: null };

    // Take each field from the first value in which it is set.
    values.forEach(function (v) {
        if (outs.firstname == null) { outs.firstname = v.firstname; }
        if (outs.lastname == null) { outs.lastname = v.lastname; }
        if (outs.department == null) { outs.department = v.department; }
    });
    return outs;
};

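Store the result in a separate collection called emp_dept_test. With the reduce output action, a result whose key already exists in the output collection is combined with the existing document by running the reduce function on both:

result = db.Employee.mapReduce(mapEmployee, reduceF, {out: {reduce: 'emp_dept_test'}})
result = db.Department.mapReduce(mapDepartment, reduceF, {out: {reduce: 'emp_dept_test'}})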

Run the following command to get the combined result:

db.emp_dept_test.find()

The query returns the combined result:


{
    "_id" : 1,
    "value" : {
        "firstname" : "John",
        "lastname" : "Backus",
        "department" : "Manager"
    }
}

{
    "_id" : 2,
    "value" : {
        "firstname" : "Merry",
        "lastname" : "Desuja",
        "department" : "Accountant"
    }
}
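The two-pass join can also be traced without a server. The sketch below reuses the map and reduce functions from this example and replays both mapReduce calls against an in-memory object. The emit stand-in and the mapReduceInto helper are hypothetical and only approximate the reduce output action; in particular, the order in which MongoDB passes the existing and new values to reduce is an assumption here.

```javascript
// In-memory stand-ins, for illustration only (not MongoDB APIs).
var pairs = [];
function emit(key, value) { pairs.push({ key: key, value: value }); }

var mapEmployee = function () {
    emit(this.department, { departmentid: this.department, firstname: this.name.first,
                            lastname: this.name.last, department: null });
};
var mapDepartment = function () {
    emit(this._id, { departmentid: this._id, firstname: null,
                     lastname: null, department: this.department });
};
var reduceF = function (key, values) {
    var outs = { firstname: null, lastname: null, department: null };
    values.forEach(function (v) {
        if (outs.firstname == null) { outs.firstname = v.firstname; }
        if (outs.lastname == null) { outs.lastname = v.lastname; }
        if (outs.department == null) { outs.department = v.department; }
    });
    return outs;
};

// Hypothetical helper mimicking out: {reduce: ...}: group emitted values by key,
// reduce keys that have several values, then reduce against any existing result.
function mapReduceInto(target, docs, mapFn, reduceFn) {
    pairs = [];
    docs.forEach(function (doc) { mapFn.call(doc); });
    var groups = {};
    pairs.forEach(function (p) {
        (groups[p.key] = groups[p.key] || []).push(p.value);
    });
    Object.keys(groups).forEach(function (k) {
        var values = groups[k];
        var value = values.length > 1 ? reduceFn(k, values) : values[0];
        target[k] = (k in target) ? reduceFn(k, [target[k], value]) : value;
    });
    return target;
}

var employees = [
    { name: { first: "John", last: "Backus" }, city: "Hyd", department: 1 },
    { name: { first: "Merry", last: "Desuja" }, city: "Pune", department: 2 }
];
var departments = [
    { _id: 1, department: "Manager" },
    { _id: 2, department: "Accountant" }
];

var empDeptTest = {};
mapReduceInto(empDeptTest, employees, mapEmployee, reduceF);     // first pass: employee fields only
mapReduceInto(empDeptTest, departments, mapDepartment, reduceF); // second pass: department name filled in
```

After the first pass each entry still has department set to null; the second pass finds the same key already present and lets reduceF fill in the missing field.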

 

-By
Nitin Uttarwar
Helical It Solution
