Mongo DB Aggregation – Part II – The Aggregation Framework

Mongo DB Aggregation(Part 2) – The Aggregation Framework
 

The Mongo DB aggregation framework or aggregation pipeline lets you define a data processing pipeline. It takes multiple documents in a collection, puts them through the multiple Stages we define and transforms them into an aggregated result. It allows us to do much more complex aggregation than the “single purpose aggregation operators”, that we saw in the first part of this blog (Part I).

Let’s dive into an example to see what we can do with this. Let’s take the same collection we used in the part I with some additional documents.


{ "_id": 1, "emp_dept": "Sales", joining_date: ISODate("2014-10-01T00:00:00.000Z"), "job_roles":[ "Manager", "Region Head" ], salary :{ "base": 2100, "variable": "800" }}
{ "_id": 2, "emp_dept": "Sales", joining_date: ISODate("2014-01-03T00:00:00.000Z"), "job_roles": [ "Analyst", "Sales Officer"], salary : { "base": 2000, "variable": 500 }}
{ "_id": 3, "emp_dept": "Finance", joining_date: ISODate("2015-01-31T00:00:00.000Z"), "job_roles": "Manager" , salary : { "base": 1000, "variable": 500 }}
{ "_id": 4, "emp_dept": "Sales", joining_date: ISODate("2015-03-03T00:00:00.000Z"), "job_roles": [ "Analyst"], salary : { "base": 1500, "variable": 500 }}
{ "_id": 5, "emp_dept": "Finance", joining_date: ISODate("2015-05-28T00:00:00.000Z"), "job_roles": [ "Analyst" , "Region Head"], salary : { "base": 3000, "variable": 500 }}
{ "_id": 6, "emp_dept": "Engineering", joining_date: ISODate("2015-01-29T18:30:00.000Z"), "job_roles": "Manager" , salary : { "base": 10000, "variable": 1500 }}
{ "_id": 7, "emp_dept": "Engineering", joining_date: ISODate("2014-05-02T18:30:00.000Z"), "job_roles": [ "Superintendent"], salary : { "base": 5000, "variable": 500 } }
{ "_id": 8, "emp_dept": "Engineering", joining_date: ISODate("2015-05-26T18:30:00.000Z"), "job_roles": [ "Superintendent" , "Region Head"], salary : { "base": 15000, "variable": 2000 } }

 

I will take the sql query below, which has most of the clauses we generally use and then see how the mongoDB aggregation looks like for the same.

— Query —
 


Select emp_dept,
sum(salarybase) as basesalary, sum(salaryvariable) as variablesalary,
count(_id) as empcount, avg(salary) as avgsalary
from employees
where joining_date >= '2015-01-01'
group by emp_dept
having avgsalary >= 2500
order by avgsalary desc;

 

— Aggregate function —
 


db.employees.aggregate( [
   { $match: { joining_date: { $gte: new Date("2015","00","00") }}},
   { $group: { _id: "$emp_dept", 
               basesalary: { $sum: "$salary.base" }, 
               variablesalary: {$sum: "$salary.variable"},
               empcount :{$sum:1}
              }},
   { $project: { 
               basesalary : "$basesalary",
               variablesalary: "$variablesalary",
               empcount :"$empcount",
               avgsalary: {$divide:[{ $add: [ "$basesalary", "$variablesalary" ] }, "$empcount" ] }
               }},
   { $match: { avgsalary: { $gte: 2500 } } }
] )

 

— Stages explained —
 

$match :

This is typically a filter expression. The matching documents are passed to the next stage with no modifications. In our example, it acts as a WHERE as well as HAVING clause, as per the position of the $match stage in the pipeline. Firstly, it was used to filter the whole collection to only retrieve employees with joining date greater than ‘2015-01-01’. Later, it was used to filter only the groups (departments) whose average salary exceeded 2500.

The typical operators used in Expressions are $gt, $lt, $eq, $gte, $lte etc. Apart from Operator Expressions, there are Boolean Expressions, Set Expressions, Comparison Expressions, Arithmetic Expressions, String Expressions and so on. A lot to explore and utilise.

$group :

This the group by clause which allows us to group the documents using a given key and define aggregated fields. In our example, we group the records on “employee department” and then calculate aggregated “basesalary”, “variablesalary” and also the “count of employees” in the group.

Apart from $sum, there are many other Accumulators that can be used in the $group stage, like $avg,$first,$last,$min,$max etc.

$project:

This is the SELECT clause, wherein we define the fields that we would like to see in the resultset.In our example, we have copied some fields from the previous stage results (basesalary, variablesalary,empcount) and added a new field “avgsalary” which uses arithmetic functions like add & divide.

The other Arithmetic Expressions we can use are $multiply, $subtract, $mod. But as described earlier, we have quite some expressions at our disposal.

Apart from the ones described,there are other stages that we haven’t talked about like $unwind, $skip, $limit, $redact, $out, $geoNear.
 
But I hope, I could at the least get you started on using the Aggregation Pipeline.

Shraddha Tambe | Helical IT Solutions

References:
http://docs.mongodb.org/manual/meta/aggregation-quick-reference/
http://docs.mongodb.org/v2.6/reference/sql-aggregation-comparison/

Mongo DB Aggregation – Part I – Single purpose aggregation operations

Mongo DB Aggregation(Part 1) – Single purpose aggregation operations

MongoDB offers more than one ways of aggregating data stored in a collection. It has a flexible datamodel and gives us tools to write some pretty complex queries and derive results similar to what SQL queries could do. Like queries, aggregation operations in MongoDB use collections of documents as an input and return results in the form of one or more documents.

There are 3 main aggregation operators :

  • Single purpose aggregation operations
  • Aggregation Framework
  • Map Reduce

 
In this part, we will have a look at the “Single purpose aggregation operators” and their usage with examples. Though these are limited in scope, they might provide simple way of doing the common data processing operations.

mongo_agg_simple

We will consider below collection as an example for reference:

{ "_id": 1, "emp_dept": "Sales", joining_date: "01-10-2014", "job_roles": [ "Manager", "Regional Head" ], salary : { "base": 2100, "variable": 800 }}
{ "_id": 2, "emp_dept": "Sales", joining_date: "03-01-2014", "job_roles": [ "Analyst", "Sales Officer"], salary : { "base": 2000, "variable": 500 }}
{ "_id": 3, "emp_dept": "Finance", joining_date: "31-01-2015", "job_roles": "Manager" , salary : { "base": 1000, "variable": 500 }}
{ "_id": 4, "emp_dept": "Sales", joining_date: "03-03-2015", "job_roles": [ "Analyst"], salary : { "base": 1500, "variable": 500 } }
{ "_id": 5, "emp_dept": "Finance", joining_date: "28-05-2015", "job_roles": [ "Analyst" , "Regional Head"], salary : { "base": 3000, "variable": 500 } }

 

To help easy understanding, we will first write an SQL query for each example and then translate to a MongoDB command.

1. COUNT – counts all documents or the ones that match a query (if specified)

Syntax:

db.<collection.count(query)
or
db.runCommand({ count: , query: , limit: , skip: , hint: })

 
Examples:

Query ==>

select count(*) from employees where joining_date >= '01-01-2015';

 

In Mongo ==>

db.employees.count({ joining_date: { $gte: new Date("2014","12","31") }})
or
db.runCommand( { count:'employees',
query: { joining_date: { $gte: new Date("2014","12","31") } }
} )

 

Result ==>

{ "n" : 3, "ok" : 1 }

* hint option is like a performance hint, used to specify a index that can be used.
 

2. DISTINCT – returns distinct values for a key in collection. In case the key is an array field, it considers each value in the array as different value.

Syntax:

db..distinct(field, query)
or
db.runCommand({ distinct: "", key: "", query: })

 

Examples:

Query 1 ===>

Select emp_dept from employees;

 

In Mongo ===>

db.runCommand({ distinct: "employees", key: "emp_dept"})
or db.employees.distinct("emp_dept")

 
Result ==>

{
"values" : [ "Sales", "Finance" ],
"stats" : { ... },
"ok" : 1
}

 

Query 2 ===>

Select distinct job_roles from employees where emp_dept = 'Sales';

 

In Mongo ===>

db.runCommand ( { distinct: "employees", key: "job_roles", query: { emp_dept: "Sales"} } )

 

Result ===>

{
"values" : [ "Manager", "Regional Head", "Analyst", "Sales Officer"],
"stats" : { ... },
"ok" : 1
}

 

3. GROUP – groups the documents in a collection by given key and allows to perform simple aggreation functions like sum/count on fields.

Syntax:

db.collection.group({ key, reduce, initial [, keyf] [, cond] [, finalize] })
or
db.runCommand({
group:
{
ns: ,
key: ,
$reduce: ,
initial:,
$keyf: ,
cond: ,
finalize:
}
})

 

Examples:

Query ===>

Select emp_dept,
sum(salarybase) as basesalary, sum(salaryvariable) as variablesalary,
count(_id) as empcount, avg(salary) as avgsalary
from employees
group by emp_dept;

 

In Mongo ===>

db.runCommand({ group:{
    ns: "employees",
    key:{emp_dept:1},
    initial: { basesalary : 0, variablesalary: 0, empcount:0 },
    $reduce: function ( curr, result ) {
    result.emp_dept = curr.emp_dept,
    result.basesalary += curr.salary.base;
    result.variablesalary += curr.salary.variable;
    result.empcount++;
    },
    $finalize:function(result){
    result.avgsalary = (result.basesalary+result.variablesalary)/result.empcount;
    }}})

or 

db.employees.group({key:{emp_dept:1},
    initial: { basesalary : 0, variablesalary: 0,empcount:0 }, reduce:function ( curr, result ) {
    result.emp_dept = curr.emp_dept,
    result.basesalary += curr.salary.base;
    result.variablesalary += curr.salary.variable;
    result.empcount++;
},finalize:function(result){
result.avgsalary = (result.basesalary+result.variablesalary)/result.empcount;
}})

 

Result ===>

{"retval" :[
{ "emp_dept" : "Sales", "basesalary" : 5600 ,"variablesalary": 1800, "empcount": 3, "avgsalary": 2466.666666666667},
{ "emp_dept" : "Finance", "basesalary" : 4000 ,"variablesalary": 1000, "empcount":2,"avgsalary": 2500},
],
"count" : 5,
"keys" : 3,
"ok" : 1}

 

This Group function / command does not work with sharded clusters and also has limitation on the size of the array that is returned. The aggregation framework or map reduce provide more stronger options.

So, this is it. Simple but handy.

We will look at the Aggregation Framework in Part II of this blog. Stay tuned !!

Shraddha Tambe | Helical IT Solutions

References:
http://docs.mongodb.org/manual/core/single-purpose-aggregation/
http://docs.mongodb.org/manual/reference/method/js-collection/

Map Reduce In MongoDB

MongoDb Map Reduce

Map-reduce is a data processing paradigm for condensing large volumes of data into useful aggregated results.

For map-reduce operations, MongoDB provides the map Reduce database command.

The mapReduce command allows you to run map-reduce aggregation operations over a collection. The mapReduce command has the following prototype form:


db.runCommand(
     {
               mapReduce: <collection>,
               map: <function>,
               reduce: <function>,
               finalize: <function>,
               out: <output>,
               query: <document>,
               sort: <document>,
               limit: <number>,
               scope: <document>,
               verbose: <boolean>
     }
)

 

Pass the name of the collection to the mapReduce command (i.e. <collection>) to use as the source documents to perform the map reduce operation.

The command also accepts the following parameters:

Field Description
mapReduce The name of the collection on which you want to perform map-reduce. This collection will be filtered using query before being processed by the map function.
map A JavaScript function that associates or “maps” a value with a key and emits the key and value pair.
reduce A JavaScript function that “reduces” to a single object all the values associated with a particular key.
out Specifies where to output the result of the map-reduce operation. You can either output to a collection or return the result inline.
query Optional. Specifies the selection criteria using query operators for determining the documents input to the map function.
sort Optional. Sorts the input documents. This option is useful for optimization. For example, specify the sort key to be the same as the emit key so that there are fewer reduce operations. The sort key must be in an existing index for this collection.
limit Optional. Specifies a maximum number of documents for the input into the map function.
finalize Optional. Follows the reduce method and modifies the output.
scope Optional. Specifies global variables that are accessible in the map, reduce and finalize functions.
verbose Optional. Specifies whether to include the timing information in the result information. The verbose defaults to true to include the timing information.

 

The following is a prototype usage of the mapReduce command:

var mapFunction = function() { ... };
var reduceFunction = function(key, values) { ... };
db.runCommand(
{
      mapReduce: <input-collection>,
      map: mapFunction,
      reduce: reduceFunction,
      out: { merge: <output-collection> },
      query: <query>
}
)

Requirement for map function:
Map function is responsible for transforming each input document in to zero or more documents.It can access the variables defined in the scope parameter,and has following prototypes.

function(){
     ...
     emit(key,value);
}

The map function has the following requirements:

  • In the map function, reference the current document as this within the function.
  • The map function should not access the database for any reason.
  • The map function should be pure, or have no impact outside of the function (i.e. side effects.)
  • A single emit can only hold half of MongoDB..
  • The map function may optionally call emit(key,value) any number of times to create an output document associating key with value.

The following map function will call emit(key,value) either 0 or 1 times depending on the value of the input document’s status field:

function(){   
   if(this.status=='A')       
      emit(this.cust_id,1);
}

The following map function may call emit(key,value) multiple times depending on the number of elements in the input document’s items field:

function(){this.items.forEach(function(item){emit(item.sku,1);});}

Requirements for the Reduce Function
The reduce function has the following prototype:


     function(key,values){
         ...
         return result;
}

The reduce function exhibits the following behaviors:

  • The reduce function should not access the database, even to perform read operations.
  • The reduce function should not affect the outside system.
  • MongoDB will not call the reduce function for a key that has only a single value. The valuesargument is an array whose elements are the value objects that are “mapped” to the key.
  • MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.
  • The reduce function can access the variables defined in the scope parameter.
  • The inputs to reduce must not be larger than half of MongoDB’s. This requirement may be violated when large documents are returned and then joined together in subsequent reduce steps.

Because it is possible to invoke the reduce function more than once for the same key, the following properties need to be true:

  • the type of the return object must be identical to the type of the value emitted by the mapfunction.
  • the reduce function must be associative. The following statement must be true:
reduce(key,[C,reduce(key,[A,B])])==reduce(key,[C,A,B])
    • the reduce function must be idempotent. Ensure that the following statement is true:
reduce(key,[reduce(key,valuesArray)])==reduce(key,valuesArray)
    • the reduce function should be commutative: that is, the order of the elements in thevaluesArray should not affect the output of the reduce function, so that the following statement is true:
reduce(key,[A,B])==reduce(key,[B,A])

Requirements for the finalize Function

The finalize function has the following prototype:

 function(key,reducedValue){
          ...
          return modifiedObject;
}

The finalize function receives as its arguments a key value and the reducedValue from thereduce function. Be aware that:

  • The finalize function should not access the database for any reason.
  • The finalize function should be pure, or have no impact outside of the function (i.e. side effects.)
  • The finalize function can access the variables defined in the scope parameter.

out Options

You can specify the following options for the out parameter:

Output to a Collection

This option outputs to a new collection, and is not available on secondary members of replica sets.

out:<collectionName>

Map-Reduce Examples:

Consider two Collection (tables) named :

  • Employee
  • Department

Now , to create collection in mongo db , use below query

db.createCollection(“Employee”)
db.createCollection(“Department”)

To insert data in Employee Collection :

db.Employee.insert({“name” : { “first” : “John”, “last” : “Backus” }, “city” : “Hyd”,“department” : 1})

db.Employee.insert({“name” : { “first” : “Merry”, “last” : “Desuja” }, “city” : “Pune”,“department” : 2})

To insert data in Department Collection :

db.Department.insert({“_id” : 1,   “department” : “Manager”})

db.Department.insert({“_id” : 2,   “department” : “Accountant”})

Now the requirement is to display FirstName , LastName , DepartmentName.

For this , we need to use Map Reduce :

Create two map functions for both the collections.

//map function for Employee
var mapEmployee = function () {
var output= {departmentid : this.department,firstname:this.name.first, lastname:this.name.last , department:null}
     emit(this.department, output);               
};

//map function for Department
var mapDepartment = function () {
var output= {departmentid : this._id,firstname:null, lastname:null , department:this.department}
     emit(this._id, output);              
 };

Write Reduce Logic to display the required fields :



var reduceF = function(key, values) {

var outs = {firstname:null, lastname:null , department:null};

values.forEach(function(v){
      if(outs.firstname ==null){outs.firstname = v.firstname }                   
      if(outs.lastname ==null){outs.lastname = v.lastname    }
      if(outs.department ==null){ outs.department = v.department }                         
 });   
 return outs;
};

Store the result into a different collection called emp_dept_test


result = db.employee_test.mapReduce(mapEmployee, reduceF, {out: {reduce: ‘emp_dept_test’}}) 
result = db.department_test.mapReduce(mapDepartment,reduceF, {out: {reduce: ‘emp_dept_test’}})

write the following command to get combined result:

db.emp_dept_test.find()

Output of the query gives the combined result like


{
    "_id" : 1,
    "value" : {
        "firstname" : "John",
        "lastname" : "Backus",
        "department" : "Manager"
    }
}

/* 1 */
{
    "_id" : 2,
    "value" : {
        "firstname" : "Merry",
        "lastname" : "Desuja",
        "department" : "Accountant"
    }
}

 

-By
Nitin Uttarwar
Helical It Solution

Getting Started with Mongo DB

Installation & Startup:

Download MongoDB installer for windows platform from http://www.mongodb.org/downloads and run. This simply extracts the binaries to your program files.

#Create DBPATH and log libraries:

Allocate a folder in your system that can be used for holding the mongo databases and also allocate a log file.

Ex – Allocated “C:\mongo\data\db” for databases and “C:\mongo\logs\mongo.log” as a log file.

#Starting the mongo database

Below are different ways of starting the mongodb:

1.    From the command prompt

Execute the mongod.exe present in the bin folder to start the database.

On command prompt à mongod --dbpath c:\mongo\data\db

There are other options that can also be specified alongwith dbpath. If dbpath is not provided, it looks for c:\data\db folder and gives error if not found.

To shutdown, press CTRL+C

 

2.    Starting with a config file

You can create a configuration file to define settings for the MongoDB server like the dbpath,logpath etc. Below is a sample file :

(This is a older format, for 2.6 version a new format is introduced. Older format is supported for backward compatibility)

#This is an example config file for MongoDB

dbpath = C:\Mongo\data\db

port = 27017

logpath = C:\Mongo\logs\mongo.log

Now you can use the below command –

C:\Program Files\MongoDB 2.6 Standard\bin>mongod --config mongo.conf

2014-04-15T10:27:18.883+0530 log file "C:\Mongo\logs\mongo.log" exists; moved to

"C:\Mongo\logs\mongo.log.2014-04-15T04-57-18".

As we haven’t specified “logappend” option in the config file, it allocates new file everytime you start the db. You can check the log file if you are getting errors while connecting to the db

To shutdown, use command “mongod –shutdown”

 

3.    Installing as Windows service:

Start the command prompt as administrator

You can use the below command to create the service, edit the same as per your settings:

sc create MongoDB binPath= "\"C:\Program Files\MongoDB 2.6 Standard\bin\mongod.exe\" --service --config=\"C:\Program Files\MongoDB 2.6 Standard\bin\mongo.conf\"" DisplayName= "MongoDB 2.6 Standard"

Please note this is a single line of command

You can now simply start/stop the service to start/shutdown the mongo database.

 

Using Mongo command shell:

Run Mongo.exe from \bin folder and you will see the below:

MongoDB shell version: 2.6.0

connecting to: test      //This is the Default database

Welcome to the MongoDB shell.

For interactive help, type "help".

For more comprehensive documentation, see

http://docs.mongodb.org/

Questions? Try the support group

http://groups.google.com/group/mongodb-user

 

Some basic commands to get you started

> show dbs                  // show databases

admin  (empty)
local  0.078GB

> use names               // switch to a particular database/creates one if it does not exist

switched to db names

> db.mynames.insert({name: 'shraddha', email: 'shraddha@gmail'})           // Inserting document
WriteResult({ "nInserted" : 1 })

//Note that , ‘db’ points to the current database in use. Here, Collection “mynames” is automatically created when you insert a document

> show dbs

admin  (empty)
local  0.078GB
names  0.078GB

> db.mynames.find()               //query the db, select operation

{ "_id" : ObjectId("534cbfd03dfb3fbd86d8029d"), "name" : "shraddha", "email" : "shraddha@gmail" }

//One more way of inserting……

> a={"name":"test3","email":"test3.other"}

{ "name" : "test3", "email" : "test3.other" }

> b={"name":"test4",email:"test4.other"}

{ "name" : "test4", "email" : "test4.other" }

> db.othernames.insert(a)

WriteResult({ "nInserted" : 1 })

> db.othernames.insert(b)

WriteResult({ "nInserted" : 1 })

> db.othernames.insert(c)

2014-04-15T19:40:24.798+0530 ReferenceError: c is not defined

//…In all the above inserts, the “_id” which has the unique key is auto-generated..

 

> coll=db.mynames

names.mynames

> coll.find()

{ "_id" : ObjectId("534cbfd03dfb3fbd86d8029d"), "name" : "shraddha", "email" : "shraddha@gmail" }
{ "_id" : ObjectId("534d3b89f4d4b90697c205d6"), "name" : "test1", "email" : "test1.helical" }

> coll=db.othernames

names.othernames

> coll.find()

{ "_id" : ObjectId("534d3dc3f4d4b90697c205d7"), "name" : "test3", "email" : "test3.other" }
{ "_id" : ObjectId("534d3dcdf4d4b90697c205d8"), "name" : "test4", "email" : "test4.other" }

 

> coll.find({name:{$gt:"test3"}})                  //find documents where “name” is >”test3”

{ "_id" : ObjectId("534d3dcdf4d4b90697c205d8"), "name" : "test4", "email" : "test4.other" }

> coll.find({name:"test3"})

{ "_id" : ObjectId("534d3dc3f4d4b90697c205d7"), "name" : "test3", "email" : "test3.other" }

>

> coll.find({$or:[{name:{$gt:"test3"}},{name:"test3"}]})

{ "_id" : ObjectId("534d3dc3f4d4b90697c205d7"), "name" : "test3", "email" : "test3.other" }
{ "_id" : ObjectId("534d3dcdf4d4b90697c205d8"), "name" : "test4", "email" : "test4.other" }

> coll.find({$or:[{name:{$gt:"test3"}},{name:"test0"}]})

{ "_id" : ObjectId("534d3dcdf4d4b90697c205d8"), "name" : "test4", "email" : "test4.other" }

>
 
//Example - Manually inserting ObjectID field (key value)
 
> coll=db.testobjs

names.testobjs

> coll.insert({_id:1,fld1:"abc",fld2:123})

WriteResult({ "nInserted" : 1 })

> coll.insert({_id:2,fld1:"cde",fld2:345})

WriteResult({ "nInserted" : 1 })

> coll.insert({_id:2,fld1:"cde",fld2:345})       //trying to insert duplicate value in _id

WriteResult({
"Inserted" : 0,
"writeError" : {
"code" : 11000,
"errmsg" : "insertDocument :: caused by :: 11000 E11000 duplicate key error index: names.testobjs.$_id_  dup key: { : 2.0 }"}}

> coll.find()

{ "_id" : 1, "fld1" : "abc", "fld2" : 123 }
{ "_id" : 2, "fld1" : "cde", "fld2" : 345 }

>
 

Importing a csv file into mongodb:

Alter the below command as per your requirement and execute:

C:\Program Files\MongoDB 2.6 Standard\bin>mongoimport --db northwind --collection orders --type csv --file C:\Shraddha\Official\MongoDB\northwind-mongo-master\orders.csv --headerline

connected to: 127.0.0.1
2014-04-17T18:24:22.603+0530 check 9 831
2014-04-17T18:24:22.604+0530 imported 830 objects

 

Options used –

–db : name of the database
–collection : orders
–type : type of input file (we can also import tsv, JSON)
–file : path of the input file
–headerline : signifies that the first line in the csv file is column names

 

Shraddha Tambe

Helical IT Solutions