Mongo DB Aggregation – Part II – The Aggregation Framework

Mongo DB Aggregation(Part 2) – The Aggregation Framework
 

The Mongo DB aggregation framework or aggregation pipeline lets you define a data processing pipeline. It takes multiple documents in a collection, puts them through the multiple Stages we define and transforms them into an aggregated result. It allows us to do much more complex aggregation than the “single purpose aggregation operators”, that we saw in the first part of this blog (Part I).

Let’s dive into an example to see what we can do with this. Let’s take the same collection we used in the part I with some additional documents.


{ "_id": 1, "emp_dept": "Sales", joining_date: ISODate("2014-10-01T00:00:00.000Z"), "job_roles":[ "Manager", "Region Head" ], salary :{ "base": 2100, "variable": "800" }}
{ "_id": 2, "emp_dept": "Sales", joining_date: ISODate("2014-01-03T00:00:00.000Z"), "job_roles": [ "Analyst", "Sales Officer"], salary : { "base": 2000, "variable": 500 }}
{ "_id": 3, "emp_dept": "Finance", joining_date: ISODate("2015-01-31T00:00:00.000Z"), "job_roles": "Manager" , salary : { "base": 1000, "variable": 500 }}
{ "_id": 4, "emp_dept": "Sales", joining_date: ISODate("2015-03-03T00:00:00.000Z"), "job_roles": [ "Analyst"], salary : { "base": 1500, "variable": 500 }}
{ "_id": 5, "emp_dept": "Finance", joining_date: ISODate("2015-05-28T00:00:00.000Z"), "job_roles": [ "Analyst" , "Region Head"], salary : { "base": 3000, "variable": 500 }}
{ "_id": 6, "emp_dept": "Engineering", joining_date: ISODate("2015-01-29T18:30:00.000Z"), "job_roles": "Manager" , salary : { "base": 10000, "variable": 1500 }}
{ "_id": 7, "emp_dept": "Engineering", joining_date: ISODate("2014-05-02T18:30:00.000Z"), "job_roles": [ "Superintendent"], salary : { "base": 5000, "variable": 500 } }
{ "_id": 8, "emp_dept": "Engineering", joining_date: ISODate("2015-05-26T18:30:00.000Z"), "job_roles": [ "Superintendent" , "Region Head"], salary : { "base": 15000, "variable": 2000 } }

 

I will take the sql query below, which has most of the clauses we generally use and then see how the mongoDB aggregation looks like for the same.

— Query —
 


Select emp_dept,
sum(salarybase) as basesalary, sum(salaryvariable) as variablesalary,
count(_id) as empcount, avg(salary) as avgsalary
from employees
where joining_date >= '2015-01-01'
group by emp_dept
having avgsalary >= 2500
order by avgsalary desc;

 

— Aggregate function —
 


db.employees.aggregate( [
   { $match: { joining_date: { $gte: new Date("2015","00","00") }}},
   { $group: { _id: "$emp_dept", 
               basesalary: { $sum: "$salary.base" }, 
               variablesalary: {$sum: "$salary.variable"},
               empcount :{$sum:1}
              }},
   { $project: { 
               basesalary : "$basesalary",
               variablesalary: "$variablesalary",
               empcount :"$empcount",
               avgsalary: {$divide:[{ $add: [ "$basesalary", "$variablesalary" ] }, "$empcount" ] }
               }},
   { $match: { avgsalary: { $gte: 2500 } } }
] )

 

— Stages explained —
 

$match :

This is typically a filter expression. The matching documents are passed to the next stage with no modifications. In our example, it acts as a WHERE as well as HAVING clause, as per the position of the $match stage in the pipeline. Firstly, it was used to filter the whole collection to only retrieve employees with joining date greater than ‘2015-01-01’. Later, it was used to filter only the groups (departments) whose average salary exceeded 2500.

The typical operators used in Expressions are $gt, $lt, $eq, $gte, $lte etc. Apart from Operator Expressions, there are Boolean Expressions, Set Expressions, Comparison Expressions, Arithmetic Expressions, String Expressions and so on. A lot to explore and utilise.

$group :

This the group by clause which allows us to group the documents using a given key and define aggregated fields. In our example, we group the records on “employee department” and then calculate aggregated “basesalary”, “variablesalary” and also the “count of employees” in the group.

Apart from $sum, there are many other Accumulators that can be used in the $group stage, like $avg,$first,$last,$min,$max etc.

$project:

This is the SELECT clause, wherein we define the fields that we would like to see in the resultset.In our example, we have copied some fields from the previous stage results (basesalary, variablesalary,empcount) and added a new field “avgsalary” which uses arithmetic functions like add & divide.

The other Arithmetic Expressions we can use are $multiply, $subtract, $mod. But as described earlier, we have quite some expressions at our disposal.

Apart from the ones described,there are other stages that we haven’t talked about like $unwind, $skip, $limit, $redact, $out, $geoNear.
 
But I hope, I could at the least get you started on using the Aggregation Pipeline.

Shraddha Tambe | Helical IT Solutions

References:
http://docs.mongodb.org/manual/meta/aggregation-quick-reference/
http://docs.mongodb.org/v2.6/reference/sql-aggregation-comparison/

Mongo DB Aggregation – Part I – Single purpose aggregation operations

Mongo DB Aggregation(Part 1) – Single purpose aggregation operations

MongoDB offers more than one ways of aggregating data stored in a collection. It has a flexible datamodel and gives us tools to write some pretty complex queries and derive results similar to what SQL queries could do. Like queries, aggregation operations in MongoDB use collections of documents as an input and return results in the form of one or more documents.

There are 3 main aggregation operators :

  • Single purpose aggregation operations
  • Aggregation Framework
  • Map Reduce

 
In this part, we will have a look at the “Single purpose aggregation operators” and their usage with examples. Though these are limited in scope, they might provide simple way of doing the common data processing operations.

mongo_agg_simple

We will consider below collection as an example for reference:

{ "_id": 1, "emp_dept": "Sales", joining_date: "01-10-2014", "job_roles": [ "Manager", "Regional Head" ], salary : { "base": 2100, "variable": 800 }}
{ "_id": 2, "emp_dept": "Sales", joining_date: "03-01-2014", "job_roles": [ "Analyst", "Sales Officer"], salary : { "base": 2000, "variable": 500 }}
{ "_id": 3, "emp_dept": "Finance", joining_date: "31-01-2015", "job_roles": "Manager" , salary : { "base": 1000, "variable": 500 }}
{ "_id": 4, "emp_dept": "Sales", joining_date: "03-03-2015", "job_roles": [ "Analyst"], salary : { "base": 1500, "variable": 500 } }
{ "_id": 5, "emp_dept": "Finance", joining_date: "28-05-2015", "job_roles": [ "Analyst" , "Regional Head"], salary : { "base": 3000, "variable": 500 } }

 

To help easy understanding, we will first write an SQL query for each example and then translate to a MongoDB command.

1. COUNT – counts all documents or the ones that match a query (if specified)

Syntax:

db.<collection.count(query)
or
db.runCommand({ count: , query: , limit: , skip: , hint: })

 
Examples:

Query ==>

select count(*) from employees where joining_date >= '01-01-2015';

 

In Mongo ==>

db.employees.count({ joining_date: { $gte: new Date("2014","12","31") }})
or
db.runCommand( { count:'employees',
query: { joining_date: { $gte: new Date("2014","12","31") } }
} )

 

Result ==>

{ "n" : 3, "ok" : 1 }

* hint option is like a performance hint, used to specify a index that can be used.
 

2. DISTINCT – returns distinct values for a key in collection. In case the key is an array field, it considers each value in the array as different value.

Syntax:

db..distinct(field, query)
or
db.runCommand({ distinct: "", key: "", query: })

 

Examples:

Query 1 ===>

Select emp_dept from employees;

 

In Mongo ===>

db.runCommand({ distinct: "employees", key: "emp_dept"})
or db.employees.distinct("emp_dept")

 
Result ==>

{
"values" : [ "Sales", "Finance" ],
"stats" : { ... },
"ok" : 1
}

 

Query 2 ===>

Select distinct job_roles from employees where emp_dept = 'Sales';

 

In Mongo ===>

db.runCommand ( { distinct: "employees", key: "job_roles", query: { emp_dept: "Sales"} } )

 

Result ===>

{
"values" : [ "Manager", "Regional Head", "Analyst", "Sales Officer"],
"stats" : { ... },
"ok" : 1
}

 

3. GROUP – groups the documents in a collection by given key and allows to perform simple aggreation functions like sum/count on fields.

Syntax:

db.collection.group({ key, reduce, initial [, keyf] [, cond] [, finalize] })
or
db.runCommand({
group:
{
ns: ,
key: ,
$reduce: ,
initial:,
$keyf: ,
cond: ,
finalize:
}
})

 

Examples:

Query ===>

Select emp_dept,
sum(salarybase) as basesalary, sum(salaryvariable) as variablesalary,
count(_id) as empcount, avg(salary) as avgsalary
from employees
group by emp_dept;

 

In Mongo ===>

db.runCommand({ group:{
    ns: "employees",
    key:{emp_dept:1},
    initial: { basesalary : 0, variablesalary: 0, empcount:0 },
    $reduce: function ( curr, result ) {
    result.emp_dept = curr.emp_dept,
    result.basesalary += curr.salary.base;
    result.variablesalary += curr.salary.variable;
    result.empcount++;
    },
    $finalize:function(result){
    result.avgsalary = (result.basesalary+result.variablesalary)/result.empcount;
    }}})

or 

db.employees.group({key:{emp_dept:1},
    initial: { basesalary : 0, variablesalary: 0,empcount:0 }, reduce:function ( curr, result ) {
    result.emp_dept = curr.emp_dept,
    result.basesalary += curr.salary.base;
    result.variablesalary += curr.salary.variable;
    result.empcount++;
},finalize:function(result){
result.avgsalary = (result.basesalary+result.variablesalary)/result.empcount;
}})

 

Result ===>

{"retval" :[
{ "emp_dept" : "Sales", "basesalary" : 5600 ,"variablesalary": 1800, "empcount": 3, "avgsalary": 2466.666666666667},
{ "emp_dept" : "Finance", "basesalary" : 4000 ,"variablesalary": 1000, "empcount":2,"avgsalary": 2500},
],
"count" : 5,
"keys" : 3,
"ok" : 1}

 

This Group function / command does not work with sharded clusters and also has limitation on the size of the array that is returned. The aggregation framework or map reduce provide more stronger options.

So, this is it. Simple but handy.

We will look at the Aggregation Framework in Part II of this blog. Stay tuned !!

Shraddha Tambe | Helical IT Solutions

References:
http://docs.mongodb.org/manual/core/single-purpose-aggregation/
http://docs.mongodb.org/manual/reference/method/js-collection/