Mongo DB Aggregation(Part 2) – The Aggregation Framework
The Mongo DB aggregation framework or aggregation pipeline lets you define a data processing pipeline. It takes multiple documents in a collection, puts them through the multiple Stages we define and transforms them into an aggregated result. It allows us to do much more complex aggregation than the “single purpose aggregation operators”, that we saw in the first part of this blog (Part I).
Let’s dive into an example to see what we can do with this. Let’s take the same collection we used in the part I with some additional documents.
{ "_id": 1, "emp_dept": "Sales", joining_date: ISODate("2014-10-01T00:00:00.000Z"), "job_roles":[ "Manager", "Region Head" ], salary :{ "base": 2100, "variable": "800" }}
{ "_id": 2, "emp_dept": "Sales", joining_date: ISODate("2014-01-03T00:00:00.000Z"), "job_roles": [ "Analyst", "Sales Officer"], salary : { "base": 2000, "variable": 500 }}
{ "_id": 3, "emp_dept": "Finance", joining_date: ISODate("2015-01-31T00:00:00.000Z"), "job_roles": "Manager" , salary : { "base": 1000, "variable": 500 }}
{ "_id": 4, "emp_dept": "Sales", joining_date: ISODate("2015-03-03T00:00:00.000Z"), "job_roles": [ "Analyst"], salary : { "base": 1500, "variable": 500 }}
{ "_id": 5, "emp_dept": "Finance", joining_date: ISODate("2015-05-28T00:00:00.000Z"), "job_roles": [ "Analyst" , "Region Head"], salary : { "base": 3000, "variable": 500 }}
{ "_id": 6, "emp_dept": "Engineering", joining_date: ISODate("2015-01-29T18:30:00.000Z"), "job_roles": "Manager" , salary : { "base": 10000, "variable": 1500 }}
{ "_id": 7, "emp_dept": "Engineering", joining_date: ISODate("2014-05-02T18:30:00.000Z"), "job_roles": [ "Superintendent"], salary : { "base": 5000, "variable": 500 } }
{ "_id": 8, "emp_dept": "Engineering", joining_date: ISODate("2015-05-26T18:30:00.000Z"), "job_roles": [ "Superintendent" , "Region Head"], salary : { "base": 15000, "variable": 2000 } }
I will take the sql query below, which has most of the clauses we generally use and then see how the mongoDB aggregation looks like for the same.
— Query —
Select emp_dept, sum(salarybase) as basesalary, sum(salaryvariable) as variablesalary, count(_id) as empcount, avg(salary) as avgsalary from employees where joining_date >= '2015-01-01' group by emp_dept having avgsalary >= 2500 order by avgsalary desc;
— Aggregate function —
db.employees.aggregate( [ { $match: { joining_date: { $gte: new Date("2015","00","00") }}}, { $group: { _id: "$emp_dept", basesalary: { $sum: "$salary.base" }, variablesalary: {$sum: "$salary.variable"}, empcount :{$sum:1} }}, { $project: { basesalary : "$basesalary", variablesalary: "$variablesalary", empcount :"$empcount", avgsalary: {$divide:[{ $add: [ "$basesalary", "$variablesalary" ] }, "$empcount" ] } }}, { $match: { avgsalary: { $gte: 2500 } } } ] )
— Stages explained —
$match :
This is typically a filter expression. The matching documents are passed to the next stage with no modifications. In our example, it acts as a WHERE as well as HAVING clause, as per the position of the $match stage in the pipeline. Firstly, it was used to filter the whole collection to only retrieve employees with joining date greater than ‘2015-01-01’. Later, it was used to filter only the groups (departments) whose average salary exceeded 2500.
The typical operators used in Expressions are $gt, $lt, $eq, $gte, $lte etc. Apart from Operator Expressions, there are Boolean Expressions, Set Expressions, Comparison Expressions, Arithmetic Expressions, String Expressions and so on. A lot to explore and utilise.
$group :
This the group by clause which allows us to group the documents using a given key and define aggregated fields. In our example, we group the records on “employee department” and then calculate aggregated “basesalary”, “variablesalary” and also the “count of employees” in the group.
Apart from $sum, there are many other Accumulators that can be used in the $group stage, like $avg,$first,$last,$min,$max etc.
$project:
This is the SELECT clause, wherein we define the fields that we would like to see in the resultset.In our example, we have copied some fields from the previous stage results (basesalary, variablesalary,empcount) and added a new field “avgsalary” which uses arithmetic functions like add & divide.
The other Arithmetic Expressions we can use are $multiply, $subtract, $mod. But as described earlier, we have quite some expressions at our disposal.
Apart from the ones described,there are other stages that we haven’t talked about like $unwind, $skip, $limit, $redact, $out, $geoNear.
But I hope, I could at the least get you started on using the Aggregation Pipeline.
Shraddha Tambe | Helical IT Solutions
References:
http://docs.mongodb.org/manual/meta/aggregation-quick-reference/
http://docs.mongodb.org/v2.6/reference/sql-aggregation-comparison/
Best Open Source Business Intelligence Software Helical Insight is Here