How to Solve Duplicate Data Issues in Mondrian Hierarchy

Posted on by By Mounika, in Pentaho | 0

In a cube, sometimes we come across issues related to data like duplicate values appearing in the cube. This blog explains how to solve such kind of data issues.

Scenario 1:

The reason for the duplicate data is an improper ordering of levels in a hierarchy.

Consider the below cube which gives information about sales per each city, state and country.

The highlighted information shows that there is duplicate data.SQL Query can be used to test the data. Below is the query used to test:

SQL Query:
select l.country,l.state,l.city,sum(store_sales)
from fact_blog f
inner join location_blog l
on f.location_id=l.id
group by l.country,l.state,l.city
order by l.country,l.state,l.city

Query Result:

The data discrepancy is because of an improper ordering of levels in hierarchy.

Current Hierarchy:

  1. Country
  2. City
  3. State

Each country has different cities, but each city has the same state. Mondrian does not work with such kind of hierarchy. It aggregates the measured value for distinct state i.e. lowest level. So cube has the same data for all the 3 rows where the state is Maharashtra. It aggregated the values of all the rows where the state is Maharashtra(1030+10+1004=2044) and displayed it on the cube.

Suggested Hierarchy:

  1. Country
  2. State
  3. City

Now, each country has different state and also each state has a different city. So there would not be any duplicates. Below is the cube after reordering the levels in the hierarchy.




Scenario 2:

Consider the below cube which gives information about sales per each city, state and country

If we observe there is the same amount of data i.e. 301 for city Hyderabad which is located in different countries namely India and Pakistan.

Below is the SQL query for the report generated from the cube

SQL Query:
select l.country,l.state,l.city,sum(store_sales)
from fact_blog f
inner join location_blog l
on f.location_id=l.id
group by l.country,l.state,l.city
order by l.country,l.state,l.city

Query Result:

The data is duplicated in the cube because there is the same city in both countries.

Mondrian does not work with such kind of hierarchy. It aggregates the measured value for distinct city i.e. the lowest level. So cube has the same data for the 2 rows where the city is Hyderabad. It aggregated the values of all the rows where the city is Hyderabad(300+1=301) and displayed it on the cube.

To resolve this problem, there are 3 ways

  1. Remove city ‘Hyderabad’ from the level and use a code for city which is unique.
  2. Show some difference in the city names i.e. Prefix the city name which is same for both states with different names(country/state names).Ex : I_Hyderabad, P_Hyderabad

Screenshots of the result are given below for every solution mentioned in the above 2 points.

  1. The report is as shown after removing ‘Pakistan’
  2. Query Result:

  3. Report after prefixing the city names with country/state names is as shown

In Mondrian cube, the lowest level must be unique. If not, data will be duplicated.

In case if you have any queries please get us at support@helicaltech.com

Thank You
Mounika
Helical IT Solutions Pvt Ltd

logo

Best Open Source Business Intelligence Software Helical Insight Here

logo

A Business Intelligence Framework


0 0 votes
Article Rating
Subscribe
Notify of
0 Comments
Inline Feedbacks
View all comments