What is a Cube in Data warehousing?
- Cubes are data processing units composed of fact tables and dimensions from the data warehouse. They provide multidimensional views of data, querying and analytical capabilities to clients.
- There are different types of cubes, and we would be covering only multidimensional cube in this blog-post.
What is a Multidimensional Cube?
- A multidimensional cube will give us a view of the data for the required dimensions along with the configured aggregations.
- For example, please consider the below data table which gives us a one dimensional view.
- If we wish to view the data differently i.e., need to know the
paid to each then we need to aggregate the above data and capture the results.
- Aggregation of data on a reactive basis works fine for systems dealing with small amount of data.But, when we talk about big data then the aggregations are time taking (take forever) and resource consuming.
- As most the OLAP applications in the current IT world would wish to play around with as much data as possible, we need big data solutions in place to address any data response/processing delays.
- The solution described in this blog will pre-aggregate the data as required on the dimensions given using an open source ETL tool. This pre-aggregation will avoid data processing/response delays when dealing with BIG DATA.
Data Cube Using Talend ETL Tool :
- Basic knowledge of ETL and data-warehousing concepts
- Basic knowledge of Talend ETL Tool
Tool Used :
- Talend Open Source Studio for Big Data v6.4
We would be considering the below data file for creating the cube. The same logic can be applied for aggregating the data on a larger scale.
The aggregations that need to be applied on the data are:
- Average <Salary>
- Count of employees
Open the Talend application and create a new job.
From the palette drag and drop the ‘tFileInputDelimited’ component and configure it as per below.
Component Configuration :
- The number of the combinations of the cube depends on the number of dimensions required.
- The formula to calculate the number of combinations is 2(n)
- n=number of dimensions.
- We have 3 dimensions, so we will have a total of 8 combinations.
- Each dimension will have one aggregation based on dimension combination.
- Drag a tmap component from the pallete and connect the previous component output to tmap.
- Configure the tmap component such that it reflects the number of possible cube combinations, which in the current example is 8.
- We can calculate the number of possible cube combinations using the below link
- For the data we are processing, the number of possible cube combinations are as follows
- There should be 8 outputs mapped in the tmap as per below.
Combination 1 --> <Name>,<Department>,<Salary>,<Aggregation1>,<Aggregation2> Combination 2 --><Name>,<Department>,NULL,<Aggregation1>,<Aggregation2> Combination 3 --><Name>,NULL,<Salary>,<Aggregation1>,<Aggregation2> Combination 4 -->NULL,<Department>,<Salary>,<Aggregation1>,<Aggregation> Combination 5 --><Name>,NULL,NULL,<Aggregation1>,<Aggregation2> Combination 6 -->NULL,<Department>,NULL,<Aggregation1>,<Aggregation2> Combination 7 -->NULL,NULL,<Salary>,<Aggregation1><,Aggregation2> Combination 8 -->NULL,NULL,NULL,<Aggregation1>,<Aggregation2>
- Output each values of combinations i.e., Cube1,2,3 etc to different taggregaterow components.
- Each taggregaterow component will have the same aggregation with different dimensions as per below.
Redirect the output of the to a tlogrow component(can be redirected to a file or different database) and read the cube data.
Columns separated by “|”.
Complete Design of the job in Talend
- Above design is one of the solutions for pre-aggregating big data and making it readily available querying.
- Like everything else, this solution comes with a trade-off i.e., storage space.
- The pre-aggregated cube data can be stored in a clustered file system with Apache cassandra on top of it for faster data retrieval.
In case if you have any queries comment in the below comment section
Senior ETL Developer
Helical IT Solutions Pvt Ltd
Best Open Source Business Intelligence Software Helical Insight Here
A Business Intelligence Framework
Best Open Source Business Intelligence Software Helical Insight is Here