Guide to Slowly Changing Dimensions [Intro + Type 1]

First, what is a dimension?
A dimension is a structure that categorizes facts and measures, which can then be used to understand and answer business questions.

What is a Slowly Changing Dimension?
A Slowly Changing Dimension (SCD) is a dimension whose attribute values change slowly over time.

Why is it needed?
Data warehousing deals with historical data, so tracking changes is important: it gives a better understanding of the business over time.

An Example
Imagine a customer dimension in a taxation department which holds records such as addresses.
When a customer changes his/her address, it is important to track that change, i.e. from the old address to the new one.

Types of Slowly Changing Dimensions:
1. Type 1
2. Type 2
3. Type 3
4. Type 4
5. Type 5

1. Type 1:
Type 1 follows the technique of overwriting the previous value. No historical data is kept, so it is commonly used for correcting or updating records.
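In relational terms, a Type 1 change is simply an overwrite of the existing row. A minimal SQL sketch, assuming an illustrative customer_dim table keyed on client_id (these names are not the post's actual schema):

-- Sohail moves from NY to NJ: the old value is overwritten, no history is kept
UPDATE customer_dim
SET region = 'NJ'
WHERE client_id = 101;

-- brand-new customers are inserted as usual
INSERT INTO customer_dim (client_id, first_name, region)
VALUES (104, 'New Customer', 'NY');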

In this blog I will show you how to create a Type 1 Slowly Changing Dimension using Pentaho Data Integration:

Steps
1. Identify your source, or create one if you have to.
[Screenshot: CSV input]

2. Load the data into the database table.
[Screenshot: Load Data]

Now open a new transformation.
3. Select an input component for your source; in my case it is a CSV file input.
4. Select a Get System Info input component [optional].
Double-click on it and set:
For Type: System date (fixed)
For Name: load_date (any name)

5. If you are using a Get System Info step, then this step is necessary.
Select a Join Rows (cartesian product) component.
What this does is cross-join the new load_date field onto every row of the source table.
Select OK.

6. Select a Select Values component;
this is used to add, remove or alter fields from the source before they go to the target, if required.

Note: make sure all the components are connected as shown below.
[Screenshot: Before insert]
Double-click on the Select Values component.

Click on 'Get fields to select'.
Click on the Meta-data tab and again select 'Get fields to change'.
Click OK!

7. Select an Insert/Update component.
Select your target schema.
Select your target table.

Now select 'Get fields'
and remove the fields you don't require, keeping the key (primary key) used to look up values;
in my case, client_id.

Then

select 'Get update fields',
and in the Update column select N for any column you don't want to update;
in my case again, client_id.

Now, if you selected the Get System Info component to add a load_date column, the step below is important.

Remember we added load_date; we now want to add that column to the target table.

So select the SQL button,
and

look through the pre-defined SQL that appears; you will see an ALTER TABLE statement adding the column load_date.
All we need to do is select Execute, then run a query to check that the column has been added, as shown below.

[Screenshot: load_date added]
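The generated statement will look roughly like this sketch (the table name customer_dim is only an illustration; PDI generates it for your actual target table):

ALTER TABLE customer_dim ADD COLUMN load_date TIMESTAMP;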

Click OK!

8. Now go to your source, edit one record and also add a new one, as shown below.

[Screenshot: added_data]
Then save it.

I have added a new row and also edited Sohail's region from NY to NJ.

Now your whole transformation should look something like this, with all the connections pointing to the right components.

[Screenshot: finished transformation]

Now run the transformation.

After it has run successfully,

check your data via a query, for example the sketch below.
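A simple check query; the table and column names here are assumptions matching the example data:

SELECT client_id, first_name, region, load_date
FROM customer_dim
ORDER BY client_id;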

[Screenshot: final data]

As you can see, the data for Sohail has been overwritten [NY to NJ],
and one new row has been added.

In my next blog we will discuss Type 2 Slowly Changing Dimensions.

Thanks
Izebhijie Sohail Ehizogie

Handling Rollback and Database transactions in Pentaho Data Integration.


While inserting data into a database table, it often happens that a job/transformation fails partway through the load, and we would like the whole transaction to be rolled back.

In general, the Table Output step in PDI will insert and commit any number of records until there is an error. Consider an example where the ETL process handles 5,000 records and record 4,000 contains a string where the DDL defines the field as integer: the job/transformation will come to a sudden halt.

The problem now is that we have 3,999 records in the DB table from this ETL run. In some situations this might not be an issue, e.g. if you have a CDC process set up which can deal with it (it can just start off from the highest id or timestamp), but in other scenarios this situation might be a problem.

Pentaho Data Integration provides a transactional feature which is quite easy to use. Make sure your DB engine supports transactions! MySQL's MyISAM engine, for example, does not support transactions, so such a transformation would keep inserting records even though there was an error.

Transactions in transformations: simply enable the “Make the transformation database transactional” option in the “Miscellaneous” tab of the transformation settings dialog (a conceptual sketch of the resulting behaviour follows after the notes below).

[Screenshot: transformation settings dialog]

Note the following caveats of transactional transformations:

  • Enabling the option disables the Use batch update for insert option in the Table Output step and ignores the Commit Size setting, so it will effectively slow down the insert operation.
  • Database transactions will not work if you use database partitioning, where multiple connections per step copy need to be created. Performance-wise it doesn't make any sense in that case anyway.
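Conceptually, enabling the option wraps the whole load in a single database transaction, so a failure discards everything from that run. A rough SQL sketch of the behaviour (table and values are purely illustrative):

BEGIN;
INSERT INTO sales_staging (id, amount) VALUES (1, 100);
INSERT INTO sales_staging (id, amount) VALUES (2, 250);
-- ... record 4,000 fails with a type error ...
ROLLBACK;   -- nothing from this run remains in the table
-- only if every insert succeeds is the work committed:
-- COMMIT;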

Mondrian Cache Refresh Automation Mechanism in Pentaho BA Server

REFRESH MONDRIAN CACHE ON THE BA SERVER

One of the strengths of Mondrian's design is that you don't need to do any processing to populate special data structures before you start running OLAP queries. More than a few people have observed that this makes Mondrian an excellent choice for 'real-time OLAP': running multi-dimensional queries on a database which is constantly changing.

The cache stores query results in memory, which gives a great performance boost. However, when you have a cube with (near) real-time data, the caching can get in your way: if, for instance, your data is updated every ten minutes but the data from an hour ago is still cached, your dashboard won't show your new data.

There are two ways to clear the cache on the server.

Option 1: After you log in to the BA Server, go to Tools -> Refresh. There you can manually clear the cache.

Option 2: Another way of clearing the Mondrian cache is to use a Kettle job that clears the cache after the insertion/update of new data completes. First we create a transformation. Inside the transformation, the HTTP Client step can clear the cache from within the ETL (put a Generate Rows step before it, since the HTTP Client step only works if there is a preceding step) by passing the following URL, along with the user credentials: “localhost:8080/pentaho/api/system/refresh/mondrianSchemaCache”. This clears the Mondrian cache every time the ETL runs on the server. It clears the cache for all the cubes present on the server, so the server is ready for fresh data and we won't face any cache-related issues, which have always been a headache for real-time data.

Similarly, we can use the same mechanism to clear the cache for the reporting metadata and for the repository cache; we just have to follow the same procedure.

 

Regards,

Nitish Kumar Mishra

Beginner’s Guide to E.T.L (Extract, Transform and Load) – Connection Set-Up

Connection Setup for connection type: PostgreSQL

[We are now setting up a connection to a database, i.e. if your source is a database.]
There are three access methods available for PostgreSQL connections:
a. Native (JDBC)
b. ODBC
c. JNDI

a. Native (JDBC)
I. Enter the connection name: any name
II. Select the connection type: PostgreSQL
III. Host Name: localhost [this can be an IP address]
Database Name: LearnETL [name of the database you are using]
Port Number: 5432, or your required port number
User Name: your database user name
Password: your database password
IV. Test the connection and click OK.
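Once the test passes, a quick way to confirm the connection from inside PDI is to drop a Table Input step onto the canvas, point it at this connection and run a trivial query, for example:

SELECT version();  -- returns the PostgreSQL server version string if the connection works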

b. JNDI
Here we need to go to the data-integration folder, open the sub-folder "simple-jndi" and edit jdbc.properties.

Here we need to write the following code:

ETL_Connection/type=javax.sql.DataSource
ETL_Connection/driver=org.postgresql.Driver
ETL_Connection/url=jdbc:postgresql://localhost:5432/LearnETL
ETL_Connection/user=postgres
ETL_Connection/password=postgres

ETL_Connection: name of the connection
localhost:5432/LearnETL: localhost is the host name, 5432 is the port number and LearnETL is the Database name.
user: username
password: Password

Save the file and go back to the database connection dialog.
Restart your PDI.

In the setup, select JNDI, and for
JNDI Name: enter the name of your connection [ETL_Connection].

c. ODBC

This is not commonly used, but what will be needed

is as follows:

1. Install the PostgreSQL ODBC driver, which can be downloaded.
2. Select the PostgreSQL ODBC Unicode driver, and then
3. in the setup,
enter the data source name, server, username, password and port. Test and save if the connection is OK.
4. A bundle of JAR files has to be copied into your Java folder, since the JDBC-ODBC bridge has been discontinued in Java 8.

Thank You
Sohail Izebhijie

Beginner’s Guide to E.T.L (Extract, Transform and Load) – A Basic Process

Loading Data from Source to Target

Before we proceed it's important to identify the tool you need to accomplish the ETL process; in my case I will be using the Pentaho Data Integration application (keep in mind that irrespective of the tool you use, the steps or procedures are similar, though the approach might differ).

The following Steps can be followed:

1. Identify your source; it could be one of the following:
a. CSV file
b. Text file
c. A database
d. and so on

In my scenario, it is a CSV (comma-separated values) file.

2. Open up Spoon.bat.
Select a new transformation, select an 'Input' step, then choose the type of input you require:
there are Text File Input, CSV File Input, SAS Input, Table Input and so on. In my case, since I'll be using a CSV file as the source, I'll select the CSV File Input component.

3. Set up your connection based on your preferred connection type; in my case I will be using PostgreSQL.

[Read my next blog on setting up a connection using connection type: PostgreSQL.]

4. Once the connection has been established, you can right-click on the connection and select Share if it is a connection all your transformations will use; this shares the connection with your other transformations.

5. Since we will be sending data from a source to a target, we need an 'Input' step as the source
and an 'Output' step as the target.

6. Input

Download a CSV file from the internet,

or even create a TXT/CSV input file yourself,
as shown below.

Create a source if required.
Text_Source (comma delimited):

Employee_Number,First_Name,Last_Name,Middle_Initial,Age,Gender,Title
101,Ehizogie,Izebhijie,Sohail,24,Male,Developer
102,Fahad,Anjum,WithClause,23,Male,Developer
103,Gayatri,Sharma,A,24,Female,Accountant

Save it as .txt or .csv and this can be your input.

Since our input is from a CSV file,
we open up our CSV File Input component:
Step Name: any name
File Name: browse to the file path
Delimiter: , (comma)
Enclosure: "

Then select 'Get Fields' and click OK.
Preview your data.

7. Output
Open up the Table Output component.
Select the target schema.
Select the target table.

[Keeping in mind that the table must already exist in the database.]

Select OK!
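If the target table does not exist yet, you can create one whose columns match the sample source above; a minimal sketch (table name and column types are assumptions):

CREATE TABLE employee (
    employee_number INTEGER,
    first_name      VARCHAR(50),
    last_name       VARCHAR(50),
    middle_initial  VARCHAR(20),
    age             INTEGER,
    gender          VARCHAR(10),
    title           VARCHAR(50)
);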

Right-click on the Table Output step to map the columns from source to target.

This is important in order to get the right data from the source into the right columns in the target,
and then run the transformation.

As a beginner, keep in mind that errors are bound to occur,
such as the type of data from the source not matching your target table's format,
and so on.

Here we can add a small transformation step to convert the data and take care of such format errors.
[In my next blog we can look into handling that.]

Now go to your target database and run SELECT * FROM table_name.

There you go!

Thanks
Sohail Izebhijie

Beginner’s Guide to E.T.L (Extract, Transform and Load) – Introduction

Introduction to E.T.L (Extract, Transform and Load)
This is a process related to data warehousing which involves extracting data out of the source system(s)
and placing it into a repository or target.

Extraction
Extracting the data from source systems (flat files or other operational systems) and converting it into one consolidated data warehouse format that is ready for transformation.

Transformation
Transforming the data may involve the following tasks:

  • Cleaning: One of the most important tasks in the transformation stage, because the source data will often contain values that the target system does not support or understand, hence cleaning is required.
    In some cases the source data comes from many inputs, so lookups are important to avoid duplication.
  • Filtering: The source data may contain many rows, but it is important to send only relevant data to your target and filter out the unnecessary data.
  • Business Rules: Calculations or derivations can be performed here so we have correct and readable data at the target.

and many more (a small SQL sketch of these ideas follows below).
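A sketch combining cleaning, filtering and a simple business rule in one SQL pass (table and column names are purely illustrative):

SELECT
    order_id,
    COALESCE(customer_name, 'UNKNOWN') AS customer_name,  -- cleaning: replace missing values
    quantity * unit_price              AS total_amount     -- business rule: derived measure
FROM source_orders
WHERE order_date >= DATE '2016-01-01';                     -- filtering: keep only relevant rows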

Loading

After proper transformation, once the data matches the business rules, loading the data into a target or repository is the final step of E.T.L (Extract, Transform and Load).

In my next blog we will look at basic loading of data from source to target.

Thanks
Sohail Izebhijie

Pentaho BA Server 6.0 : How to enable Pentaho Marketplace and Pentaho CDE(CTools) in the Enterprise Edition

Enable Pentaho Marketplace and Pentaho C-Tools in Pentaho BA Server 6.0 EE

Pentaho 6.0 is the first version of the Pentaho enterprise-class server with new features to automate, manage and enhance every stage of the data pipeline. Pentaho 6.0 also introduces inline analytic model editing for sharing user-created measures and metrics in a collaborative manner.

1. Installing Pentaho Marketplace Plugin:

The Pentaho BA Server Enterprise Edition provides all those things that are limited in the Community Edition. However, we can still add community plugins to the Pentaho server using the Pentaho Marketplace. The Marketplace is generally disabled in the Pentaho 6.0 EE server. There is no need to download the Marketplace all over again: it is an OSGi plugin and you already have it, but it comes disabled by default. Here are the instructions on how to enable it:

    • Edit the file
      pentaho-solutions/system/karaf/etc/org.apache.karaf.features.cfg
    • Find the key “featuresBoot” and add the feature “pentaho-marketplace”. That line will then look something like:
      featuresBoot=config,management,kar,cxf,camel,camel-blueprint,camel-stream,pentahoserver,mongolap,pentaho-metaverse,pdi-dataservice-client,pentaho-yarn-obf,pentaho-marketplace

    • Restart the server and it's done. The Marketplace will now be visible; you can find it in the drop-down of the Home button. Please see the screenshot below for reference:

[Screenshot: Marketplace]

2. Installing Pentaho CDE (Community Dashboards):

In the Enterprise Edition, Pentaho has its own dashboard designer, but it has limited functionality, as we cannot perform all the tasks that we can in CDE. Therefore, developers often opt for CDE, which is available to download from the Pentaho Marketplace. Go into the Marketplace and follow the steps below:

      • There you will find CDF (Community Dashboard Framework), CDA (Community Data Access) and CDE (Community Dashboard Editor).
      • Download all of them.
      • Restart the server and it's done.

Please refer to the below screenshot:

[Screenshot: CDE]

Thank you,

Nitish Kumar Mishra

WITH clause in MDX

In MDX queries, the WITH clause has an important role to play. WITH clauses are used to define members that will be used in the SELECT clause for analysis purposes. The WITH clause can contain declarations of multiple sets and multiple members. It is also used to perform calculations such as profit or previous-year comparisons.

Below are some examples of the MDX queries using the WITH clause,

1. MDX Query to calculate the Total Sales for each year:

WITH member [Measures].[Total Sales] as '([Measures].[Sales] * [Measures].[Quantity])'
SELECT NON EMPTY {[Measures].[Sales], [Measures].[Quantity], [Measures].[Total Sales]} ON COLUMNS,
NON EMPTY [Time].[Years].members ON ROWS
FROM [SteelWheelsSales]

 

In the above example, the WITH clause is used to calculate the total sales value by multiplying the measures “Sales” and “Quantity”. The WITH clause creates a member [Measures].[Total Sales] (it can be used to create a new measure, or a new dimension member, calculated from existing measures). After the WITH clause, the member [Measures].[Total Sales] is used as a measure in the SELECT clause and is displayed on the column axis, while the dimensional data is displayed on the row axis.

Result:

[Screenshot: With1]

2. MDX Query to set a format string in a WITH Clause:

WITH member [Measures].[Total Sales] as '([Measures].[Sales] * [Measures].[Quantity])',
format_string='$#,##0.00'
SELECT {[Measures].[Sales], [Measures].[Quantity], [Measures].[Total Sales]} ON COLUMNS,
[Time].[Years].members ON ROWS
FROM [SteelWheelsSales]

In the same query, if we add the format_string keyword and specify the format in which we want the values of that particular member displayed, we get the desired result in the output. Alternatively, you can use a named format string (for example, here it is a currency format).
Result:
[Screenshot: With2]

3. MDX query to generate result using SET in WITH CLAUSE:

WITH member [Measures].[Total Sales] as '([Measures].[Sales] * [Measures].[Quantity])', format_string = "$#,##0.00"
SET [Not 2005] as 'Filter({[Time].[Years].Members}, ([Time].[Years].CurrentMember.Name <> "2005"))'
SELECT {[Measures].[Sales], [Measures].[Quantity], [Measures].[Total Sales]} ON COLUMNS,
[Not 2005] ON ROWS
FROM [SteelWheelsSales]

Now let's suppose you want to show all the data except for 2005. You can define a set which filters the data (as shown in the above example) and shows the data for all years except 2005.

Result:

[Screenshot: With3]

4. MDX Query to compare data for current year and prior year:

WITH member [Measures].[Total Sales] as '([Measures].[Sales] * [Measures].[Quantity])'
member [Measures].[Last Period] as '([Time].[Years].CurrentMember.PrevMember, [Measures].[Sales])'
SELECT {[Measures].[Sales],[Measures].[Last Period]} ON ROWS,
[Time].[Years].Members ON COLUMNS
FROM [SteelWheelsSales]

In the above expression, the member [Measures].[Last Period] is used to generate the Sales information of the previous year. The CurrentMember.PrevMember function generates the previous year's information (you can do the same for any time member) based upon the measure you have specified. This is also done using the WITH clause.

Result:
[Screenshot: With4]

 

Conclusion:

The WITH clause in MDX is very similar to the WITH clause we use in SQL Server/Postgres. It helps us understand the functionality better and is also used for faster processing. The WITH CACHE clause is used for optimization, which helps in faster query execution.

 

 

Thanks,

Nitish Kumar Mishra

Get the Row Count in PDI Dynamically

Often people use a Table Input step in Pentaho with a SELECT count(*) query to get the row count, then pass that value from the source query to a variable and use it in further transformations. A more optimised way is to use one of the built-in steps available in Pentaho; one such step is 'Memory Group by'.
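The traditional approach looks something like this sketch (the table name is illustrative), with the resulting count then pushed into a variable:

SELECT count(*) AS cr_count
FROM customer_dim;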

In this blog, I'll discuss how we can use the 'Memory Group by' step available in Pentaho to count rows.

Here is an example. Suppose you have a query ready to pull records from the database; you can then pass the rows coming from the input step to the Memory Group by step, as shown below:

[Screenshot: Memory_Group_by1]

Get the 'Memory Group by' component from: Design -> Statistics -> Memory Group by

[Screenshot: Memory_Group_by2]

After that, pass the counted records from the 'Memory Group by' step to a 'Set Variables' step.
Click on the 'Memory Group by' step and, under 'Aggregates', enter the name 'CR_count'.
In Subject, put the column on which the records should be grouped to give the final count.
In Type, select the option 'Number of rows (without field argument)' from the drop-down list, as shown below, to get the record count.

[Screenshot: Memory_Group_by3]

Pass the output from the Memory Group by step to the Set Variables step, and there configure the variable with the same name as you entered in 'Memory Group by', i.e. 'CR_count' in our case.

[Screenshot: Memory_Group_by4]

Finally, we can use the record count throughout the job by accessing this variable.

Thanks,
Nisha Sahu

CROSSTABS in Pentaho Report Designer and a few limitations (EXPERIMENTAL FEATURE)


CROSSTABS in Pentaho Report Designer (EXPERIMENTAL FEATURE)

Crosstabs are a relatively easy way to visualize tabular data along two or more axes. Crosstabs are still an experimental feature, and as such are not fully mature for BI reports. Although they are already used to create some really useful reports, parts of the work still have to be handled in the query, since the crosstab expects normalized data.

1. Data has to be sorted to match the group structure.
2. NULL data must not be present.
3. Aggregations in crosstabs are not flexible.

1. Data has to be sorted:
The data has to be sorted in the crosstab's query in the order Group, Rows, Columns. If it is not sorted, an error pops up saying the data is unsorted and the report is not generated.

2. NULL data must not be present:
No NULL values are allowed in the data; otherwise the following group's data is moved into the current group, producing a wrong report with inappropriate data. So one has to handle NULLs by replacing them with 0 (see the SQL sketch after this list).

3. Aggregations in crosstabs are not flexible:
Aggregations, as with the print function, can display max, min, sum, etc., but will not calculate percentages; formula functions have to be used for any special calculations, and this cannot be handled in the attributes with regular expressions.
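The first two limitations can be handled directly in the crosstab's query; a minimal SQL sketch (table and column names are assumptions):

SELECT
    region,
    product_line,
    COALESCE(SUM(sales), 0) AS total_sales   -- replace NULL aggregates with 0
FROM sales_fact
GROUP BY region, product_line
ORDER BY region, product_line;               -- sorted to match group, row and column order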

Thanks
Asha Bidiyasar