An Alternative to the Insert/Update Step for UPSERT in PDI

What is an UPSERT?

UPSERT is a combination of two operations on a table, UPDATE and INSERT, driven by a unique key (unique ID). A relational database typically performs an UPSERT with a MERGE statement (or an equivalent such as INSERT ... ON CONFLICT in PostgreSQL): rows whose key already exists are updated if their data has changed, and rows with a new key are inserted.
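
For example, in PostgreSQL an UPSERT can be written with INSERT ... ON CONFLICT. A minimal sketch, assuming a hypothetical customer_dim table with a unique key on client_id and a staging table stage_customer:

INSERT INTO customer_dim (client_id, first_name, region)
SELECT client_id, first_name, region
  FROM stage_customer
ON CONFLICT (client_id)
DO UPDATE SET first_name = EXCLUDED.first_name,
              region     = EXCLUDED.region;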

How UPSERT works in an ETL?

While loading data from the source to the target, there are two prominent approaches, namely:

  • Truncate and load: flush out the old data and reload the new data.
  • Insert/Update (UPSERT): update existing records and insert new ones. The step matches on a particular key or set of keys to decide whether to update or insert: if the keys do not match it is an INSERT, otherwise it is an UPDATE. Slowly Changing Dimensions work the same way (SCD Type I is an UPSERT, whereas SCD Type II is an INSERT).

The truncate-and-load approach is not recommended when the source holds only current data or when historical information must be maintained in the target; in those cases the UPSERT mechanism is used.
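
For contrast, a truncate-and-load run is simply a full reload; a sketch using the same made-up table names:

TRUNCATE TABLE customer_dim;

INSERT INTO customer_dim (client_id, first_name, region)
SELECT client_id, first_name, region
  FROM stage_customer;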

UPSERT in Pentaho Data Integration (PDI):

There are various components that can be used to implement UPSERT in PDI. These components use the unique key to look up rows in the target table, and the appropriate operation is performed based on the result. The following are the components used for UPSERT:

  1. Insert/Update (inserts when there is no match and updates when there is a match)
  2. Dimension Lookup/Update (used for SCD operations)
  3. Synchronize after Merge (this step needs a flag field; based on the flag it inserts or updates)

The basic algorithm/flow of how UPSERT works in PDI is as follows:

[Diagram: basic UPSERT flow in PDI]

Disadvantages of the above steps:

The components that ship with PDI, and the algorithm they use to implement UPSERT, become very slow when handling large amounts of data.

  1. The Insert/Update step looks up every incoming row in the target table and compares the values one to one to decide whether to insert or update. When the amount of data is huge, the step works very slowly and hence the whole process slows down.
  2. The same happens with the Dimension Lookup/Update and Synchronize after Merge steps: since each incoming row is looked up against the whole table, i.e. compared with every row in it, the process slows down (see the lookup sketch below).
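
For every incoming row, these steps effectively issue a lookup of this shape against the target table (a sketch with the same made-up names), and it is this row-by-row round trip that becomes expensive at volume:

SELECT first_name, region
  FROM customer_dim
 WHERE client_id = ?;   -- executed once per incoming row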

Alternative approach for the UPSERT technique:

There are certain scenarios where the ETL runs on a regular basis, say every 12 hours. In that case the standard UPSERT steps do not help much: with huge amounts of data their performance is very poor, so there is a need for optimization. Optimizing an ETL process is one of the most important tasks we need to undertake, since it results in faster processing. The following diagram will help you understand the alternative approach to the UPSERT mechanism:

[Diagram: alternative UPSERT approach]

The steps in the above diagram work much like the Insert/Update component, but with some minor tweaks. Each step is described below:

  1. Table Input: reads the source data that needs to be loaded into the target.
  2. Add Sequence: adds a sequence number used as a business key for the table (this can change according to your requirement).
  3. Update: this step only updates existing rows. For new data it fails, since it works only when there is existing data to update.
  4. Table Output: PDI supports error handling on every step. The error-handling hop from the Update step routes the failed (new) rows to the Table Output step, which inserts them into the target table, as shown in the SQL sketch below.
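
In SQL terms, the flow amounts to "update what matches, insert what is left over". A PostgreSQL-style sketch with the same made-up tables as before:

WITH updated AS (
    UPDATE customer_dim t
       SET first_name = s.first_name,
           region     = s.region
      FROM stage_customer s
     WHERE t.client_id = s.client_id
    RETURNING t.client_id
)
INSERT INTO customer_dim (client_id, first_name, region)
SELECT s.client_id, s.first_name, s.region
  FROM stage_customer s
 WHERE s.client_id NOT IN (SELECT client_id FROM updated);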

Please see the diagram below for a better understanding:

[Diagram: Update step with error handling routed to Table Output]

Conclusion:

This approach is faster than the usual Insert/Update step since it does not need a lookup to detect new data: new rows are inserted directly when the Update step fails for them.

Try once and share your feedback!! 🙂

 

Regards,

Nitish Mishra

Different ways of implementing loop in PDI

We come across certain scenarios where running a job/transformation in a loop becomes necessary in Pentaho. Below are the different ways of implementing a loop in a PDI job.

 

1. Using Execute for every input row

Build a job, and let the first transformation retrieve all the rows via a Table Input step. Use a “Copy rows to result” step to save the result. Create a second transformation and configure its job entry for single-row execution (see the advanced transformation settings in the image below). Use “Get rows from result” to start single-row processing.

[Screenshot: advanced transformation settings – execute for every input row]

 

 

2. Using the Simple evaluation step

[Screenshot: job with a loop built around the Simple evaluation step]

  • Get the row count for the loop

  • Assign the count to a variable.

  • Call the variable in the Simple evaluation step (as shown in the figure below)

    [Screenshot: Simple evaluation step settings]

  • Give the appropriate condition according to the requirement.

  • Connect the hop to the job which has to run in a loop.

Guide to Slowly Changing Dimensions [Intro + Type 1]

Firstly what is a dimension?
A dimension is a structure that categorizes facts and measures which can be used to understand business requirements.

What is a Slowly Changing Dimension?
A Slowly Changing Dimension is a dimension whose attributes change slowly over time.

Why is it needed?
In data warehousing, which deals with historical data, tracking changes is important as it helps give a better understanding of the business.

An Example
Imagine a customer dimension in a taxation department which holds records of addresses and so on.
When a customer changes his/her address, it is important to track the address change, i.e. from old to new.

Types of Slowly Changing Dimension.
1. Type 1
2. Type 2
3. Type 3
4. Type 4
5. Type 5

1. Type 1:
This follows the technique of replacing the previous value; no historic data is kept. It is commonly used for correcting or updating records.
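
As a minimal SQL sketch of Type 1 behaviour (using this post's example of Sohail's region changing from NY to NJ; the table, column and id values are illustrative), the old value is simply overwritten:

UPDATE customer_dim
   SET region = 'NJ'          -- new value replaces 'NY'
 WHERE client_id = 101;       -- no history of the old value is kept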

In this blog I will show you how to create a Type 1 Slowly Changing Dimension using Pentaho Data Integration.

Steps
1. Identify your source or create one if you have to.
[Screenshot: CSV source data]

2. Load Data into the Database table
[Screenshot: data loaded into the database table]

Now open a new transformation.
3. Select an input component for your source; in my case it is a CSV file input.
4. Select a Get System Info input component [optional].
Double-click on it and set
Type: System date (fixed)
Name: load_date (any name)

5. If you are using Get System Info, then this step is necessary.
Select the Join Rows (cartesian product) component.
What this does is make a cross join of the new load date column with the rows of the source table.
Select OK.

6. Select a Select Values component;
this is used to add, remove or alter fields in the source stream before it goes to the target, if required.

Note: make sure all the components are connected as shown below.
[Screenshot: transformation before the Insert/Update step]
Double-click on the Select Values component.

Click on Get fields to select the fields.
Click on Meta-data and again select Get Fields to change the metadata.
Click OK!

7. Select an Insert/Update component.
Select your target schema.
Select your target table.

Now select Get Fields
and remove the fields you don't require, leaving the main key or primary key to look up values on;
in my case, client_id.

Then

select Get Update Fields,
and in the Update column select N for the columns you don't want to update;
in my case, again “client_id”.

Now, if you selected the Get System Info component to add a load date column, then the step below is important.

Remember we added load_date; we want to add that column to the target table.

So select SQL,
and look through the pre-defined statement shown; you will see an ALTER TABLE adding the column load_date.
All we need to do is select Execute, and then run a query to check that the column has been added, as shown below.

[Screenshot: load_date column added]

Click OK!
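
The generated statement is roughly of this form, and a catalog query confirms the new column (a sketch; the exact DDL PDI generates depends on your table and database):

ALTER TABLE customer_dim ADD COLUMN load_date TIMESTAMP;

SELECT column_name, data_type
  FROM information_schema.columns
 WHERE table_name = 'customer_dim';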

8. Now go to your source, edit one record and also add a record, as shown below.

[Screenshot: edited and added source data]
Then save it.

I have added a new row and also edited Sohail's region from NY to NJ.

Now your whole transformation should look something like this, with all the connections pointing to the right components:

[Screenshot: complete transformation]

Now run the transformation.

After it has run successfully,

check your data via a query.

[Screenshot: final data]

As you can see, the data has been replaced/overwritten for Sohail [NY to NJ],
and one new row has also been added.
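
A check query of roughly this form shows the overwrite and the new row (table and column names are illustrative and may differ in your setup):

SELECT client_id, region, load_date
  FROM customer_dim
 ORDER BY client_id;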

In my next blog we will discuss Type 2 Slowly Changing Dimensions.

Thanks
Izebhijie Sohail Ehizogie

Handling Rollback and Database transactions in Pentaho Data Integration.

While inserting data into a database table, it often happens that a job/transformation fails partway through the load, and we would like the whole transaction to be rolled back rather than left half-loaded.

In general, the Table Output step in PDI will insert and commit any number of records until there is an error. Consider an example where the ETL process handles 5,000 records and the 3,999th record has a string where the DDL defines the field as an integer: the job/transformation will come to a sudden halt.

The problem, though, is that we now have 3,998 committed records in the DB table from this ETL run. In some situations this might not be an issue, e.g. if you have a CDC process set up which can deal with it (it can just start off from the highest id or timestamp), but in other scenarios this situation is a problem.
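
Such a CDC-style restart can be as simple as extracting only the rows beyond the highest id already loaded; a sketch with made-up table names:

SELECT *
  FROM source_table s
 WHERE s.id > (SELECT COALESCE(MAX(t.id), 0) FROM target_table t);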

Pentaho Data Integration provides a transactional feature which is really quite easy to use. Make sure your DB engine supports transactions! MySQL's MyISAM engine, for example, does not support transactions, so such a transformation would still run and insert records even though there was an error.

Transactions in transformations: simply enable the “Make the transformation database transactional” option in the “Miscellaneous” tab of the transformation settings dialog.

[Screenshot: transformation settings – Miscellaneous tab]
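
Conceptually, enabling this option wraps the whole load in a single database transaction, along these lines (a sketch with made-up table names):

BEGIN;

INSERT INTO target_table
SELECT * FROM staging_table;
-- ... further inserts/updates performed by the transformation ...

COMMIT;   -- reached only if every step succeeded;
          -- on any error a ROLLBACK is issued instead, so nothing from this run is kept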

Note the following about database transactions:

  • Enabling this option disables the “Use batch update for inserts” option in the Table Output step and ignores the Commit size setting, so it will effectively slow down the insert operation.
  • Database transactions will not work if you use database partitioning, where multiple connections per step copy need to be created. Performance-wise it doesn't make much sense in that case anyway.

Mondrian Cache Refresh Automation Mechanism in Pentaho BA Server

REFRESH MONDRIAN CACHE ON THE BA SERVER

One of the strengths of Mondrian's design is that you don't need to do any processing to populate special data structures before you start running OLAP queries. More than a few people have observed that this makes Mondrian an excellent choice for 'real-time OLAP': running multi-dimensional queries on a database which is constantly changing.

The cache stores query results in memory, which gives a big performance boost. However, when you have a cube with (near) real-time data, the caching can get in your way. If, for instance, your data is updated every ten minutes but the data from an hour ago is still cached, your dashboard won't show your new data.

There are two ways to clear the cache on the server.

Option 1: After you log in to the BA Server, go to Tools -> Refresh. There you can manually clear the cache.

Option 2: Another way of clearing the Mondrian cache is to use a Kettle job that clears the cache after the insertion/update of the new data completes. First create a transformation. Inside the transformation, use the HTTP Client step to clear the cache from the ETL (put a Generate Rows step before it, since the HTTP Client step only runs when a step precedes it), passing the URL “localhost:8080/pentaho/api/system/refresh/mondrianSchemaCache” along with the user credentials. This clears the Mondrian cache every time the ETL runs on the server. It clears the cache for all the cubes on the server so that it is ready for the fresh data, and we won't face the cache-related issues which have always been a headache for real-time data.

Similarly, we can use this mechanism to clear the cache for reporting metadata and for the repository cache; we just have to follow the same procedure.

 

Regards,

Nitish Kumar Mishra

Beginner’s Guide to E.T.L (Extract, Transform and Load) – Connection Set-Up

Connection Setup for connection type: PostgreSQL

[We are now setting up a connection to a database, i.e. if your source is a database.]
There are 3 access methods provided for PostgreSQL connections:
a. Native (JDBC)
b. ODBC
c. JNDI

a. Native (JDBC)
I. Enter the connection name: any name
II. Select the connection type: PostgreSQL
III. Host Name: localhost [this can be an IP address]
Database Name: LearnETL [name of the database you are using]
Port Number: 5432, or your required port number
User Name: your database user name
Password: your database password
IV. Test the connection and click OK.

b. JNDI
Here we need to go to the data-integration folder, open up the sub-folder “simple-jndi” and edit jdbc.properties.

Here we need to write the following code:

ETL_Connection/type=javax.sql.DataSource
ETL_Connection/driver=org.postgresql.Driver
ETL_Connection/url=jdbc:postgresql://localhost:5432/LearnETL
ETL_Connection/user=postgres
ETL_Connection/password=postgres

ETL_Connection: name of the connection
localhost:5432/LearnETL: localhost is the host name, 5432 is the port number and LearnETL is the Database name.
user: username
password: Password

Save the file, go back to the database connection dialog
and restart your PDI.

Then in the setup, select JNDI, and for
JNDI Name: enter the name of your connection [ETL_Connection]

c. ODBC

This is not commonly used, but what will be needed

is as follows:

1. Install the PostgreSQL ODBC driver, which can be downloaded.
2. Select the PostgreSQL ODBC Unicode driver, and then
3. in the setup,
enter the data source name, server, username, password and port. Test and save if the connection is OK!
4. A bundle of JAR files needs to be copied into your Java folder, as the JDBC-ODBC bridge has been discontinued in Java 8.

Thank You
Sohail Izebhijie

Beginner’s Guide to E.T.L (Extract, Transform and Load) – A Basic Process

Loading Data from Source to Target

Before we proceed, it's important to identify the tool you need to accomplish the ETL process; in my case I will be using the Pentaho Data Integration application (keep in mind that irrespective of the tool you use, the steps or procedures are similar, but the approach might differ).

The following Steps can be followed:

1. Identify your source; it could be one of the following:
a. C.S.V file
b. Text file
c. A database
d. and so on

In my scenario, it is a C.S.V (comma separated values) file.

2. Open up your Spoon.bat.
Create a new transformation and select an “Input” step, then select the type of input you require:
there are Text File Input, CSV File Input, SAS Input, Table Input and so on. In my case, since I'll be using a C.S.V file as the source, I'll select the CSV File Input component.

3. Set up your connection based on your preferred connection type; in my case I'll be using PostgreSQL.

[Read my next Blog on setting Up a Connection using Connection type: PostgreSQL]

4. Once the connection has been established, you can right-click on the connection and select Share if it is a connection that all your transformations will be using; this shares the connection with the other transformations.

5. Since we will be sending data from a source to a target, we need to have an “Input” as the source
and an “Output” as the target.

6. Input

Download a C.S.V file from the internet,

or even create a TXT/C.S.V input file,
as shown below.

Create a source if required:
Text_Source (comma delimited)

Employee_Number,First_Name, Last_Name, Middle_Initial, Age, Gender, Title
101,Ehizogie,Izebhijie,Sohail,24,Male,Developer
102,Fahad,Anjum,WithClause,23,Male,Developer
103,Gayatri,Sharma,A,24,Female,Accountant

Save as txt or csv and this can be your input.
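
A matching target table must exist in the database before the Table Output step can load it; a possible DDL sketch for this sample (names and types are illustrative):

CREATE TABLE employee (
    employee_number INTEGER PRIMARY KEY,
    first_name      VARCHAR(50),
    last_name       VARCHAR(50),
    middle_initial  VARCHAR(20),
    age             INTEGER,
    gender          VARCHAR(10),
    title           VARCHAR(50)
);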

Here, since our input is from a CSV file,
we open up our CSV File Input component and set:
Step Name: any name
File Name: browse to the selected path
Delimiter: , (comma)
Enclosure: "

Then select Get Fields and select OK.
Preview your data.

7. Output
Open up the Table Output component.
Select the target schema.
Select the target table.

[Keeping in mind the table must already exist in the DB.]

Select OK!

Right-click on the Table Output step to map the columns from source to target.

This is important in order to get the right data from the source into the right column in the target.
Then run the transformation.

As a beginner, keep in mind that errors are bound to occur, such as the type of data from the source not matching your target table format, and so on.

Here we can add a small transformation step to convert the data and take care of such format errors.
[In my next blog we can look into handling that.]

Now go to your target database and run SELECT * FROM table_name;

there you go!

Thanks
Sohail Izebhijie

Beginner’s Guide to E.T.L (Extract, Transform and Load) – Introduction

Introduction to E.T.L (Extract, Transform and Load)
This is a process related to data warehousing which involves extracting data out of the source system(s)
and placing it into a repository or target.

Extraction
Extracting the data from source systems (flat files or other operational systems) and converting it into one consolidated data warehouse format which is ready for transformation.

Transformation
Transforming the data may involve the following tasks:

  • Cleaning: one of the most important tasks in the transformation stage, because the source data will often contain values that the target system doesn't support or understand, hence cleaning is required.
    In some cases the source can come from many inputs, so lookups are important to avoid duplication.
  • Filtering: the source data may have very many rows, but it's important to send only relevant data to your target and filter out the unnecessary data.
  • Business Rules: calculations or derivations can be performed here so we have correct and readable data at the target (see the SQL sketch after this list for a small example of all three).

and many more.
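
As a small SQL illustration of these tasks (table and column names are made up for the example), a transformation might clean, filter and derive in a single pass:

SELECT TRIM(first_name)      AS first_name,   -- cleaning: strip stray whitespace
       UPPER(region)         AS region,       -- cleaning: standardise values
       quantity * unit_price AS total_sales   -- business rule: derived value
  FROM stage_orders
 WHERE quantity > 0;                          -- filtering: drop irrelevant rows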

Loading

After proper transformation, and once the data matches the business rules, loading the data into a target or repository is the final step in E.T.L (Extract, Transform and Load).

In my next blog we will look at the basics of loading data from source to target.

Thanks
Sohail Izebhijie

Pentaho BA Server 6.0: How to enable Pentaho Marketplace and Pentaho CDE (CTools) in the Enterprise Edition

Enable Pentaho Marketplace and Pentaho C-Tools in Pentaho BA Server 6.0 EE

Pentaho 6.0 is the first version of the Pentaho enterprise-class server with new features to automate, manage and enhance every stage of the data pipeline. Pentaho 6.0 also introduces inline analytic model editing for sharing user-created measures and metrics in a collaborative manner.

1. Installing Pentaho Marketplace Plugin:

The Pentaho BA Server Enterprise Edition provides all the things that are limited in the Community Edition. However, we can still add community plugins to the Pentaho server using the Pentaho Marketplace. The Marketplace is disabled by default in the Pentaho 6.0 EE server. There is no need to download the Marketplace again: it is an OSGi plugin and you already have it, it just comes disabled by default. Here are the instructions on how to enable it:

    • Edit the file
      pentaho-solutions/system/karaf/etc/org.apache.karaf.features.cfg
    • Find the key “featuresBoot” and add the feature “pentaho-marketplace”. That line will then look something like:
      featuresBoot=config,management,kar,cxf,camel,camel-blueprint,camel-stream,pentahoserver,mongolap,pentaho-metaverse,pdi-dataservice-client,pentaho-yarn-obf,pentaho-marketplace


    • Restart the server and it's done. The Marketplace will be visible; you will find it in the dropdown of the Home button. Please see the screenshot below for your reference:

[Screenshot: Marketplace entry in the Home menu]

2. Installing Pentaho CDE (Community Dashboards):

In the Enterprise Edition, Pentaho has its own dashboard designer, but it has limited functionality, as we cannot perform all the tasks that we can in CDE. Therefore, developers often opt for CDE, which is available for download from the Pentaho Marketplace. Go into the Marketplace and follow the steps below:

      • There you will find CDF (Community Dashboard Framework), CDA (Community Data Access) and CDE (Community Dashboard Editor).
      • Download all of them.
      • Restart the server and it’s done.

Please refer to the below screenshot:

[Screenshot: CDE in the Marketplace]

Thank you,

Nitish Kumar Mishra

WITH clause in MDX

In MDX queries, the WITH clause has an important role to play. It is used to define members and sets which are then used in the SELECT clause for analysis. A WITH clause can contain declarations of multiple sets and multiple members, and it is used to perform calculations such as computing profit or a previous year's figures.

Below are some examples of MDX queries using the WITH clause:

1. MDX Query to calculate the Total Sales for each year:

WITH member [Measures].[Total Sales] as '([Measures].[Sales] * [Measures].[Quantity])'
SELECT NON EMPTY {[Measures].[Sales], [Measures].[Quantity], [Measures].[Total Sales]} ON COLUMNS,
NON EMPTY [Time].[Years].members ON ROWS
FROM [SteelWheelsSales]

 

In the above example, the WITH clause is used to calculate the total sales value by multiplying the measures “Sales” and “Quantity”. The WITH clause creates the member [Measures].[Total Sales] (it can be used to create a new measure, or a new dimension member calculated from existing measures). After the WITH clause, the member “Measures.Total Sales” is used as a measure in the SELECT clause and is displayed on the column axis, whereas the dimension data is displayed on the row axis.

Result:

[Screenshot: query result]

2. MDX Query to set a format string in a WITH Clause:

WITH member [Measures].[Total Sales] as '([Measures].[Sales] * [Measures].[Quantity])',
format_string='$#,##0.00'
SELECT {[Measures].[Sales], [Measures].[Quantity], [Measures].[Total Sales]} ON COLUMNS,
[Time].[Years].members ON ROWS
FROM [SteelWheelsSales]

In the same query, if we add the format_string keyword and specify the format in which we want to display the values of that particular member, we get the desired result in the output. Alternatively, you can use a named format (for example, here it would be currency).
Result:
[Screenshot: query result]

3. MDX query to generate result using SET in WITH CLAUSE:

WITH member [Measures].[Total Sales] as '([Measures].[Sales] * [Measures].[Quantity])', format_string = "$#,##0.00"
SET [Not 2005] as 'Filter({[Time].[Years].Members}, ([Time].[Years].CurrentMember.Name <> "2005"))'
SELECT {[Measures].[Sales], [Measures].[Quantity], [Measures].[Total Sales]} ON COLUMNS,
[Not 2005] ON ROWS
FROM [SteelWheelsSales]

Now let's suppose you want to show the data for all years except 2005. You can use a named SET which filters the data (as shown in the above example) and shows you the data for every year except 2005.

Result:

[Screenshot: query result]

4. MDX Query to compare data for current year and prior year:

WITH member [Measures].[Total Sales] as '([Measures].[Sales] * [Measures].[Quantity])'
member [Measures].[Last Period] as '([Time].[Years].CurrentMember.PrevMember, [Measures].[Sales])'
SELECT {[Measures].[Sales],[Measures].[Last Period]} ON ROWS,
[Time].[Years].Members ON COLUMNS
FROM [SteelWheelsSales]

In the above expression, the member [Measures].[Last Period] is used to generate the sales information of the previous year. The CurrentMember.PrevMember function generates the previous year's value (you can do the same for any time member) based on the measure you have specified. This, too, is done using the WITH clause.

Result:
[Screenshot: query result]

 

Conclusion:

The WITH clause in MDX is very similar to the WITH clause we use in SQL Server/Postgres. It helps us understand the query better and also allows for faster processing. The WITH CACHE clause is used for optimization, which helps in faster query execution.

 

 

Thanks,

Nitish Kumar Mishra