Different ways of implementing loop in PDI

We come across certain scenarios where running a Job/Transformation in a loop becomes necessary in Pentaho. Below are the different ways of implementing a loop for a Job in PDI.

 

1. Using Execute for every input row

Build a job, and let the first transformation retrieve all the rows via a Table Input step. Use a “Copy rows to result” step to save the result. Create a second transformation configured for single-row execution – refer to the advanced transformation settings image below. Use “Get rows from result” to start single-row processing.

[Image: advanced transformation settings]
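
The sketch below shows the idea in plain Python rather than PDI: one stage plays the role of Table Input plus “Copy rows to result”, and a second stage is executed once per row, like a transformation reading from “Get rows from result”. The row fields are made up for illustration.

```python
# Conceptual sketch (not PDI code) of "execute for every input row".
# Stage 1 stands in for Table Input + Copy rows to result;
# stage 2 stands in for the transformation run once per row.

def fetch_rows():
    # Stand-in for the Table Input step; these rows are made up.
    return [
        {"id": 1, "name": "alpha"},
        {"id": 2, "name": "beta"},
        {"id": 3, "name": "gamma"},
    ]

def process_single_row(row):
    # Stand-in for the second transformation ("Get rows from result").
    print(f"processing id={row['id']} name={row['name']}")

rows_to_result = fetch_rows()      # Copy rows to result
for row in rows_to_result:         # Execute for every input row
    process_single_row(row)
```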

 

 

2. Using the Simple evaluation step


  • Get the count for the loop.

  • Assign the count to a variable.

  • Call the variable in the Simple evaluation step (as shown in the figure below).

    [Image: Simple evaluation step]

  • Give the appropriate condition according to the requirement.

  • Connect the hop to the job that has to run in a loop (a conceptual sketch of this pattern follows below).
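
A minimal sketch of this counter-based loop in plain Python rather than PDI: a count is fetched, held in a variable, and the evaluation decides whether the inner job runs again. The count value and the inner job are placeholders.

```python
# Conceptual sketch (not PDI code) of the counter-based loop:
# get a count, keep it in a variable, evaluate the condition,
# and re-run the inner job until the condition fails.

def run_inner_job(iteration):
    # Stand-in for the job that has to run in a loop.
    print(f"running inner job, iteration {iteration}")

loop_count = 3          # the count fetched and assigned to a variable
current = 0

# Simple evaluation: "is current still smaller than loop_count?"
while current < loop_count:
    run_inner_job(current + 1)
    current += 1        # update the variable each pass
```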

Guide to Slowly Changing Dimensions [Intro + Type 1]


Firstly, what is a dimension?
A dimension is a structure that categorizes facts and measures which can be used to understand business requirements.

What is a Slowly Changing Dimension?
A Slowly Changing Dimension is a dimension that changes slowly over time.

Why is it needed?
In Data Warehousing, which deals with historical data, tracking changes is important, as it gives a better understanding of the business.

An Example
Imagine a customer dimension in a taxation department that holds records of addresses and so on.
When a customer changes his/her address, it is important to track the address change, i.e. from old to new.

Types of Slowly Changing Dimension.
1. Type 1
2. Type 2
3. Type 3
4. Type 4
5. Type 5

1. Type 1:
This follows the technique of replacing the previous value; no historic data is kept. It is commonly used for correcting or updating records.
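
As a rough sketch of the Type 1 logic (plain Python, not PDI): match the incoming row on its key, overwrite the non-key columns if the key exists, insert otherwise, and keep no history. The dimension is modelled as a dict keyed on client_id, the key used later in this walkthrough; the sample values are only illustrative.

```python
# Minimal sketch of SCD Type 1: match on the key, overwrite the
# non-key columns, keep no history. The dimension is a plain dict
# keyed on client_id; the values below are made up for illustration.

dimension = {
    101: {"name": "Sohail", "region": "NY"},
}

def scd_type1_upsert(client_id, incoming):
    if client_id in dimension:
        # Type 1: replace the previous values, no history kept.
        dimension[client_id].update(incoming)
    else:
        # New key: plain insert.
        dimension[client_id] = dict(incoming)

scd_type1_upsert(101, {"name": "Sohail", "region": "NJ"})  # overwrite NY -> NJ
scd_type1_upsert(102, {"name": "Fahad", "region": "NY"})   # new row
print(dimension)
```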

In this blog I will show you how to create a Type 1 Slowly Changing Dimension using Pentaho Data Integration:

Steps
1. Identify your source or create one if you have to.
[Image: CSV input]

2. Load Data into the Database table
[Image: Load Data]

Now open a new transformation.
3. Select an input component for your source; in my case it is a CSV file input.
4. Select a Get System Info input component [Optional]
Double-click on it and set:
For Type: select System date (fixed)
For Name: load_date (any name)

5. If you are using a Get System Info step, then this step is necessary.
Select the Join Rows (Cartesian product) component.
What this does is cross-join the new load_date column onto the rows of the source table.
Select OK.

6. Select a Select Values component;
this is used to add, remove or alter fields in the source before they go to the target, if required.

Note: make sure all the components are connected as shown below:
[Image: Before insert]
Double-click on the Select Values component.

Click on Get fields to select the fields.
Click on the Meta-data tab and again select Get fields to change the metadata if needed.
Click OK!

7. Select an Insert/Update component.
Select your target schema.
Select your target table.

Now select Get fields
and remove the fields you don’t require, while leaving the main key or primary key used to look up values;
in my case, client_id.

Then

select Get update fields,
and in the Update column select N for the columns you don’t want to update;
in my case again, client_id.

Now, if you selected the Get System Info component to add a load date column, then the step below is important.

Remember we added load_date; we want to add that column to the target table.

So select SQL, look through the pre-defined statement shown, and you will see an ALTER TABLE statement that adds the column load_date.
All we need to do is select Execute, then run a query to check that the column has been added, as shown below.

[Image: load_date added]

Click OK!

8. Now go to your source, edit a record and also add a new record, as shown below.

[Image: added data]

Then save it.

I have added a new row and also edited Sohail’s region from NY to NJ.

Now your whole transformation should look something like this, with all the connections pointing to the right components:

[Image: complete transformation]

Now run the transformation.

After it has run successfully, check your data with a query.

[Image: final data]

As you can see, the data has been replaced/overwritten for Sohail [NY to NJ],
and one new row has also been added.

In my next blog we will discuss Type 2 Slowly Changing Dimensions.

Thanks
Izebhijie Sohail Ehizogie

Handling Rollback and Database transactions in Pentaho Data Integration.

While inserting data into a database table, it is often required that when a job/transformation fails in the middle of the data load, the whole transaction gets rolled back.

In general, the Table Output step in PDI will insert and commit any number of records until there is an error. Consider an example where the ETL process handles 5000 records and the 3999th record has a string instead of an integer (e.g. the DDL defines the field as Integer): the job/transformation will come to a sudden halt.

The problem, though, is that we now have 3,999 records in the DB table from this ETL run. In some situations this might not be an issue, e.g. if you have a CDC process set up that can deal with it (it can just start off from the highest id or timestamp), but in other scenarios this situation is a problem.

Pentaho Data Integration provides a transactional feature which is really quite easy to use. Make sure your DB engine supports transactions! E.g. MySQL’s MyISAM engine does not support transactions, so the transformation will run and insert records even though there was an error.

Transactions in transformations: simply enable the “Make the transformation database transactional” option in the “Miscellaneous” tab of the transformation settings dialog.


Note: keep the following in mind when using database transactions:

  • Enabling this option disables the Use batch update for insert option in the Table Output step and ignores the Commit Size setting, so it effectively slows down the insert operation.
  • Database transactions will not work if you use database partitioning, where multiple connections per step copy need to be created. Performance-wise it doesn’t make much sense in that case anyway.
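
For intuition, here is a minimal sketch of the “all or nothing” behaviour using Python and sqlite3 (a stand-in for the real target database, not PDI code): all inserts happen inside one transaction, and a failing record triggers a rollback so nothing from the run remains, rather than leaving the rows committed before the failure.

```python
# Conceptual sketch of transactional loading with sqlite3.
# A bad record raises an error mid-load; the rollback leaves the
# table empty instead of keeping the rows inserted before the failure.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER NOT NULL, amount INTEGER NOT NULL)")

# The third row violates the NOT NULL constraint, standing in for the
# "bad record in the middle of the load" situation described above.
rows = [(1, 100), (2, 200), (3, None)]

try:
    for r in rows:
        conn.execute("INSERT INTO target VALUES (?, ?)", r)
    conn.commit()
except sqlite3.Error:
    conn.rollback()   # nothing from this run stays in the table

print(conn.execute("SELECT COUNT(*) FROM target").fetchone()[0])  # prints 0
```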

Beginner’s Guide to E.T.L (Extract, Transform and Load) – A Basic Process

Loading Data from Source to Target

Before we proceed, it’s important to identify the tool you need to accomplish the ETL process. In my case I will be using the Pentaho Data Integration application (keep in mind that irrespective of the tool you use, the steps or procedures are similar, but the approach might differ).

The following Steps can be followed:

1. Identify your source; it could be one of the following:
a. C.S.V file
b. Text file
c. A database
d. and so on

In my scenario, it is a C.S.V (comma separated values) file.

2. Open up Spoon.bat.
Select a new transformation and select an “Input” step, then select what type of input you require:
we have Text file input, C.S.V file input, SAS Input, Table input and so on. In my case, since I’ll be using a C.S.V file as the source, I’ll select the C.S.V file input component.

3. Set up your connection based on your preferred connection type; in my case I’ll be using PostgreSQL.

[Read my next Blog on setting Up a Connection using Connection type: PostgreSQL]

4. Once the connection has been established, you can right-click on the connection and select Share if it is a common connection that all your transformations will be using; this shares the connection with other transformations.

5. Since we will be sending data from a source to a target, we need to have an “Input” as the source
and an “Output” as the target.

6. Input

Download a C.S.V file from the internet,
or even create a TXT/C.S.V input file
as shown below.

Create a source if required:
Text_Source (comma delimited)

Employee_Number,First_Name,Last_Name,Middle_Initial,Age,Gender,Title
101,Ehizogie,Izebhijie,Sohail,24,Male,Developer
102,Fahad,Anjum,WithClause,23,Male,Developer
103,Gayatri,Sharma,A,24,Female,Accountant

Save it as .txt or .csv and this can be your input.

Here, since our input is from a CSV file,
we open up our C.S.V file input component and set:
Step name: any name
Filename: browse to the file’s path
Delimiter: , (comma)
Enclosure: ”

Then select Get Fields, select OK
and preview your data.

7. Output
Open up the Table output component.
Select the target schema.
Select the target table.

[Keeping in mind the table must already exist in the DB]

Select OK!

Right-click on the Table output step to map the columns from source to target.

This is important, as it is what gets the right data from the source into the right column in the target.
Then run the transformation (a small sketch of this whole source-to-target flow follows below).
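
Here is a rough sketch of the source-to-target flow in plain Python, so the idea is clear outside of Spoon: read the sample employee CSV from above and insert each row into a target table. sqlite3 stands in for the PostgreSQL target so the sketch is self-contained, and the file name text_source.csv is assumed.

```python
# Sketch of the source-to-target flow: read the sample employee CSV
# and insert it into a target table. sqlite3 is a stand-in target;
# the CSV file name is assumed.

import csv
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employee (
    employee_number INTEGER, first_name TEXT, last_name TEXT,
    middle_initial TEXT, age INTEGER, gender TEXT, title TEXT)""")

with open("text_source.csv", newline="") as f:   # assumed file name
    reader = csv.DictReader(f)                   # header row gives the field names
    for row in reader:                           # "mapping" source -> target columns
        conn.execute(
            "INSERT INTO employee VALUES (?, ?, ?, ?, ?, ?, ?)",
            (int(row["Employee_Number"]), row["First_Name"], row["Last_Name"],
             row["Middle_Initial"], int(row["Age"]), row["Gender"], row["Title"]),
        )
conn.commit()

print(conn.execute("SELECT * FROM employee").fetchall())
```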

As a beginner, keep in mind that errors are bound to occur, such as the type of data from the source not matching your target table format, and so on.

Here we can add a small transformation step to convert the data and take care of such format errors.
[In my next blog we can look into handling that.]

Now go to your target database and run SELECT * FROM table_name.

there you go!

Thanks
Sohail Izebhijie

Beginner’s Guide to E.T.L (Extract, Transform and Load) – Introduction

Introduction to E.T.L (Extract, Transform and Load)
This is a process related to data warehousing which involves extracting data out of the source system(s)
and placing it into a repository or target.

Extraction
Extracting the data from source systems (flat files or other operational systems) and converting it into one consolidated data warehouse format which is ready for transformation.

Transformation
Transforming the data may involve the following tasks:

  • Cleaning: One of the most important tasks in the transformation stage, because the source data will often contain values that the target system doesn’t support or understand, hence cleaning is required.
    In some cases the source can come from many source inputs, so lookups are important to avoid duplication.
  • Filtering: The source data may have a great many rows, but it’s important to send only relevant data to your target and filter out the unnecessary data.
  • Business Rules: Calculations or derivations can be performed here so we have correct and readable data at the target.

and many more. (A small sketch of these tasks follows below.)
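
A small sketch of these tasks in Python, on made-up order rows: clean a value the target wouldn’t understand, filter out irrelevant rows, and apply a simple business rule that derives a new field.

```python
# Sketch of the transformation tasks described above, on made-up rows:
# cleaning, filtering, and a simple business rule (derived column).

raw_rows = [
    {"order_id": 1, "country": " in ", "quantity": 2, "unit_price": 10.0},
    {"order_id": 2, "country": "US",   "quantity": 0, "unit_price": 25.0},
    {"order_id": 3, "country": "IN",   "quantity": 5, "unit_price": 3.5},
]

transformed = []
for row in raw_rows:
    # Cleaning: normalise the country code.
    row["country"] = row["country"].strip().upper()
    # Filtering: keep only rows relevant to the target.
    if row["quantity"] <= 0:
        continue
    # Business rule: derive a total amount for the target.
    row["total_amount"] = row["quantity"] * row["unit_price"]
    transformed.append(row)

print(transformed)
```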

Loading

After proper transformation, and once the data matches the business rules, loading the data into a target or repository is the final step in E.T.L (Extract, Transform and Load).

In my next blog we will look into the basics of loading data from source to target.

Thanks
Sohail Izebhijie

Merge Join Vs Stream lookup in Pentaho DI


Merge Join:

It joins two data sets coming from two Table Input steps. In this step, the following types of joins are available:

  • FULL OUTER: all rows from both sources will be included in the result, with empty values for non-matching keys in both data streams
  • LEFT OUTER: all rows from the first source will be in the result, with empty values for non-matching keys in the second data stream
  • RIGHT OUTER: all rows from the second source will be in the result, with empty values for non-matching keys in the first data stream
  • INNER JOIN: only rows having the same key in both sources will be included in the result

Note: In this step, rows are expected to be sorted on the specified key fields. When a Sort rows step is used beforehand, this works fine.
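
To make the mechanics concrete, here is a small Python sketch of a merge join over two inputs that are already sorted on the key, which is exactly what the step expects. Only the INNER case is shown; the OUTER variants additionally emit rows with empty values for the non-matching side. The data is made up.

```python
# Conceptual sketch of a merge join over two key-sorted inputs.
# Only the INNER join is implemented here.

left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]   # (key, value), sorted by key
right = [(2, "x"), (3, "y"), (4, "z")]            # (key, value), sorted by key

def merge_inner_join(left, right):
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit all right-side rows sharing this key with the current left row.
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                result.append((lk, left[i][1], right[j2][1]))
                j2 += 1
            i += 1
    return result

print(merge_inner_join(left, right))  # [(2, 'b', 'x'), (2, 'c', 'x'), (4, 'd', 'z')]
```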

 Stream Lookup:

The Stream lookup step type allows you to look up data using information coming from other steps in the transformation. The data coming from the Source step is first read into memory and is then used to look up data from the main stream.

Preserve memory:  Encodes rows of data to preserve memory while sorting. (Technical background: Kettle will store the lookup data as raw bytes in a custom storage object that uses a hashcode of the bytes as the key. More CPU cost related to calculating the hashcode, less memory needed.)

Key and value are exactly one integer field: Preserves memory when the lookup key and the value are each exactly one integer field. Note: Works only when “Preserve memory” is checked. Cannot be combined with the “Use sorted list” option.
(Technical background: The lookup data is stored in a custom storage object that is similar to the byte-array hashmap, but it doesn’t have to convert to raw bytes. It just takes a hashcode of the long.)

Use sorted list: Enable to store values using a sorted list; this provides better memory usage when working with data sets containing wide rows. Note: Works only when “Preserve memory” is checked. Cannot be combined with the “Key and value are exactly one integer field” option. (Technical background: the lookup data is put into a tuple and stored in a sorted list. Lookups are done via a binary tree search.)
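
Conceptually the step behaves like the following Python sketch: the lookup stream is read fully into memory first (here a plain dict), then each row of the main stream is enriched from it, with a default value when there is no match. The field names are made up.

```python
# Conceptual sketch of a stream lookup: load the lookup stream into
# memory, then enrich each main-stream row from it.

lookup_stream = [
    {"product_id": 10, "product_name": "Pen"},
    {"product_id": 20, "product_name": "Book"},
]
main_stream = [
    {"order_id": 1, "product_id": 10, "qty": 3},
    {"order_id": 2, "product_id": 30, "qty": 1},
]

# Step 1: read the lookup data into memory, keyed on the lookup field.
lookup_index = {row["product_id"]: row["product_name"] for row in lookup_stream}

# Step 2: stream the main rows through and look each one up,
# with a default value when no match is found.
for row in main_stream:
    row["product_name"] = lookup_index.get(row["product_id"], None)
    print(row)
```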

Thank you

Lalitha

Memory Group By Vs Group By in Pentaho DI


Group By and Memory Group By: These steps are useful for calculating aggregate values for groups of fields.

Even though both serve the same purpose, there are some differences and similarities, as follows:

1) Group By works correctly only when the input rows are in sorted order; otherwise it gives accurate results only for consecutive records. So before using the Group By step, the rows should be sorted. Memory Group By, on the other hand, handles unsorted records as well: it groups them internally and gives the output based on the selected aggregate option.

2) In Group By, there is an option to include all rows. If you enable this option, the step outputs all the records, not just the aggregated ones. To differentiate the two types of rows in the output, we have to provide a boolean flag column. In Memory Group By we do not have this option.

3) In Group By, we have a Temporary files directory option, which becomes available when Include all rows is enabled. It selects the directory in which temporary files are stored (needed when the Include all rows option is enabled and the number of grouped rows exceeds 5000); the default is the system’s standard temporary directory. This option allows us to spill data to disk, so for large amounts of data we prefer this step. The Memory Group By step cannot handle huge amounts of data, as it processes everything internally in memory, which causes performance issues; it works fine only for small amounts of data. So when the number of rows is too large to fit into memory, use the combination of the Sort rows and Group By steps.
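
The difference can be sketched in Python: a streaming Group By aggregates consecutive rows and therefore needs sorted input, while a Memory Group By accumulates per key in memory and accepts rows in any order. The sample rows are made up.

```python
# Sketch of the two grouping strategies: a streaming "Group By" that
# relies on the rows being sorted by the group field, and an in-memory
# "Memory Group By" that accepts rows in any order.

from itertools import groupby
from collections import defaultdict

rows = [
    {"dept": "Sales", "amount": 10},
    {"dept": "HR",    "amount": 7},
    {"dept": "Sales", "amount": 5},
]

# Group By style: sort first, then aggregate consecutive rows per key.
sorted_rows = sorted(rows, key=lambda r: r["dept"])
group_by = {k: sum(r["amount"] for r in g)
            for k, g in groupby(sorted_rows, key=lambda r: r["dept"])}

# Memory Group By style: accumulate per key in memory, no sort needed.
memory_group_by = defaultdict(int)
for r in rows:
    memory_group_by[r["dept"]] += r["amount"]

print(group_by)                 # {'HR': 7, 'Sales': 15}
print(dict(memory_group_by))    # {'Sales': 15, 'HR': 7}
```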

4) There is one more option in the Group By step: Add line number, restart in each group. Enable it to add a line number that restarts at 1 in each group.

5) Always give back a result row: This option is the same in both the Group By and Memory Group By steps. If you enable it, the step will always give back a result row, even if there is no input row. This can be useful if you want to count the number of rows; without this option you would never get a count of zero (0).

6) The fields that make up the group: This option also works the same in both steps. After retrieving fields using the Get Fields button, designate the fields to include in the group.

7) Aggregates: Group By and Memory Group By both have the same aggregate functions. Specify the fields that must be aggregated, the method, and the name of the resulting new field.
Here are the available aggregation methods:

  • Sum
  • Average (Mean)
  • Median
  • Percentile
  • Minimum
  • Maximum
  • Number of values (N)
  • Concatenate strings separated by , (comma)
  • First non-null value
  • Last non-null value
  • First value (including null)
  • Last value (including null)
  • Cumulative sum (all rows option only!)
  • Cumulative average (all rows option only!)
  • Standard deviation
  • Concatenate strings separated by <Value>: specify the separator in the Value column
  • Number of distinct values
  • Number of rows (without field argument)

Thank You

Lalitha

ETL Checklist


An ETL checklist covers the things to know when you are starting an ETL project.

Database Connections: the connection metadata and the JAR files required for the source and target databases.

Source data: We may have multiple sources, and sometimes a single source. We need to check whether there is any redundant data, confirm the data types and data format, and understand on what basis the data is formatted.

We also need to know how the customer wants to maintain the historical data: are there any flags for deleted/invalid records, and when should SCD Type 2 and SCD Type 3 changes be used?

Logging and Auditing: We need to log all the details so that if any error occurs we can track exactly where it occurred, because a project will have a number of jobs and sub-jobs.

So we have to track how many records were successfully inserted/updated, how many records got rejected, and the job name, sub-job name and component name (so that if any error occurs we can track it easily).

Rollback and Restarting Jobs: In an ETL process, the affected thing is always the target database. So we need to keep checkpoints in our job in case any error occurs and causes the job to stop. In that case, when restarting the job, it should continue from the last saved checkpoint, so there is no need to start the entire process again, which saves time and also improves the performance of the project.

Batch Loading: When we have a huge amount of data, it is always better to load the data in multiple chunks rather than loading everything at once, so batch loading should be implemented.
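
A minimal sketch of batch loading in Python, with sqlite3 standing in for the target database and an arbitrary batch size: the rows are inserted and committed chunk by chunk rather than in one huge transaction or row by row.

```python
# Sketch of batch loading: insert the data in fixed-size chunks and
# commit per chunk. sqlite3 is a stand-in target; the batch size is arbitrary.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER, value TEXT)")

data = [(i, f"row-{i}") for i in range(10_000)]
BATCH_SIZE = 1_000

for start in range(0, len(data), BATCH_SIZE):
    chunk = data[start:start + BATCH_SIZE]
    conn.executemany("INSERT INTO staging VALUES (?, ?)", chunk)
    conn.commit()   # one commit per chunk

print(conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0])  # 10000
```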

Dependencies: In the target database, we need to check for dependencies such as lookup tables and parent tables connected by foreign key relationships. Do they have matching data types? The parent tables need to be loaded first so that the foreign key relationships are not disturbed and can be handled easily.

Alerting/Notifications: We have to implement an alerting/notification mechanism. If any error occurs, it should trigger a mail including the error description, job name, component name, etc.

Documentation: This is also an important task. Document the mappings in some format; they will be useful as a reference guide.

Thank you

Lalitha

 

Types of Facts in Data warehouse


What is a Fact Table?

A fact table consists of the measures/metrics/facts of a business process. Typically it contains two types of data: foreign keys, which hold the primary keys of the dimension tables, and measures such as sales, revenue, units of product, etc.

[Image: fact table]

Measure Types:

Additive: Measures which can be added across all the dimensions are known as additive measures.

[Image: additive measures]

In the above table, you can add the sales amount with respect to Product, Customer and also by Date.

Semi-Additive: Measures which can be added across only some of the dimensions are known as semi-additive measures.

[Image: semi-additive measures]

Department-wise, we can calculate the total number of employees in the organization. But across dates you cannot say that the total number of employees in the Sales department is 29+35=64; that figure is wrong. You can calculate the total number of employees on 20150901 as 35+150=185. So the measure can be added only across some of the dimensions.
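
The same head-count example can be sketched in Python to show which aggregations are valid. Only the Sales figures come from the text above; the second department name is made up.

```python
# Sketch of the semi-additive head-count example: summing across
# departments on one date is valid, summing across dates is not.
# "Finance" is a made-up department name for illustration.

headcount = [
    {"date": "20150831", "dept": "Sales",   "employees": 29},
    {"date": "20150901", "dept": "Sales",   "employees": 35},
    {"date": "20150901", "dept": "Finance", "employees": 150},
]

# Valid: total employees on 20150901, added across departments.
total_on_date = sum(r["employees"] for r in headcount if r["date"] == "20150901")
print(total_on_date)   # 185

# Invalid: adding the Sales head count across dates double-counts people.
wrong_total = sum(r["employees"] for r in headcount if r["dept"] == "Sales")
print(wrong_total)     # 64 -- not a meaningful number
```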

Non-Additive: Measures which cannot be added across any dimension are known as non-additive measures.

[Image: non-additive measures]

We cannot total the profit margin across any of the dimensions, as shown in the above table.

Factless fact: This is not really another measure type. In practice it may happen that the fact table does not store any measure data at all and simply records a set of events. This is known as a factless fact.

[Image: factless fact]

In the above table there is no measure; all three columns are derived from dimensions only. There are no facts in this scenario, but we can still retrieve information such as how many students attended the Java course at a particular time.

Fact Types:

1) Transactional: The most common type of fact table; each record is a transaction event which involves most of the dimensions. It therefore contains robust data, which enables maximum slicing and dicing of the data.

2) Periodic snapshots: The measurements occurring over a standard period, such as a day, a week, or a month. These fact tables are uniformly dense in their foreign keys because even if no activity takes place during the period, a row is typically inserted in the fact table containing a zero or null for each fact.

3) Accumulating snapshots: The measurement events occurring at predictable steps between the beginning and the end of a process, e.g. the processing of an order. An order moves through specific steps until it is fully processed, and as the steps towards fulfilling the order are completed, the associated row in the fact table is updated. An accumulating snapshot table often has multiple date columns, each representing a milestone in the process.

Thank You

Lalitha

Data Analysis Techniques


The need for and the way of analysing data basically depend on the end and not on the source. That end is typically the need to perform analysis and decision making through the use of that source of data. Data analysis these days typically includes reporting, multidimensional analysis and data mining, which relate to “Display”, “Analyse” and “Discover” respectively. Depending on the type of data analysis, the requirements on the source data may vary.

     If reporting is required for analysis, then the simplest of data sources would give the best results. Query and reporting capability primarily consists of selecting associated data elements, perhaps summarizing them and grouping them by some category, and presenting the results: retrieving relevant data from the data warehouse, transforming it into the appropriate context, and displaying it in a readable format. Finally, the report is delivered to the end user in the desired output format, be it graph, pie chart or table, in the required output medium.

     If the objective is to perform multidimensional data analysis, a dimensional data model would be more appropriate. This type of analysis requires that the data model support a structure that enables fast and easy access to the data on the basis of numerous combinations of analysis dimensions. For example, you may want to know how many of a specific product were sold on a specific day, in a specific store, in a specific price range. Multidimensional analysis enables users to look at a large number of interdependent factors involved in a business problem and to view the data in complex relationships. End users are interested in exploring the data at different levels of detail, which is determined dynamically. The complex relationships can be analyzed through an iterative process that includes drilling down to lower levels of detail or rolling up to higher levels of summarization and aggregation. This is a data analysis operation whereby the user takes a different viewpoint than usual on the results of the analysis, changing the way the dimensions are arranged in the result. Like query and reporting, multidimensional analysis continues until no more drilling down or rolling up is performed.

     As said before, data mining is essentially “Discovery”. This discovery could take the form of finding significance in relationships between certain data elements, a clustering together of specific data elements, or other patterns in the usage of specific sets of data elements. After finding these patterns, the algorithms can infer rules. These rules can then be used to generate a model that can predict a desired behavior, identify relationships among the data, discover patterns, and group clusters of records with similar attributes. Data mining is most typically used for statistical data analysis and knowledge discovery. Statistical data analysis detects unusual patterns in data and applies statistical and mathematical modeling techniques to explain the patterns. Data mining is data driven. There is a high level of complexity in stored data and data interrelations in the data warehouse that is difficult to discover without data mining. Data mining offers new insights into the business that may not be discovered with query and reporting or multidimensional analysis, and it can help discover new insights about the business by giving us answers to unasked questions.

     These data analysis techniques offer new insights into the business by looking closely at the data and extracting quality information, which can be used for business intelligence and profitable business decisions that support business improvement and growth in different sectors.