Guide to Slowly Changing Dimensions [Intro + Type 1]

Firstly, what is a dimension?
A dimension is a structure that categorizes facts and measures so that they can be used to understand and answer business questions.

What is a Slowly Changing Dimension?
A Slowly Changing Dimension is a dimension whose attributes change slowly over time rather than on a regular schedule.

Why is it needed?
Data Warehousing deals with historical data, so tracking changes is important: it gives a better understanding of how the business has evolved.

An Example
Imagine a customer dimension in a taxation department that holds records such as the customer's address.
When a customer changes his/her address, it is important to track the change from the old value to the new one.

Types of Slowly Changing Dimensions:
1. Type 1
2. Type 2
3. Type 3
4. Type 4
5. Type 5

1. Type 1:
This technique simply overwrites the previous value, so no historic data is kept. It is commonly used for correcting or updating records.
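
Outside of PDI, the same Type 1 behaviour can be sketched directly in SQL. This is only a minimal illustration, assuming a hypothetical dim_customer table with a client_id business key and a region attribute:

-- Type 1: overwrite the attribute in place, keeping no history
UPDATE dim_customer
SET    region = 'NJ'
WHERE  client_id = 101;

-- new business keys are simply inserted as new rows
INSERT INTO dim_customer (client_id, name, region)
VALUES (104, 'New Customer', 'NY');

Notice there is no versioning column anywhere: once the UPDATE runs, the old value is gone.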

In this blog I will show you how to create a Type 1 Slowly Changing Dimension using Pentaho Data Integration:

Steps
1. Identify your source, or create one if you have to.
[Screenshot: CSV input]

2. Load the data into the database table.
[Screenshot: Load Data]

Now open a new transformation.
3. Select an input component for your source; in my case it's a CSV file input.
4. Select a Get System Info component [Optional].
Double-click on it and set:
For Type: select System Date (fixed)
For Name: load_date (or any name)

5. If you are using a Get System Info step, then this step is necessary.
Select a Join Rows (Cartesian product) component.
This makes a cross join of the new load_date column onto the rows coming from the source.
Select OK.

6. Select a Select Values component;
this is used to add, remove or alter fields in the stream before it reaches the target, if required.

Note: make sure all the components are connected as shown below:
[Screenshot: Before insert]
Double-click on the Select Values component.

Click on Get fields to select the fields to keep.
Click on the Meta-data tab and again select Get Fields to change if you need to alter types or formats.
Click OK!

7. Select an Insert/Update component.
Select your target schema.
Select your target table.

Now select Get Fields
and remove the fields you don't require, while leaving the main key or primary key used to look up values;
in my case, client_id.

Then

select Get Update Fields,
and in the Update column select N for any column you don't want to update;
in my case again, client_id.

Now, if you selected the Get System Info component to add a load date column, then the step below is important.

Remember we added load_date; we want to add that column to the target table,

so select SQL,
and

look through the pre-defined statement shown and you will see an ALTER TABLE to add the column load_date.
All we need to do is select Execute, then run a query to check that the column has been added, as shown below.

[Screenshot: load_date added]

Click OK!
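
For reference, the statement generated by the SQL button should look roughly like the one below; this is only a sketch in PostgreSQL syntax, and the table name scd1_customer is my own placeholder:

-- add the new load_date column to the existing target table
ALTER TABLE scd1_customer ADD COLUMN load_date TIMESTAMP;

-- quick check that the column is now present
SELECT client_id, region, load_date
FROM   scd1_customer
LIMIT  5;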

8. Now go to your source, edit a value and add a new row, as shown below.

[Screenshot: added_data]
Then save it.

I have added a new row and also edited Sohail's region from NY to NJ.

Your transformation should now look something like this, with all the connections pointing to the right components:

[Screenshot: after all]

Now run the transformation.

After it has run successfully,

check your data via a query.

[Screenshot: final data]

As you can see, the data has been replaced/overwritten for Sohail [NY to NJ],
and one new row has also been added.

In my next blog we will discuss Type 2 Slowly Changing Dimensions.

Thanks
Izebhijie Sohail Ehizogie

Handling Rollback and Database transactions in Pentaho Data Integration.

While inserting data into a database table, it often happens that a job/transformation fails partway through the load, and in that situation we usually want the whole transaction to be rolled back.

In general, the Table Output step in PDI will insert and commit any number of records until there is an error. Consider an example where the ETL process handles 5,000 records and the 3,999th record has a string instead of an integer (e.g. the DDL defines the field as Integer): the job/transformation will come to a sudden halt.

The problem now is that we already have 3,999 records in the DB table from this ETL run. In some situations this might not be an issue, e.g. if you have a CDC process set up which can deal with it (it can just start off from the highest id or timestamp), but in other scenarios this partial load is a problem.

Pentaho Data Integration provides a transactional feature which is really quite easy to use. Make sure your DB engine supports transactions! E.g. the MySQL MyISAM engine does not support transactions, so the transformation would run and insert records even though there was an error.

Transactions in transformations are enabled simply by ticking the “Make the transformation database transactional” option in the “Miscellaneous” tab of the transformation settings dialog.

[Screenshot: transformation settings – Miscellaneous tab]

Note the following caveats:

  • Enabling this option disables the Use batch update for insert option in the Table Output step and ignores the Commit Size setting, so it will effectively slow down the insert operation.
  • Database transactions will not work if you use database partitioning and multiple connections per step copy need to be created. Performance-wise it doesn’t make much sense in that case anyway.
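
To show what the transactional option means at the database level, here is a minimal sketch in plain SQL (PostgreSQL syntax, with a hypothetical employee table); everything runs in one transaction, so a failure anywhere leaves the table untouched:

BEGIN;                                   -- one transaction for the whole load

INSERT INTO employee (employee_number, first_name) VALUES (101, 'Ehizogie');
INSERT INTO employee (employee_number, first_name) VALUES (102, 'Fahad');
-- if any statement above fails, a ROLLBACK leaves the table exactly as it was

COMMIT;                                  -- on success, all rows become visible at once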

Beginner’s Guide to E.T.L (Extract, Transform and Load) – A Basic Process

Loading Data from Source to Target

Before we proceed, it’s important to identify the tool you will use to accomplish the ETL process. In my case I will be using the Pentaho Data Integration application (keep in mind that irrespective of the tool you use, the steps or procedures are similar, but the approach might differ).

The following Steps can be followed:

1. Identify your source; it could be one of the following:
a. C.S.V file
b. Text file
c. A database
d. and so on

In my scenario it is a C.S.V (comma-separated values) file.

2. Open up your Spoon.bat.
Create a new transformation and select an “Input”, then select what type of input you require:
we have Text File Input, CSV File Input, SAS Input, Table Input and so on. In my case, since I’ll be using a C.S.V file as a source, I’ll select the CSV File Input component.

3. Set up your connection based on your preferred connection type; in my case I’ll be using PostgreSQL.

[Read my next blog on setting up a connection using connection type: PostgreSQL]

4. Once the connection has been established you can right-click on the connection and select Share if it is a connection that all your transformations will be using; this makes the connection available to other transformations.

5. Since we will be sending data from a source to a target, we need to have an “Input” step as the source
and an “Output” step as the target.

6. Input

Download a C.S.V file from the internet,

or even create a TXT/C.S.V input file
as shown below.

Create a source if required
Text_Source (comma delimited)

Employee_Number,First_Name,Last_Name,Middle_Initial,Age,Gender,Title
101,Ehizogie,Izebhijie,Sohail,24,Male,Developer
102,Fahad,Anjum,WithClause,23,Male,Developer
103,Gayatri,Sharma,A,24,Female,Accountant

Save as txt or csv and this can be your input.

Here, since our input is from a CSV file,
we open up our CSV File Input component and set:
Step Name: any name
File Name: browse to the file path
Delimiter: , (comma)
Enclosure: ”

Then select Get Fields and select OK.
Preview your data.

7. Output
Open up the Table Output component.
Select the target schema.
Select the target table.

[Keeping in mind the table must already exist in the DB; see the sketch after this step.]

Select OK!

Right-click on the Table Output step to map the columns from source to target.

This is important in order to get the right data from the source into the right column in the target.
Then run the transformation.
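
Since the Table Output step expects the target table to exist, here is a minimal sketch of a matching PostgreSQL table for the sample file above; the table and column names are only my assumption:

-- target table matching the sample C.S.V columns
CREATE TABLE employee (
    employee_number INTEGER PRIMARY KEY,
    first_name      VARCHAR(50),
    last_name       VARCHAR(50),
    middle_initial  VARCHAR(20),
    age             INTEGER,
    gender          VARCHAR(10),
    title           VARCHAR(50)
);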

As a beginner, keep in mind that
errors are bound to occur,
such as the type of data from
the source not matching your target table format,
and so on.

Here we can add a small transformation step to convert the data and take care of such format errors.
[In my next blog we can look into handling that]

Now go to your target database and run SELECT * FROM table_name.

there you go!

Thanks
Sohail Izebhijie

Beginner’s Guide to E.T.L (Extract, Transform and Load) – Introduction

Introduction into E.T.L (Extract, Transform and Load)
This is a process related to data warehousing which involves extracting data out of the source system(s)
and placing it into a repository or target.

Extraction
Data is extracted from the source systems (flat files or other operational systems) and converted into one consolidated format which is ready for transformation.

Transformation
Transforming the data may involve the following tasks:

  • Cleaning: one of the most important tasks in the transformation stage, because the source data will often contain values that the target system doesn’t support or understand, hence cleaning is required.
    In some cases the data comes from many source inputs, so lookups are important to avoid duplication.
  • Filtering: the source data may have a great many rows, but it’s important to send only relevant data to your target and filter out the unnecessary data.
  • Business Rules: calculations or derivations can be performed here so we have correct and readable data at the target (see the sketch after this list).

and many more.
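
As a rough illustration of these three ideas in plain SQL (the staging_employee and target_employee table names are purely hypothetical):

-- cleaning, filtering and a simple business rule in one pass
INSERT INTO target_employee (employee_number, full_name, age_group)
SELECT employee_number,
       TRIM(first_name) || ' ' || TRIM(last_name),          -- cleaning: strip stray spaces
       CASE WHEN age < 30 THEN 'young' ELSE 'senior' END    -- business rule: derived column
FROM   staging_employee
WHERE  employee_number IS NOT NULL;                         -- filtering: drop unusable rows

In a tool like Pentaho Data Integration the same work is done by individual steps rather than one SQL statement, but the logic is the same.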

Loading

After proper transformation, when the data matches the business rules, loading it into a target or repository is the final step in E.T.L (Extract, Transform and Load).

In my next blog we will look into the basics of loading data from source to target.

Thanks
Sohail Izebhijie

Merge Join Vs Stream lookup in Pentaho DI

Merge Join:

It joins two data sets coming from two input steps (for example two Table Inputs). In this step the following types of joins are available:

  • FULL OUTER: all rows from both sources will be included in the result, with empty values for non-matching keys in both data streams
  • LEFT OUTER: all rows from the first source will be in the result, with empty values for non-matching keys in the second data stream
  • RIGHT OUTER: all rows from the second source will be in the result, with empty values for non-matching keys in the first data stream
  • INNER JOIN: only rows having the same key in both sources will be included in the result

Note: in this step, rows are expected to be sorted on the specified key fields. Placing a Sort rows step before the join takes care of this.
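
The four options map directly onto standard SQL joins. A minimal sketch, using hypothetical orders and customers tables:

-- INNER JOIN: only rows with matching keys in both streams
SELECT o.order_id, c.customer_name
FROM   orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;

-- LEFT OUTER keeps all orders, RIGHT OUTER keeps all customers,
-- FULL OUTER keeps all rows from both sides, padding non-matches with empty values
SELECT o.order_id, c.customer_name
FROM   orders o
FULL OUTER JOIN customers c ON o.customer_id = c.customer_id;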

 Stream Lookup:

The Stream Lookup step allows you to look up data using information coming from another step in the transformation. The data from the lookup step is first read into memory, and the rows of the main stream are then enriched by looking up values in it.

Preserve memory:  Encodes rows of data to preserve memory while sorting. (Technical background: Kettle will store the lookup data as raw bytes in a custom storage object that uses a hashcode of the bytes as the key. More CPU cost related to calculating the hashcode, less memory needed.)

Key and value are exactly one integer field: preserves memory while executing the lookup. Note: works only when “Preserve memory” is checked. Cannot be combined with the “Use sorted list” option.
(Technical background: the lookup data is stored in a custom storage object that is similar to the byte-array hashmap, but it doesn’t have to convert to raw bytes; it just takes a hashcode of the long.)

Use sorted list: enable this to store values using a sorted list; this provides better memory usage when working with data sets containing wide rows. Note: works only when “Preserve memory” is checked. Cannot be combined with the “Key and value are exactly one integer field” option. (Technical background: the lookup data is put into a tuple and stored in a sorted list; lookups are done via a binary search.)

Thank you

Lalitha

Memory Group By Vs Group By in Pentaho DI

Group By and Memory Group By: these steps are used to calculate aggregate values over groups of fields.

Even though both serve the same purpose, there are some differences and similarities, as follows:

1) Group By works correctly only when the input rows are sorted on the grouping fields; otherwise it will produce a result per consecutive run of identical key values. So before using the Group By step the input should be sorted. Memory Group By, on the other hand, handles unsorted records as well: it builds the groups in memory and gives the output based on the selected aggregate option.

2) In Group By there is an option to include all rows. If you enable this option it will output all the records, not just the aggregated records; to differentiate these two types of rows in the output, we have to provide a boolean flag column. Memory Group By does not have this option.

3) In Group By we have a Temporary Files Directory option, which becomes available when Include all rows is enabled. It selects the directory in which temporary files are stored when the number of grouped rows exceeds 5,000; the default is the system’s standard temporary directory. Because this option lets the step spill data to disk, Group By is preferred for large data volumes. Memory Group By cannot handle huge amounts of data, since it keeps all the groups in memory, which causes performance issues; it works fine for small amounts of data only. So when the number of rows is too large to fit into memory, use the combination of the Sort rows and Group By steps.

4) Group By has one more additional option, Add line number, restart in each group: enable it to add a line number that restarts at 1 in each group.

5) Always give back a result row: this option is the same in both Group By and Memory Group By. If you enable it, the step will always return a result row, even if there is no input row. This can be useful if you want to count the number of rows; without this option you would never get a count of zero (0).

6) The fields that make up the group: this option also works the same in both steps. After retrieving fields using the Get Fields button, designate the fields to include in the group.

7) Aggregates: both Group By and Memory Group By offer the same aggregate functions. Specify the fields that must be aggregated, the method, and the name of the resulting new field (see the sketch after the list below).
Here are the available aggregation methods:

  • Sum
  • Average (Mean)
  • Median
  • Percentile
  • Minimum
  • Maximum
  • Number of values (N)
  • Concatenate strings separated by , (comma)
  • First non-null value
  • Last non-null value
  • First value (including null)
  • Last value (including null)
  • Cumulative sum (all rows option only!)
  • Cumulative average (all rows option only!)
  • Standard deviation
  • Concatenate strings separated by <Value>: specify the separator in the Value column
  • Number of distinct values
  • Number of rows (without field argument)
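
In SQL terms, what either step computes corresponds roughly to a GROUP BY with aggregate functions. A minimal sketch against a hypothetical sales table:

-- a few of the aggregation methods above, expressed as SQL aggregates
SELECT product_id,
       SUM(amount)            AS total_amount,      -- Sum
       AVG(amount)            AS average_amount,    -- Average (Mean)
       MIN(amount)            AS minimum_amount,    -- Minimum
       MAX(amount)            AS maximum_amount,    -- Maximum
       COUNT(amount)          AS number_of_values,  -- Number of values (N)
       COUNT(DISTINCT amount) AS distinct_values,   -- Number of distinct values
       COUNT(*)               AS number_of_rows     -- Number of rows
FROM   sales
GROUP BY product_id;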

Thank You

Lalitha

ETL Checklist

An ETL checklist lists the things to know when you are starting an ETL project.

Database Connections: the connection metadata and the driver jar files required for the source and target databases.

Source data: we may have multiple sources, and sometimes a single source. We need to check whether there is any redundant data, and we need to confirm the data types, the data format, and the basis on which the data is formatted.

Decide how the customer wants to maintain historical data: are there any flags for deleted/invalid records, and when should SCD Type 2 or SCD Type 3 changes be used?

Logging and Auditing: we need to log enough detail to track exactly where an error occurred, because a project will have a number of jobs and sub-jobs.

So we have to track how many records were successfully inserted/updated,

how many records got rejected, and the job name, sub-job name and component name (so that if any error occurs we can trace it easily). A sketch of such an audit table follows below.
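
A minimal sketch of what such an audit table could look like (the table and column names are just an assumption; adapt them to your project):

-- one row per job/sub-job run, written by the ETL process itself
CREATE TABLE etl_audit_log (
    run_id          SERIAL PRIMARY KEY,
    job_name        VARCHAR(100),
    sub_job_name    VARCHAR(100),
    component_name  VARCHAR(100),
    rows_inserted   INTEGER,
    rows_updated    INTEGER,
    rows_rejected   INTEGER,
    error_message   TEXT,
    run_started_at  TIMESTAMP,
    run_finished_at TIMESTAMP
);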

Rollback and Restarting Jobs: in an ETL process the affected system is always the target database, so we need to keep checkpoints in our job in case an error occurs and the job is stopped. When the job is restarted it should resume from the last saved checkpoint, so there is no need to start the entire process again; this saves time and also improves the performance of the project.

Batch Loading: when we have huge volumes of data, it is always better to load the data in multiple chunks rather than loading everything at once, so batch loading should be implemented.

Dependencies: in the target database we need to check for dependencies such as lookup tables and parent tables connected by foreign key relationships. Do they have compatible data types? The parent tables need to be loaded first so that the foreign key relationships are not violated and are easy to handle.

Alerting/Notifications: we have to implement an alerting/notification mechanism. If any error occurs it should trigger a mail including the error description, job name, component name, etc.

Documentation: this is also an important task. Document the mappings in some agreed format; they will be useful as a reference guide.

Thank you

Lalitha

 

Types of Facts in Data warehouse

What is Fact Table?

A fact table consists of the measures/metrics/facts of a business process. Typically it contains two types of columns: foreign keys holding the primary keys of the dimension tables, and the measures themselves, such as sales amount, revenue, units of product, etc.

[Example table: fact table structure]

Measures Types:

Additive: measures that can be added across all the dimensions are known as additive measures.

[Example table: additive measures]

In the above table you can add the sales amount by Product, by Customer and also by Date.
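
In SQL, additivity simply means the SUM stays valid whichever dimension you group by. A sketch against a hypothetical sales_fact table:

-- the same additive measure summed along different dimensions
SELECT product_key,  SUM(sales_amount) AS total_sales FROM sales_fact GROUP BY product_key;
SELECT customer_key, SUM(sales_amount) AS total_sales FROM sales_fact GROUP BY customer_key;
SELECT date_key,     SUM(sales_amount) AS total_sales FROM sales_fact GROUP BY date_key;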

Semi-Additive: measures that can be added across only some of the dimensions are known as semi-additive measures.

[Example table: semi-additive measures]

Department-wise we can calculate the total number of employees in the organization, but across dates you cannot say the total number of employees in the Sales department is 29+35=64; that would be wrong data. You can, however, calculate the total number of employees on 20150901 as 35+150=185. So adding the measure makes sense only for some of the dimensions.

Non-Additive: measures that cannot be added across any dimension are known as non-additive measures.

[Example table: non-additive measures]

We cannot total the profit margin across any of the dimensions, as shown in the above table.

Factless fact: this is not another measure type; rather, it can happen in practice that a fact table does not store any measures at all and simply records a set of events. Such a table is known as a factless fact table.

[Example table: factless fact]

In the above table there is no measure; all three columns are foreign keys to dimensions. Even so, we can still retrieve information such as how many students attended the Java course at a particular time.
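
Even without a measure column, a factless fact is still queryable by counting rows. A sketch against a hypothetical attendance_fact table:

-- count events per course and date even though there is no measure column
SELECT course_key, date_key, COUNT(*) AS attendances
FROM   attendance_fact
GROUP BY course_key, date_key;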

Fact Types:

1) Transactional: the most common type of fact table; each record is a transaction event that involves most of the dimensions. It therefore contains the most detailed data, which enables maximum slicing and dicing.

2) Periodic snapshots: measurements taken over a standard period, such as a day, a week, or a month. These fact tables are uniformly dense in their foreign keys because even if no activity takes place during the period, a row is typically inserted containing a zero or null for each fact.

3) Accumulating snapshots: measurement events occurring at predictable steps between the beginning and the end of a process, e.g. the processing of an order. An order moves through specific steps until it is fully processed; as steps towards fulfilling the order are completed, the associated row in the fact table is updated. An accumulating snapshot table often has multiple date columns, each representing a milestone in the process.

Thank You

Lalitha

Data Analysis Techniques

The need and the way to analyse data depend on the end use, not on the source. That end is typically the need to support analysis and decision making through the use of that data. Data analysis these days typically includes reporting, multidimensional analysis and data mining, which relate to “Display”, “Analyse” and “Discover” respectively. Depending on the type of data analysis, the requirements on the source data may vary.

     If reporting is all that is required, then the simplest of data sources will give the best results. Query and reporting capability primarily consists of selecting associated data elements, perhaps summarizing them and grouping them by some category, and presenting the results: retrieving relevant data from the data warehouse, transforming it into the appropriate context, and displaying it in a readable format. Finally, the report is delivered to the end user in the desired output format, be it graph, pie chart or table form, in the required output medium.

     If the objective is to perform multidimensional data analysis, a dimensional data model would be more appropriate. This type of analysis requires that the data model support a structure that enables fast and easy access to the data on the basis of any of numerous combinations of analysis dimensions. For example, you may want to know how many of a specific product were sold on a specific day, in a specific store, in a specific price range. Multidimensional analysis enables users to look at a large number of interdependent factors involved in a business problem and to view the data in complex relationships. End users are interested in exploring the data at different levels of detail, which is determined dynamically. The complex relationships can be analyzed through an iterative process that includes drilling down to lower levels of detail or rolling up to higher levels of summarization and aggregation. A related operation lets the user take a different viewpoint on the results of the analysis, changing the way the dimensions are arranged in the result. Like query and reporting, multidimensional analysis continues until no more drilling down or rolling up is performed.

     As said before, data mining is essentially “Discovery”. This discovery could take the form of finding significance in relationships between certain data elements, a clustering together of specific data elements, or other patterns in the usage of specific sets of data elements. After finding these patterns, the algorithms can infer rules. These rules can then be used to generate a model that can predict a desired behavior, identify relationships among the data, discover patterns, and group clusters of records with similar attributes. Data mining is most typically used for statistical data analysis and knowledge discovery. Statistical data analysis detects unusual patterns in data and applies statistical and mathematical modeling techniques to explain the patterns. Data mining is data driven. There is a high level of complexity in stored data and data interrelations in the data warehouse that are difficult to discover without data mining. Data mining offers new insights into the business that may not be discovered with query and reporting or multidimensional analysis; it can help discover new insights about the business by giving us answers to unasked questions.

     These data analysis techniques offer new insights into the business by taking a keen look at the data and turning it into quality information, which can then be used for business intelligence and profitable business decisions that support business growth across different sectors.

Types of Dimensions in Data warehouse

What is Dimension?

A dimension table contains descriptive data about the business. The primary keys of the dimension tables are used in fact tables through foreign key relationships, and the remaining columns of the dimension hold the descriptive information about the business objects.

E.g. Product, Customer, Orders, Company, Date, etc.

Below are the different types of Dimensions:

1) Slowly Changing Dimensions (SCD): dimensions that change very slowly over time rather than on a regular schedule.

Below are some popular approaches of SCDs:

Type 0: this is the passive method; records, once inserted, are never changed.

Type 1: This method overwrites the old record with new values without tracking any historical data.

[Example tables: SCD Type 1 before and after]

Type 2: this method tracks historical data by using a version number or start-date and end-date columns. The drawback is that the table grows vertically and requires more space.

[Example table: SCD Type 2]
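
As a rough SQL sketch of the Type 2 idea (hypothetical dim_customer table with start_date and end_date columns): the current row is closed off and a new version is inserted instead of being overwritten.

-- close the current version of the record
UPDATE dim_customer
SET    end_date = CURRENT_DATE
WHERE  client_id = 101
AND    end_date IS NULL;

-- insert the new version with the changed attribute
INSERT INTO dim_customer (client_id, region, start_date, end_date)
VALUES (101, 'NJ', CURRENT_DATE, NULL);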

Type 3: this method also tracks historical data, by adding an extra column that holds the previous value alongside the new one. It preserves only limited history.

[Example table: SCD Type 3]

Type 4: this method uses a separate history table for storing all the historical data, while the main table holds only the current data; both are linked by the same surrogate key.

[Example tables: SCD Type 4]

Type 6: this is called the hybrid method and combines Type 1 + Type 2 + Type 3. It uses flag values and effective dates to differentiate between old and new records.

2) Rapidly Changing Dimensions: dimensions that contain rapidly changing attributes. If we maintain historical data directly in these tables, we will quickly run into storage and performance issues. The solution is to maintain mini-dimension tables for the historical data, similar to an SCD Type 4 dimension: the main table holds the current values and the mini-dimensions hold the historical data.

3) Junk Dimensions: in data warehouse design we often come across low-cardinality flag or indicator values. We can combine them into one single dimension table, so that the fact table does not need multiple columns to store the primary keys of separate flag tables.

[Example table: junk dimension]

As per the example above, if we use a combined Gender_Marital_Status junk dimension, we need only a single key column in the fact table.
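
A minimal sketch of that idea (all names are hypothetical): one junk-dimension row per combination of flags, and a single key column in the fact table.

-- one row per combination of the low-cardinality flags
CREATE TABLE junk_gender_marital (
    junk_key       INTEGER PRIMARY KEY,
    gender         VARCHAR(10),
    marital_status VARCHAR(10)
);

-- the fact table then needs only one column, e.g.
-- sales_fact (..., junk_key INTEGER REFERENCES junk_gender_marital (junk_key), ...)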

4) Inferred Dimensions: when a dimension record is needed to load a fact table but its details are not yet available, we can insert a dummy record for that ID and use that ID in the fact table. Once the real details arrive, we update the record in the dimension.

5) Conformed Dimensions: dimensions that are shared by multiple fact tables are known as conformed dimensions.

E.g. the Customer dimension and the Product dimension are required by the Shipment fact, Sales fact and Service Request fact.

6) Degenerate Dimensions: a degenerate dimension does not have its own dimension table; it lives in the fact table. It is a column (dimension attribute) that is part of the fact table but does not map to any dimension table.

E.g. the Invoice_Number column in a fact table.

[Example table: degenerate dimension]

7) Role Playing Dimensions: the same dimension used for multiple purposes.

E.g. the Date dimension can be used as Date of Sale, Date of Hire, etc.

8) Shrunken Dimensions: a shrunken dimension is a subset of another dimension. For example, the Orders fact table may include a foreign key for Product, but the Target fact table may include a foreign key only for ProductCategory, which is in the Product table but much less granular. Creating a smaller dimension table, with ProductCategory as its primary key, is one way of dealing with this situation of heterogeneous grain. If the Product dimension is snowflaked, there is probably already a separate table for ProductCategory, which can serve as the shrunken dimension.

9) Static Dimensions: Static dimensions are not extracted from the original data source, but are created within the context of the data warehouse. A static dimension can be loaded manually — for example with Status codes — or it can be generated by a procedure, such as a Date or Time dimension.

Thank You

Lalitha