Merge Join Vs Stream lookup in Pentaho DI

Merge Join Vs Stream lookup in Pentaho DI

Merge Join:

It joins two data sets which are coming from two table Inputs. In this steps the below Types of Joins are available:

  • FULL OUTER: all rows from both sources will be included in the result, with empty values for non-matching keys in both data streams
  • LEFT OUTER: all rows from the first source will be in the result, with empty values for non-matching keys in the second data stream
  • RIGHT OUTER: all rows from the second source will be in the result, with empty values for non-matching keys in the first data stream
  • INNER JOIN: only rows having the same key in both sources will be included in the result

Note: In this step rows are expected in to be sorted on the specified key fields. When using the Sort step, this works fine.

 Stream Lookup:

The Stream lookup step type allows you to look up data using information coming from other steps in the transformation. The data coming from the Source step is first read into memory and is then used to look up data from the main stream.

Preserve memory:  Encodes rows of data to preserve memory while sorting. (Technical background: Kettle will store the lookup data as raw bytes in a custom storage object that uses a hashcode of the bytes as the key. More CPU cost related to calculating the hashcode, less memory needed.)

Key and value are exactly one integer field: Preserves memory while executing a sort by . Note: Works only when “Preserve memory” is checked. Cannot be combined with the “Use sorted list” option.
(Technical background: The lookup data is stored in a custom storage object that is similar to the byte array hashmap, but it doesn’t have to convert to raw bytes. It just takes a hashcode of the long.)

Use sorted list: Enable to store values using a sorted list; this provides better memory usage when working with data sets containing wide row. Note: Works only when “Preserve memory” is checked. Cannot be combined with the “Key and value are exactly one integer field” option. (Technical background: the lookup data is put into a tuple and stored in a sorted list. Lookups are done via a binary tree search.)

Thank you

Lalitha

Full Outer Join by using tmap in Talend

Full Outer Join by using tmap in Talend

In tMap, the default Join Model is Left Outer Join and also we have  Inner Join as another Join Model. Suppose if we need the result for full outer join ,you can follow the below process.

I have Customers.csv and City.csv as my two delimited files. where City is the common field in the both the files.

First, please find the data in the two csv files as below:

Customers.csv:

customers

City.csv:

city

Following is the Talend Job :

job

First create a FileDelimited Metadata in the repository as shown in above picture.

Consider Customers as main link and city as lookup and join both to tMap component. In tMap you have to select Join Model as Inner Join and catch the inner join output reject records and configure as shown in below screen-shot:

tmap1

Now as shown in the job image, again take same city and customer inputs but change city as main link and customers as lookup.

In tmap select the Join Model as Left Outer join and also configure the tMap2 component as shown below:

tmap2

Now take tunite component to retrieve the innerjoin reject values from tMap1 and left outer join values from tMap2. And now take tsortrow to sort the result in asc/desc order(I consider custid column) and then the final result stored on fulljoin.csv file.

And the out put is as follows:

fulljoin

Thanks and Regards,

Lalitha