
Loading CSV File BatchWise -Talend




In an ETL job where the source is a flat file, for example a large CSV file, reading the whole file at once puts a heavy load on the server that runs the job. Instead of reading the whole file, reading it in smaller batches and processing them is a better option. The following job explains how this is done: a Java routine reads the CSV file with a BufferedReader and splits it into smaller chunk files, which are then processed from the folder where the chunk files are placed.


Following is the Java code, which reads the file line by line and writes out a new chunk file each time the batch size (a configured number of lines) is reached.


package routines;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public class splitfile {

    private File file;
    private int numberOfLinesThreshold;
    private String encoding;
    private int readerBufferSize;
    private String destinationFileName;

    public splitfile(File file, int numberOfLinesThreshold, String encoding,
            int readerBufferSize, String destinationFileName) {
        this.file = file;
        this.numberOfLinesThreshold = numberOfLinesThreshold;
        this.encoding = encoding;
        this.readerBufferSize = readerBufferSize;
        this.destinationFileName = destinationFileName;
    }

    public void splitFile() {
        String absolutePath = this.file.getAbsolutePath();
        int index = absolutePath.lastIndexOf(".");

        String path = absolutePath.substring(0, index);
        String extension = absolutePath.substring(index + 1);

        // Store the header so it can be repeated in each chunk file
        String header = "";
        int fileNameCounter = 1;
        int counter = 0;
        StringBuilder stringBuilder = new StringBuilder();
        String currentLine;
        String lineSeparator = System.lineSeparator();

        String fileName = path + "." + extension;

        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName),
                        this.encoding), this.readerBufferSize)) {
            while ((currentLine = bufferedReader.readLine()) != null) {
                if ("".equals(header)) {
                    header = currentLine;
                }

                counter++;

                if (counter == this.numberOfLinesThreshold) {
                    // Reset the counter for the next chunk
                    counter = 0;

                    stringBuilder.append(currentLine);
                    // Flush the accumulated lines to a chunk file
                    writeFile(fileNameCounter, stringBuilder,
                            this.destinationFileName, extension);

                    // For the next file name
                    fileNameCounter++;

                    // Start the next chunk with the header line
                    stringBuilder = new StringBuilder();
                    stringBuilder.append(header).append(lineSeparator);
                    // The header already counts as one line in the new chunk
                    counter++;
                } else {
                    stringBuilder.append(currentLine).append(lineSeparator);
                }
            }
            // Write the last few remaining lines as another file
            writeFile(fileNameCounter, stringBuilder, this.destinationFileName,
                    extension);
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }

    private void writeFile(int fileNameCounter, StringBuilder stringBuilder,
            String fileName, String extension) {
        // Create a chunk file and write out the accumulated lines
        String name = getFileName(fileNameCounter, fileName, extension);
        try (BufferedWriter bufferedWriter = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(name),
                        this.encoding))) {
            bufferedWriter.write(stringBuilder.toString());
            bufferedWriter.flush();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }

    private String getFileName(int fileNameCounter, String fileName,
            String extension) {
        return fileName + "_" + fileNameCounter + "." + extension;
    }
}

ETL_Job

Following is the main job, which calls the child jobs and passes them the path of the folder containing the smaller chunk files.

image1

Step 1: Create Split Files folder

Creates a folder to hold all the split files, and writes the folder path to the context variable “Src_Split”. This folder is created in “Src_Dir” with the name Split_(File Name).

image2

Step 2: Split File

Calls the routine “splitfile”, which reads the file in parts with the help of a BufferedReader. It takes 5 parameters:

1. A File instance for the source file, with an escape sequence applied to the path:
new java.io.File(escape.escapeSeq(context.Src_Dir + context.Src_Csv_Org))

2. The batch size, taken from the context variable “BatchSize”.

3. The character encoding of the file, taken from the context variable “Encodes”.

4. The buffer size, taken from the context variable “BufferSize”.

5. The destination folder created in Step 1, with the escape sequence applied to handle “\” in the path, followed by the file name: context.Src_Split + context.Src_Csv. (Src_Csv is just the name of the file with no extension.)
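Put together, the routine call from the tJava component might look like the following sketch. The context variable names follow the list above and are assumptions; adjust them to match your own job.

```java
// Sketch of the tJava call — parameter order matches the list above.
// escape.escapeSeq is the routine referenced in item 1.
new routines.splitfile(
        new java.io.File(escape.escapeSeq(context.Src_Dir + context.Src_Csv_Org)), // 1. source file
        context.BatchSize,                  // 2. lines per chunk
        context.Encodes,                    // 3. file encoding, e.g. "UTF-8"
        context.BufferSize,                 // 4. reader buffer size in bytes
        context.Src_Split + context.Src_Csv // 5. destination folder + file name (no extension)
).splitFile();
```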

image3

Step 3: Split Folder

Set the Directory to “context.Src_Split” and the File Mask to “*.csv”.

Step 4: CSV2Stg_Stg_Dim_Organization

Calls the job which reads the files and processes them to load Stg_Dim_Organization. Just set 2 context variables:

1. Src_Folder to ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))

2. Src_Csv to ((String)globalMap.get("tFileList_1_CURRENT_FILE"))

image4

Step 5: CSV2Stg_Stg_Dim_Department

Calls the job which reads the files and processes them to load Stg_Dim_Department. Just set 2 context variables:

1. Src_Folder to ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))

2. Src_Csv to ((String)globalMap.get("tFileList_1_CURRENT_FILE"))

Using MDM components in Talend



There are basically 2 ways a component can be used in Talend:

1. Built-in


With Built-in, you give in all the properties of a component manually yourself: details like server connection details, entities, elements, XPath, etc.

img25

Once you have given all the connection details as shown in the above image, go to the Advanced settings and manually give the Loop XPath Query, and in the mapping give the Columns and XPath Query as defined for the entity in the MDM-UI.

img26

2. Repository


Like any other DB connection, one has to create an MDM connection in the Repository in order to drag and drop the schema of the MDM components.

Step1: Right click on MDM Connection, listed under Metadata in the left-side pane, and select Create New Connection.

img27


Step2: Give in the following details and click next

img28


Step3: Give in the server details like Username, Password and URL, and click on Check to verify the connection.

img29


Step4: If the given details are correct and connection is established a window pops up like the following.

img30


Step5: Select the data model and data container which you created in the MDM-UI; only those that are deployed and available are shown in the dropdown menu.

img31


Step6: Retrieve the entities by right-clicking on the instance created in the repository and selecting Retrieve Entity.

img32


Step7: Select the mode you want to use the entity in: as an Input, Output or Receive.

img34

Step8: Select the entity from the dropdown menu, give the name and click Next.

img33


Step9: Give the target schema XPath and click Finish.

img34


The entity is now available in the metadata, on the left side of the UI.

img34


This entity can now be dragged and dropped into a job. Its schema is also available, so the manual inputs required in Built-in mode are not needed.

img34

Work Flow of Talend MDM




 
Talend MDM stores and manages the master data. It also provides a platform to do data modelling for the master data; domains are called entities here. The data is stored in XML format or in a relational SQL database, based on the settings made according to preference. There are 2 important parts of MDM one must understand before starting with MDM:

  1. Data Model: This is where one can define the schema and assign properties to an entity (domain). The structure of an entity can be defined here: keys, datatypes, columns, foreign key constraints, dependencies, etc.
  2. Data Container: This is where the data physically exists. It can be either XML format or SQL, as per the settings made while installing MDM. Once the schema/structure is defined, the data container is provided with the XPath of the structure defined in the data model.

Following is an example of defining the product data model, with the entities Product, Sales and Customers.

Step1: Go to the Talend UI with MDM and create a sample project, selecting Java or Perl as the source language. Then in the UI, in the MDM Repository pane on the left-hand side, right click on Data Model and click on +NEW to add a data model.

img15

Give the name as product and click “OK”.
Step2: Give the description. Add an entity by right-clicking on the pane below the data model entities, and name it “Product”.

img16
Step3: Define the key, change its simple type to “Custom” and select “Auto Increment” from the drop-down.

img17
Step4: Add the elements after the key: one for “Description” and one for “Code”.

img18

Step5: Repeat all the steps and create 2 more entities with the names “Customer” and “Sales”.

Step6: Add an extra element “product” in Customer and put a foreign key constraint on it by right-clicking and pointing it to the “ProductId” element of the Product entity. Also add a “product” element in Sales and do the same.

img19

Step7: Then create a view by right-clicking and selecting Generate Default Browse Items View.

img20

You can view the items under View, under Web Filters in the left panel.

img21

Step8: Deploy the data model, data container and views to the local server by adding the server details in the Properties pane below.

img22

Note: A data container is automatically created for the product, assigning an XPath to all the elements in all the entities.

Finally, you can see it on the server at http://localhost:8180/talendmdm/ui . Product is available in both the data model and the data container. Click on “Save”, and click on “Browse master data items” to view the entities.

img23

There are 2 ways to put in the data:
1. Directly, a single record at a time, from the browser (one must have admin rights for write permissions).
2. By import (you can export the empty entity to get the schema and format in which it accepts the data).


This data can be used in the Data Integration tool by using the MDM components: tMDMInput to read from MDM and tMDMOutput to write data into MDM. These are similar to the components for any other database.

img24

One can use the “Data Profiling” perspective of Talend to clean the data and optimize the usage of MDM.

 
Asha Bidiyasar

Install and setup Talend MDM

Master Data Management -MDM

Master data is the mostly consistent data that is spread across the various systems in an enterprise, and possibly those of its partners. It mostly contains Id, Code, Description and other properties which define a particular domain. It is not the quantitative data related to a single transaction; in simple words, it is not operational data.

MDM basically generates a key and manages the master data; it involves data cleaning, governance, tracking and control.

Installation and setup of Talend MDM:


Install the MDM server from the Talend website for the compatible version. This document covers installation of v6.0.0 of the MDM server. There are 2 options: 1. an executable jar, 2. a zip file to extract. Select either one of them and proceed the same way.
There are 2 folders in it. One is the UI and the other is the server.

  1. The UI is nothing but Talend Data Integration with the MDM perspective included. (TOS_MDM-Studio-20150702_1326-V6.0.0)
  2. The MDM server runs on Tomcat; it uses Apache Tomcat version 8.0.20. (TOS_MDM-Server-20150702_1326-V6.0.0)



img1


Step1: If your system pops up a firewall permission request, grant it to run the application.

img2


Step2: Select the language and click OK.


img3



Step3: Just click Next to proceed.

img4



Step4: Accept the license and click Next.

img5



Step5: Select both the server and the application if you do not have the Talend UI for MDM.

img6


Step6: Give in the path to install the server.

img7


Step7: Select the database type. The Community edition gives the option of the H2 Embedded database; select that and click Next.

img8


Step8: Select from the checkbox whether you want to use a single user for all the data containers or create an individual database for each. Then give in the username and password to proceed.

img9



Step9: Give in the directory for database.

img10


Step10: Cross-verify the paths and installation packs.

img11


Step11: Click Next and finally Finish to complete the installation.
img12


Run the server by running the “startup.bat” executable file from ……\TOS_MDM-Server-20150702_1326-V6.0.0\apache-tomcat-8.0.20\bin

img13

Run it on browser using localhost:8180/talendmdm/

There are 2 default users:

  • Administrator: username/password -> has all rights (read/write)
  • User: username/password -> read only

img14

    Processing Multi schema files in Talend


    Sometimes we get input files which contain data in different schema formats within a single file. We can process these files to get the data schema-wise by using tFileInputMSDelimited.

    Raw Data:

    Here is a sample raw data file, in which 3 different schema types exist.

    Rawdata

    If you observe the first column, it represents the type of that data:

    ORD means it is Order details

    CUS means it is Customer Details

    and TRS means it is Transactions.

    All these 3 types exist in the same file. But our requirement is to have each of these 3 types of data in a separate file.
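Conceptually, tFileInputMSDelimited routes each record according to its first column. A minimal Java sketch of that routing logic follows; the field separator and sample values are assumptions based on the raw data described above, not the component's actual implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiSchemaSplit {

    // Group raw lines by their first (record-type) column, e.g. ORD / CUS / TRS.
    static Map<String, List<String>> splitBySchema(List<String> lines, String separator) {
        Map<String, List<String>> bySchema = new LinkedHashMap<>();
        for (String line : lines) {
            String type = line.split(separator, 2)[0];
            bySchema.computeIfAbsent(type, k -> new ArrayList<>()).add(line);
        }
        return bySchema;
    }

    public static void main(String[] args) {
        List<String> raw = List.of(
                "ORD;1001;2023-01-05",
                "CUS;C01;Alice",
                "TRS;T99;1001;C01",
                "ORD;1002;2023-01-06");
        // Each schema type would then be written to its own output flow/file
        splitBySchema(raw, ";").forEach(
                (type, rows) -> System.out.println(type + " -> " + rows.size() + " rows"));
    }
}
```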

    Job:

    Create the below job to get separate files/outputs for each schema type

    In the palette you will find the tFileInputMSDelimited component; drag it onto the work area.

    Now open component tab to configure it as shown below:

    component

    Click on the Multi Schema Editor and configure it as shown in the below screenshot:

    Component_config

    If the column separators are different for each schema you can enable Use Multiple Separators option.

    After getting the preview, click on Fetch Codes.

    fetchcodes

    Now take the 3 output components, whichever you need. I selected tLogRow components to display the result on the console.

    Execute the job:

    Job exec

    Now check the output you will get three different tables based on the schema type.

    Output:

    Output

    Thanks and Regards,

    Lalitha

    Handling Null values in Talend and Pentaho


    Handling Null values in Talend:

    Make all the columns Nullable in tMap.

    Now, for the columns which have null values, you can assign default values by writing a ternary operator as shown in the below screenshot.
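The tMap expression is a plain Java ternary. As a standalone sketch (the column name and default value here are made up for illustration, not taken from the screenshot):

```java
public class NullDefaults {

    // Mirrors a tMap expression of the form: row1.city == null ? "Unknown" : row1.city
    static String defaultIfNull(String value, String defaultValue) {
        return value == null ? defaultValue : value;
    }

    public static void main(String[] args) {
        System.out.println(defaultIfNull(null, "Unknown"));      // prints "Unknown"
        System.out.println(defaultIfNull("Hyderabad", "Unknown")); // prints "Hyderabad"
    }
}
```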

    talend_tmap

    Handling Null values in Pentaho:

    In Pentaho, to handle null values we have a component to provide a default value for the null values:

    ‘If field value is null’ is the component name.

    You can assign default value for null values for the fields and also as per the datatypes.

    You can configure the component as shown below:

    pentaho_null

    Thank you.

    Lalitha

    Full Outer Join by using tmap in Talend


    In tMap, the default Join Model is Left Outer Join, and we also have Inner Join as another Join Model. If we need the result of a full outer join, we can follow the below process.

    I have Customers.csv and City.csv as my two delimited files, where City is the common field in both files.

    First, please find the data in the two csv files as below:

    Customers.csv:

    customers

    City.csv:

    city

    Following is the Talend Job :

    job

    First create a FileDelimited Metadata in the repository as shown in above picture.

    Consider Customers as the main link and City as the lookup, and join both in a tMap component. In tMap you have to select the Join Model as Inner Join, catch the inner join output reject records, and configure as shown in the below screenshot:

    tmap1

    Now, as shown in the job image, again take the same City and Customers inputs, but make City the main link and Customers the lookup.

    In tMap2, select the Join Model as Left Outer Join and configure the component as shown below:

    tmap2

    Now take a tUnite component to combine the inner join reject records from tMap1 and the left outer join output from tMap2. Then take a tSortRow to sort the result in asc/desc order (I used the custid column), and store the final result in the fulljoin.csv file.
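The composition above (inner-join rejects from tMap1, unioned with the city-driven left outer join from tMap2) can be sketched in plain Java. The column names and sample values are assumptions for illustration, not the contents of the actual CSV files.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FullOuterJoinSketch {

    // Emulates the two-tMap-plus-tUnite pattern.
    // customers: custId -> city; cities: city -> region.
    static List<String> fullOuterJoin(Map<String, String> customers, Map<String, String> cities) {
        List<String> out = new ArrayList<>();
        // tMap1: Customers as main, Inner Join — keep only the rejects (no city match)
        customers.forEach((custId, city) -> {
            if (!cities.containsKey(city)) {
                out.add(custId + "," + city + ",null");
            }
        });
        // tMap2: City as main, Left Outer Join to Customers
        cities.forEach((city, region) -> {
            boolean matched = false;
            for (Map.Entry<String, String> c : customers.entrySet()) {
                if (c.getValue().equals(city)) {
                    out.add(c.getKey() + "," + city + "," + region);
                    matched = true;
                }
            }
            if (!matched) {
                out.add("null," + city + "," + region); // city with no customer
            }
        });
        return out; // tUnite: union of both flows = full outer join
    }

    public static void main(String[] args) {
        Map<String, String> customers = new LinkedHashMap<>();
        customers.put("C1", "Pune");
        customers.put("C2", "Goa");      // no match in cities -> reject row
        Map<String, String> cities = new LinkedHashMap<>();
        cities.put("Pune", "MH");
        cities.put("Chennai", "TN");     // no customer -> left-outer null row
        fullOuterJoin(customers, cities).forEach(System.out::println);
    }
}
```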

    And the out put is as follows:

    fulljoin

    Thanks and Regards,

    Lalitha

    Creating Dynamic Drop Down List in Talend


    I created a sample job to retrieve the records from a particular table by selecting the table name from a drop-down list.

    First we have to create the tables and load the data into the database. We then retrieve the table names in a particular database as a drop-down list, so that we can select a table name at run time.

    Here is a point to note: whenever we use tJava components, they execute first, even in a sub-job. So I am using a tPrejob component to display the list of table names. Please find the screenshot of the entire job.

    main_job

    PreJob:

    In my database I have 3 tables, so I have to retrieve these table names as a list. To store these names we require a context variable, so create one as shown below:

    context

    I am using a MySQL database. To retrieve the table names I have taken a tMysqlInput component and configured the database name, port, username and password. In the query part, write the below query:

    "select Table_name from information_schema.`TABLES` where Table_schema='<databasename>'"

    table_names_qry

    Take a tJavaFlex component and connect it with tMysqlInput using a ‘row’ link. tJavaFlex executes its Main code in a loop, once for each record of its input flow.

    First we have to create an ArrayList-type variable in the Start code, then add each row (table name) from the input flow to this list in the Main code section. In the End code section, convert the list to an array and then create the drop-down list.
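The Start and Main code sections described above might look like the following sketch; the input row name row1 and its column Table_name are assumptions matching the query in the PreJob.

```java
// Start code: runs once, before the first input row
java.util.List<String> tables_list = new java.util.ArrayList<String>();

// Main code: runs once per input row, collecting each table name
tables_list.add(row1.Table_name);
```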

    javaflex

    End Code:

    String[] TableNames = new String[tables_list.size()];
    TableNames = tables_list.toArray(TableNames);
    JFrame frame = new JFrame("Input Dialog Example");
    String returnValue = (String) JOptionPane.showInputDialog(frame,
        "Select the table name you want to query?", "Tables Information",
        JOptionPane.QUESTION_MESSAGE, null, TableNames, TableNames[0]);
    context.tables = returnValue;
    if (context.tables != null)
        System.out.println("selected table: " + context.tables);
    // on clicking the cancel button in the drop-down box it will show this message
    else
        System.out.println("please select table name");

    In the Advanced settings of tJavaFlex, import the required Java libraries as shown below:

    tjavaflex_import

    Pre-job is completed.

    Main Job:

    After selecting a table name from the drop-down list, the sub-job condition matching that table becomes true, and the records are retrieved from the selected table.

    In case we do not select a table from the drop-down list and click the cancel button instead, the job will throw a NullPointerException. To avoid that, please enter the below code in all the tJava components in the sub-jobs.

    if (context.tables == null) {
        context.tables = "0";
    }

    Connect the tMysqlInput component to tJava using a RunIf condition, and write the below code in the RunIf condition:

    context.tables.contains("categories")

    Here categories is one of my table names. If we select this table in the drop-down, the condition becomes true and that sub-job executes; otherwise it stays false.

    Now configure the tMysqlInput component with the port, username, database name, password and table name. The selected table name is stored in the context variable, so you can use the context variable in place of the table name in the select query. Please find the below screenshot:

    sql_input

    And connect tLogRow to display the result on the console.

    Configure the remaining sub-jobs also in the same way.

    Now execute the job, it will display table names in the drop down list.

    dd1

    If you click on cancel, the console will display the message “please select table name”.

    cancel

    If we select any one of the table names from the list, that particular matched condition becomes true, that sub-job executes, and the records of that table are displayed on the console. Here I selected the Customer table from the drop-down, so it displays the records from the Customer table:

    selected table

    Result:

    result

    Types of CDC supported by Talend Enterprise Edition


     

    CDC (Change Data Capture):

    The CDC architecture is based on the publisher/subscriber model.

    The publisher captures the change data and makes it available to the subscribers. The subscribers utilize the change data obtained from the publisher.

    The main tasks performed by the publisher are:

    - Identifying the source tables from which the change data needs to be captured.

    - Capturing the change data and storing it in specially created change tables.

    Following are the two different types of CDC modes available in Talend Enterprise Edition.

    1. Trigger mode
    2. Redo/Archive log mode

    Trigger mode :

    In this mode we need to set up the CDC feature in Talend Studio so that it creates the required triggers on the selected database.

    Following are the two major steps required to establish the CDC environment on Talend Studio:

    1. Configure CDC in Trigger mode

    2. Extract the change data modified in Trigger mode

    Supported databases for CDC under Trigger mode :

    The CDC feature is available only for DB2, Informix, Ingres, MS SQL Server, MySQL, PostgreSQL, Oracle and Sybase.

    Talend provides a different CDC component for each supported database.
    Some of them are listed below:

    tMySQLCDC, tMSSQLCDC, tDB2CDC, tInformixCDC, tOracleCDC, tPostgreSQLCDC and tSybaseCDC

    Redo/Archive log mode :

    The Oracle Redo/Archive log mode used in Talend is equivalent to the archive log mode of Oracle.

    The CDC feature in this mode is available for Oracle database versions 10g and later. Before being able to use CDC in Redo/Archive log mode in Talend Studio, the administrator of the database to be supervised should do the following:

    1. Activate the archive log mode in the Oracle database.

    2. Set up CDC in the Oracle database.

    3. Create and give all rights to the source user.

    4. Create and give all rights to the publisher.

    However, if you want to use CDC in Redo/Archive log mode for Oracle, you must first configure the database so that it generates the redo records that hold all insert, update or delete changes made in the datafiles.

    Supported databases for CDC under Redo/Archive log mode :

    The CDC feature is available only for Oracle and AS400.

    The Talend components for this mode are tAS400CDC and tOracleCDC.

     

    buildjob

    Scheduling Talend Job on both Windows and Linux OS


    In order to schedule a Talend job, first we have to export it using the Build Job option, which generates the runnable files to execute on the required OS. All the child jobs in the project are also converted to jar files.

    Right click on the parent/main job of your project and select Build Job option.

    It will prompt a window; the build will create a zip file.

    You have to choose the path, and the build type should be Autonomous. Also enable the option Extract the zip file; it will unzip the above zip file in the same path that you have given.

    Enable Shell Launcher and Context scripts. For Shell Launcher, choose All as the value to generate runnable files for both Windows and Linux. It will generate two executable files: a batch file for Windows and a shell file for Linux.

     

    Schedule Talend Job on Windows OS:

    On Windows OS, we can schedule the Talend job by using Windows Task Scheduler.

    On the right side, under Actions, click on Create Basic Task. Then select when the task should run, as shown in the below image:

     

    You can also select exact time to run:

     

     

    Now select the ‘Start a Program’ option under Action.

     

     

    Now browse to the batch file; you can also add any context parameters if needed.

    startprogram

     

     

    Now click on Finish. Task will be scheduled for the selected time.

    finish

    Schedule Talend Job on Linux OS:

    On Linux OS we use crontab to schedule tasks

    Type the below command to open crontab file:

    crontab -e

    And add the job which needs to be scheduled, as shown in the below image:

    linux
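For example, a crontab entry to run the generated shell launcher every day at 02:00 might look like the following. The job path and log path are assumptions; adjust them to your own build location.

```shell
# minute hour day-of-month month day-of-week command
# Run the Talend shell launcher daily at 02:00 and append its output to a log
0 2 * * * /opt/talend/jobs/MainJob/MainJob_run.sh >> /var/log/talend/mainjob.log 2>&1
```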

    Hope this post will be useful to refer to when scheduling Talend jobs.