Loading a CSV File Batch-wise in Talend
In an ETL job whose source is a flat file, for example a large CSV file, the job would normally have to read the whole file at once, which puts a heavy load on the server running the job. A better option is to read the file in smaller batches and process those instead. The following job shows how this is done: a Java routine reads the CSV file with a BufferedReader and splits it into smaller chunk files, which are then processed from the folder that holds them.
Make data easy with Helical Insight.
Helical Insight is the world’s best open source business intelligence tool.
The following Java routine reads the file line by line and splits it according to a batch size, i.e. the number of lines per chunk.
package routines;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public class splitfile {

    private File file;
    private int numberOfLinesThreshold;
    private String encoding;
    private int readerBufferSize;
    private String destinationFileName;

    public splitfile(File file, int numberOfLinesThreshold, String encoding,
            int readerBufferSize, String destinationFileName) {
        this.file = file;
        this.numberOfLinesThreshold = numberOfLinesThreshold;
        this.encoding = encoding;
        this.readerBufferSize = readerBufferSize;
        this.destinationFileName = destinationFileName;
    }

    public void splitFile() {
        String absolutePath = this.file.getAbsolutePath();
        int index = absolutePath.lastIndexOf(".");
        String path = absolutePath.substring(0, index);
        String extension = absolutePath.substring(index + 1);
        // Store the header so it can be repeated in each chunk file
        String header = "";
        int fileNameCounter = 1;
        int counter = 0;
        StringBuilder stringBuilder = new StringBuilder();
        String currentLine;
        String lineSeparator = System.lineSeparator();
        String fileName = path + "." + extension;
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName),
                        this.encoding), this.readerBufferSize)) {
            while ((currentLine = bufferedReader.readLine()) != null) {
                if ("".equals(header)) {
                    header = currentLine;
                }
                counter++;
                if (counter == this.numberOfLinesThreshold) {
                    // Reset the counter for the next chunk
                    counter = 0;
                    stringBuilder.append(currentLine);
                    // Flush the accumulated lines to a chunk file
                    writeFile(fileNameCounter, stringBuilder,
                            this.destinationFileName, extension);
                    // For the next file name
                    fileNameCounter++;
                    // Start the next chunk with the header line
                    stringBuilder = new StringBuilder();
                    stringBuilder.append(header).append(lineSeparator);
                    // The header already counts as one line of the new chunk
                    counter++;
                } else {
                    stringBuilder.append(currentLine).append(lineSeparator);
                }
            }
            // Write the last few remaining lines as another file
            writeFile(fileNameCounter, stringBuilder, this.destinationFileName,
                    extension);
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }

    private void writeFile(int fileNameCounter, StringBuilder stringBuilder,
            String fileName, String extension) {
        // Create a chunk file and write out the accumulated lines;
        // try-with-resources closes the writer automatically
        String name = getFileName(fileNameCounter, fileName, extension);
        try (BufferedWriter bufferedWriter = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(name),
                        this.encoding))) {
            bufferedWriter.write(stringBuilder.toString());
            bufferedWriter.flush();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }

    private String getFileName(int fileNameCounter, String fileName,
            String extension) {
        return fileName + "_" + fileNameCounter + "." + extension;
    }
}
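The batching behaviour of the routine can be exercised in isolation. The following is a minimal, self-contained sketch of the same logic on a small generated CSV (nine data rows, batch size of four), using java.nio for brevity; the sample file name and sizes are illustrative only:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    public static void main(String[] args) throws IOException {
        // Build a small sample CSV: 1 header line + 9 data rows
        Path src = Files.createTempFile("orders", ".csv");
        List<String> lines = new ArrayList<>();
        lines.add("id,name");
        for (int i = 1; i <= 9; i++) lines.add(i + ",row" + i);
        Files.write(src, lines);

        int batchSize = 4;      // same role as numberOfLinesThreshold
        String header = null;
        List<String> chunk = new ArrayList<>();
        int fileCounter = 0;
        Path dir = Files.createTempDirectory("split");

        for (String line : Files.readAllLines(src)) {
            if (header == null) {           // first line is the header
                header = line;
                chunk.add(header);
                continue;
            }
            chunk.add(line);
            if (chunk.size() - 1 == batchSize) {   // header not counted
                Files.write(dir.resolve("orders_" + (++fileCounter) + ".csv"), chunk);
                chunk = new ArrayList<>();
                chunk.add(header);                 // repeat header in next chunk
            }
        }
        if (chunk.size() > 1) {                    // flush remaining rows
            Files.write(dir.resolve("orders_" + (++fileCounter) + ".csv"), chunk);
        }
        System.out.println(fileCounter + " chunk files");
    }
}
```

With nine data rows and a batch size of four, the routine produces two full chunks and one remainder chunk, each carrying its own copy of the header.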
ETL_Job
The following is the main job, which calls the child job inside it and passes the path containing the smaller chunk files.
Step 1: Create Split Files folder
Creates a folder to hold all the split files and writes the folder path to the context variable "Src_Split". The folder is created inside "Src_Dir" with the name Split_(File Name).
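The Talend component configuration for this step is not shown here, but the equivalent in plain Java is a one-liner. In the sketch below, srcDir and srcCsvOrg are hypothetical stand-ins for the job's context variables "Src_Dir" and "Src_Csv_Org":

```java
import java.io.File;

public class CreateSplitFolder {
    public static void main(String[] args) {
        // Hypothetical stand-ins for the job's context variables
        String srcDir = System.getProperty("java.io.tmpdir") + File.separator;
        String srcCsvOrg = "orders.csv";

        // Folder name: Split_(file name without extension), created inside Src_Dir
        String baseName = srcCsvOrg.substring(0, srcCsvOrg.lastIndexOf('.'));
        File splitFolder = new File(srcDir, "Split_" + baseName);
        splitFolder.mkdirs();               // no-op if it already exists

        // This path would then be written back to the context variable Src_Split
        System.out.println(splitFolder.exists() ? "folder ready" : "failed");
    }
}
```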
Step 2: Split File
Calls the routine "splitfile", which reads the file in parts with the help of a BufferedReader. It takes five parameters:
1. A File instance for the source file path, with an escape sequence applied:
new java.io.File(escape.escapeSeq(context.Src_Dir + context.Src_Csv_Org))
2. The batch size, assigned to the context variable "BatchSize".
3. The character encoding of the file, assigned to a context variable.
4. The buffer size, from the context variable "BufferSize".
5. The destination folder created in Step 1, followed by the file name, with an escape sequence applied to handle "\": "context.Src_Split + context.Src_Csv". (Src_Csv is just the name of the file with no extension.)
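Parameters 1 and 5 pass paths through escape.escapeSeq to protect backslashes in Windows-style paths. That routine's implementation is not shown in this article; a minimal hypothetical stand-in that simply doubles every backslash might look like this:

```java
public class EscapeDemo {
    // Hypothetical stand-in for the escape.escapeSeq routine used in the job:
    // doubles every backslash so Windows paths survive string interpolation.
    static String escapeSeq(String path) {
        return path.replace("\\", "\\\\");
    }

    public static void main(String[] args) {
        // Illustrative values for context.Src_Dir and context.Src_Csv_Org
        String srcDir = "C:\\data\\in\\";
        String srcCsvOrg = "orders.csv";
        System.out.println(escapeSeq(srcDir + srcCsvOrg));
    }
}
```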
Step 3: Split Folder
Give the directory as "context.Src_Split" and the file mask as "*.csv".
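The tFileList component with that mask iterates over every CSV chunk in the split folder. The same filtering can be sketched in plain Java with a FilenameFilter; the folder and file names below are illustrative only:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class FileMaskDemo {
    public static void main(String[] args) throws IOException {
        // Hypothetical split folder with mixed contents
        File dir = Files.createTempDirectory("Split_orders").toFile();
        new File(dir, "orders_1.csv").createNewFile();
        new File(dir, "orders_2.csv").createNewFile();
        new File(dir, "readme.txt").createNewFile();

        // Equivalent of tFileList with File Mask "*.csv"
        File[] chunks = dir.listFiles((d, name) -> name.endsWith(".csv"));
        System.out.println(chunks.length + " csv files");
    }
}
```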
Step 4: CSV2Stg_Stg_Dim_Organization
Calls the job which reads the chunk files and processes them to load Stg_Dim_Organization. Just set two context variables:
1. Src_Folder to ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))
2. Src_Csv to ((String)globalMap.get("tFileList_1_CURRENT_FILE"))
Step 5: CSV2Stg_Stg_Dim_Department
Calls the job which reads the chunk files and processes them to load Stg_Dim_Department. Just set two context variables:
1. Src_Folder to ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))
2. Src_Csv to ((String)globalMap.get("tFileList_1_CURRENT_FILE"))