ETL Checklist


This checklist covers the key things to know when starting an ETL project.

Database Connections: Metadata and the JAR files required for the source and target databases.

Source data: We may have multiple sources, or sometimes a single source. We need to check for redundant data, and we need to confirm the data types and data formats, and on what basis the data is being formatted.

Historical data: How does the customer want to maintain historical data? Are there flags for deleted/invalid records? When should SCD Type 2 versus SCD Type 3 changes be used?

Logging and Auditing: We need to log enough detail to track exactly where an error occurred, because a project will have many jobs and sub-jobs.

So we have to track how many records were successfully inserted/updated, how many records were rejected, and the job name, sub-job name, and component name (so that if any error occurs we can trace it easily).
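As an illustrative sketch (the field names here are hypothetical, not from any specific ETL tool), one audit record per job run could capture the items listed above:

```java
public class AuditLog {

    // One audit row per job run; the field set mirrors the checklist above.
    record AuditRecord(String jobName, String subJobName, String componentName,
                       int inserted, int updated, int rejected, String errorMessage) {}

    public static void main(String[] args) {
        AuditRecord run = new AuditRecord("load_customers", "stage_to_dwh",
                "tMap_1", 980, 15, 5, null);
        // inserted + updated + rejected should equal the records read:
        // a cheap consistency check on every run
        System.out.println(run.inserted() + run.updated() + run.rejected()); // 1000
    }
}
```

Persisting such a record for every job and sub-job makes it possible to answer "which component failed, and how far did we get" without re-running anything.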

Rollback and Restarting Jobs: In an ETL process, the thing most affected by failures is always the target database. So we need to keep checkpoints in the job; if an error occurs and the job stops, then on restart the job should resume processing from the last saved checkpoint. There is then no need to start the entire process again, which saves time and improves the performance of the project.
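A minimal sketch of the checkpoint idea, assuming the checkpoint is simply a file holding the index of the last completed step (real ETL tools have their own checkpoint mechanisms):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CheckpointedJob {

    // The checkpoint file stores the index of the last step that finished,
    // so a restarted job can skip straight past it.
    static int lastCheckpoint(Path checkpointFile) throws IOException {
        if (!Files.exists(checkpointFile)) return -1;
        return Integer.parseInt(Files.readString(checkpointFile).trim());
    }

    static void run(List<Runnable> steps, Path checkpointFile) throws IOException {
        int done = lastCheckpoint(checkpointFile);
        for (int i = done + 1; i < steps.size(); i++) {
            steps.get(i).run();
            Files.writeString(checkpointFile, String.valueOf(i)); // save progress
        }
    }

    public static void main(String[] args) throws IOException {
        Path cp = Files.createTempFile("etl", ".ckpt");
        Files.writeString(cp, "0"); // pretend step 0 completed before a crash
        StringBuilder log = new StringBuilder();
        run(List.of(() -> log.append("extract "),
                    () -> log.append("transform "),
                    () -> log.append("load")), cp);
        System.out.println(log); // extract was skipped on restart
    }
}
```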

Batch Loading: When we have huge data volumes, it is always better to load the data in multiple chunks rather than loading the entire data set at once. So batch loading should be implemented.
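A small sketch of the chunking step, independent of any particular loader: split the record set into fixed-size batches so each batch can be loaded and committed on its own:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchLoader {

    // Split the full record set into fixed-size chunks so each
    // chunk can be loaded and committed independently.
    public static <T> List<List<T>> chunk(List<T> records, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < records.size(); i += batchSize) {
            batches.add(records.subList(i, Math.min(i + batchSize, records.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> records = List.of(1, 2, 3, 4, 5, 6, 7);
        // 7 records with a batch size of 3 -> [1,2,3], [4,5,6], [7]
        System.out.println(chunk(records, 3).size()); // 3
    }
}
```

In a real job each chunk would go through a prepared statement with a commit per batch, which also pairs naturally with the checkpointing discussed above.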

Dependencies: In the target database, we need to check for dependencies such as lookup tables and parent tables connected by foreign key relationships. Do the related columns have matching data types? The parent tables need to be loaded first so that the foreign key relationships are not violated and are easy to handle.

Alerting/Notifications: We have to implement an alerting/notification mechanism. If any error occurs, it should trigger a mail including the error description, job name, component name, etc.

Documentation: This is also an important task. The mappings need to be documented in some format; these documents are useful as a reference guide.

Thank you



Types of Facts in Data warehouse


What is Fact Table?

A fact table consists of the measures/metrics/facts of a business process. Typically it contains two types of data: foreign keys that hold the primary keys of dimension tables, and measures such as sales, revenue, and units of product.


Measures Types:

Additive: Measures which can be added across all the dimensions are known as additive measures.


In the above table, you can add the sales amount by Product, by Customer, and also by Date.
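A small sketch with hypothetical figures: because a sales measure is additive, the grand total is the same no matter which dimension you roll it up by first:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AdditiveMeasure {

    // One fact row: product and date dimensions plus a sales measure.
    record SalesFact(String product, String date, double sales) {}

    static final List<SalesFact> FACTS = List.of(
            new SalesFact("Pen", "20150901", 100.0),
            new SalesFact("Pen", "20150902", 150.0),
            new SalesFact("Book", "20150901", 200.0));

    // Roll the sales measure up by one dimension (here: product).
    static Map<String, Double> totalByProduct() {
        return FACTS.stream().collect(
                Collectors.groupingBy(SalesFact::product,
                        Collectors.summingDouble(SalesFact::sales)));
    }

    // Summing the per-product subtotals gives the same grand total
    // as summing the raw rows: that is what "additive" means.
    static double grandTotal() {
        return FACTS.stream().mapToDouble(SalesFact::sales).sum();
    }

    public static void main(String[] args) {
        System.out.println(totalByProduct()); // Pen=250.0, Book=200.0
        System.out.println(grandTotal());     // 450.0
    }
}
```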

Semi-Additive: Measures which can be added across only some of the dimensions are known as semi-additive measures.


Department-wise we can calculate the total number of employees in the organization. But across the Date dimension you cannot say the total number of employees in the Sales department is 29+35=64; that would be wrong data. You can, however, calculate the total number of employees on 20150901 as 35+150=185. So the measure can be aggregated only across some of the dimensions.

Non-Additive: Measures which cannot be added across any dimension are known as non-additive measures.


We cannot total the profit margin across any of the dimensions shown in the above table.

Factless fact: This is not really another measure type. In practice it can happen that a fact table does not store any measure at all; it only records that a set of events occurred. Such a table is known as a factless fact table.


In the above table there is no fact; all three columns are derived from dimensions. There are no measures in this scenario, but we can still retrieve information such as how many students attended the Java course at a particular time.
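A sketch of the same idea in code, with hypothetical rows: even without any measure columns, counting rows answers the question:

```java
import java.util.List;

public class FactlessFact {

    // A factless fact row: only dimension keys, no measure columns.
    record Attendance(String studentId, String courseId, String dateId) {}

    static final List<Attendance> FACTS = List.of(
            new Attendance("S1", "JAVA", "20150901"),
            new Attendance("S2", "JAVA", "20150901"),
            new Attendance("S3", "SQL",  "20150901"));

    // "How many students attended the Java course on 20150901?"
    // is just a row count over the event records.
    static long attendees(String courseId, String dateId) {
        return FACTS.stream()
                .filter(a -> a.courseId().equals(courseId) && a.dateId().equals(dateId))
                .count();
    }

    public static void main(String[] args) {
        System.out.println(attendees("JAVA", "20150901")); // 2
    }
}
```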

Fact Types:

1) Transactional: The most common type of fact table; each record is a transaction event which involves most of the dimensions. It therefore contains robust data, which enables maximum slicing and dicing of the data.

2)Periodic snapshots: The measurements occurring over a standard period, such as a day, a week, or a month. These fact tables are uniformly dense in their foreign keys because even if no activity takes place during the period, a row is typically inserted in the fact table containing a zero or null for each fact.

3)Accumulating snapshots : The measurement events occurring at predictable steps between the beginning and the end of a process. e.g., the processing of an order. An order moves through specific steps until it is fully processed. As steps towards fulfilling the order are completed, the associated row in the fact table is updated. An accumulating snapshot table often has multiple date columns, each representing a milestone in the process.

Thank You


Types of Dimensions in Data warehouse


What is Dimension?

A dimension table contains descriptive data about the business. The primary keys of dimension tables are referenced by fact tables through foreign key relationships, and the remaining columns of the dimension hold descriptive information about the objects related to the business.

Eg: Product, Customer, Orders, Company, Date, etc.

Below are the different types of Dimensions:

1) Slowly Changing Dimensions (SCD): Dimensions that change slowly over time rather than on a regular schedule.

Below are some popular approaches of SCDs:

Type 0: This is the passive method. Records, once inserted, are never changed.

Type 1: This method overwrites the old record with the new values without tracking any historical data.


Type 2: This method tracks historical data using a version number or start-date/end-date columns. The drawback is that the table grows vertically and requires more space.
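A minimal sketch of a Type 2 change using start/end date columns (the table and column names here are hypothetical): the current row is expired and a new row is inserted, which is why the table grows vertically:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class ScdType2 {

    // One dimension row with validity dates; endDate == null means "current".
    static class CustomerRow {
        final String customerId;
        final String city;
        final LocalDate startDate;
        LocalDate endDate;

        CustomerRow(String customerId, String city, LocalDate startDate) {
            this.customerId = customerId;
            this.city = city;
            this.startDate = startDate;
        }
    }

    // Type 2 change: close the current row and insert a new one,
    // preserving the full history of the attribute.
    static void applyChange(List<CustomerRow> table, String customerId,
                            String newCity, LocalDate changeDate) {
        for (CustomerRow row : table) {
            if (row.customerId.equals(customerId) && row.endDate == null) {
                row.endDate = changeDate;  // expire the old version
            }
        }
        table.add(new CustomerRow(customerId, newCity, changeDate));
    }

    public static void main(String[] args) {
        List<CustomerRow> dim = new ArrayList<>();
        dim.add(new CustomerRow("C1", "Hyderabad", LocalDate.of(2015, 1, 1)));
        applyChange(dim, "C1", "Pune", LocalDate.of(2015, 9, 1));
        System.out.println(dim.size()); // 2: expired Hyderabad row + current Pune row
    }
}
```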


Type 3: This method also tracks historical data, by adding a new column that holds the previous value. It preserves only limited history.


Type 4: This method uses a separate history table to store all the historical data while the main table holds only the current data. Both are linked by the same surrogate key.


Type 6: This is called the hybrid method, combining Types 1 + 2 + 3. It uses flag values and effective dates to differentiate between old and new records.

2) Rapidly Changing Dimensions: Dimensions which contain rapidly changing attributes. If we maintain historical data for these tables directly, we will certainly run into memory and performance issues. The solution is to maintain mini-dimension tables for the historical data, like the Type 4 approach in SCD: the main table holds the current values and the mini-dimensions hold the historical data.

3) Junk Dimensions: In data warehouse design we often come across flag and indicator values. We can combine them into one single table so that the fact table does not need multiple columns to store the primary keys of separate flag tables.


As per the above, if we use a combined Gender_Marital_Status dimension, we need only one single column in the fact table.
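A sketch of pre-populating such a junk dimension (the names are hypothetical): every combination of the low-cardinality flags gets one surrogate key, and the fact table then carries only that key:

```java
import java.util.ArrayList;
import java.util.List;

public class JunkDimension {

    // One row of the junk dimension: a surrogate key plus the flag combination.
    record JunkRow(int key, String gender, String maritalStatus) {}

    // Pre-populate every combination of the low-cardinality flags;
    // the fact table then stores a single surrogate key instead of
    // one foreign key column per flag.
    static List<JunkRow> build() {
        List<JunkRow> rows = new ArrayList<>();
        int key = 1;
        for (String gender : List.of("M", "F")) {
            for (String marital : List.of("Single", "Married")) {
                rows.add(new JunkRow(key++, gender, marital));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // 2 genders x 2 marital statuses -> 4 junk rows
        System.out.println(build().size()); // 4
    }
}
```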

4) Inferred Dimensions: When a dimension row that a fact table needs is not yet available, we can insert a dummy record with some placeholder details for that ID and use the ID in the fact table. Once the real details arrive, we update the dimension.

5) Conformed Dimensions: Dimensions which are shared by multiple fact tables are known as conformed dimensions.

Eg: The Customer and Product dimensions are required by the Shipment fact, Sales fact, and Service Request fact.

6) Degenerate Dimensions: A degenerate dimension does not have its own dimension table; it lives in the fact table. It is a column (dimension) that is part of the fact table but does not map to any dimension table.

Eg: Invoice_Number column in Fact table


7) Role Playing Dimensions: The same dimension used for multiple purposes.

Eg: Date Dimension can be used as Date of Sale, Date of Hire etc.

8) Shrunken Dimensions: A shrunken dimension is a subset of another dimension. For example, the Orders fact table may include a foreign key for Product, but the Target fact table may include a foreign key only for ProductCategory, which is in the Product table, but much less granular. Creating a smaller dimension table, with ProductCategory as its primary key, is one way of dealing with this situation of heterogeneous grain. If the Product dimension is snowflaked, there is probably already a separate table for ProductCategory, which can serve as the Shrunken Dimension

9) Static Dimensions: Static dimensions are not extracted from the original data source, but are created within the context of the data warehouse. A static dimension can be loaded manually — for example with Status codes — or it can be generated by a procedure, such as a Date or Time dimension.

Thank You


Processing Multi schema files in Talend


Sometimes we get input files that contain data in different schema formats within a single file. We can process these files to extract the data schema-wise using the tFileInputMSDelimited component.

Raw Data:

Here is a sample raw data file in which 3 different schema types exist.


If you observe the first column, it represents the type of that row's data:

ORD means it is Order details

CUS means it is Customer Details

and TRS means it is Transactions.

All these 3 types exist in the same file, but our requirement is to have each of these 3 types of data in a separate file.


Create the below job to get separate files/outputs for each schema type.

In the palette you will find the tFileInputMSDelimited component; drag it onto the workspace.

Now open component tab to configure it as shown below:


Click on the Multi Schema Editor and configure it as shown in the screenshot below:


If the column separators are different for each schema you can enable Use Multiple Separators option.

After getting the preview, click on Fetch Codes.


Now take 3 output components, whichever you need. I selected tLogRow components to display the results on the console.

Execute the job:


Now check the output; you will get three different tables based on the schema type.



Thanks and Regards,


Why build a Data warehouse?


Often when we talk about implementing BI, the first thing we need to look at is how and where the data is. Is it scattered? Is it easy to report on? From this we often need to decide whether to build a reporting database, a data mart, or a full data warehouse. We all know WHAT a data warehouse essentially is; here we are going to discuss the WHY.

Consolidate/Integrate Data:

Usually the data in an enterprise is scattered across the various applications and data sources that form the enterprise's ecosystem. To aid better insights and business decisions, business users need data from all the various subject areas to be accessible. A DWH helps us to consolidate and integrate data from various sources and store it in a single data model.


Cleanse data & Enhance data Quality:

Building a DWH includes converting data from different sources into common formats, using common keys, structures, and decimal/time formats. We can restructure the data and rename tables and fields so it makes more sense to the users. We can also use master data management to consolidate common data from various sources and build relationships.


Store Historical Data:

A DWH is usually used to store the huge amounts of historical data that transactional systems do not or cannot retain. This supports analytics across different time periods and trend analysis.


Speed up Data Retrieval:

In a DWH, the data is structured in a way that aids faster retrieval. The structure is designed for faster data access, drill down, filtering, and summarization. Transactional systems are usually optimized for small transactions that insert/update data in specific tables and access small amounts of data. However, BI requires analyzing and aggregating large amounts of data quickly; a DWH stores data to support such retrievals.


Unburden transactional Systems:

Executing resource-intensive BI queries that access huge amounts of data on transactional systems may affect those systems; it may even bring them down, affecting the business. Building a DWH separate from the transactional systems solves this problem.


Prepare / Pre-aggregate data:

Depending on the business needs, data can be pre-aggregated at various levels and stored in the DWH. This can save time and improve performance of the BI reports. The KPIs that help the business users can also be pre-calculated and stored. Users can then analyse these indicators across dimensions to gain knowledge and insights.


Support Self Service BI:

The DWH either uses a dimensional model or easily feeds dimensional models, which can be used to build analytical cubes. These cubes can be used by business users for ad-hoc reporting, to slice-and-dice the data, and to build their own reports.


Increased findability for Business users:

In a DWH, the data is stored in ways that are more meaningful to business users, and the data relationships can be easily established. The table and field names are business-user friendly. This makes it easier for users to find the data they are looking for.


Helps Data mining / trend analysis / Predictive Analytics:

Once a DWH is created, it becomes easier to analyse huge amounts of data and associations to uncover hidden patterns using data mining methodologies. The historical data in the DWH can be analysed across time periods to discover trends and make business predictions.
These are some major reasons for, and benefits of, building a DWH. Though building a DWH is a time-consuming and complex process, it solves many problems and delivers effective BI.

Shraddha Tambe | Helical IT Solutions

Exception in thread “main” java.lang.OutOfMemoryError: GC overhead limit exceeded

Different causes for OOM

Everyone in Java development faces java.lang.OutOfMemoryError now and then. An OutOfMemoryError in Java is a problem which may be due to:

  • Improper memory management by the Java code
  • Insufficient hardware memory available to process the code
  • Programming mistakes
  • Memory leak

Types of OOM

Any of the above reasons can result in an OutOfMemoryError. The following three types of OOM errors are common:

  1. java.lang.OutOfMemoryError: Java heap space
  2. java.lang.OutOfMemoryError: PermGen space
  3. java.lang.OutOfMemoryError: GC overhead limit exceeded

OOM in Talend

Of the above, I observed the last error during a flat-file data migration to a database using Talend Open Studio. File sizes were in the range of 10MB-500MB. Initially the job worked well, but when started with larger files of around 100MB, this error popped up. The hardware memory (RAM) available was 16GB, but the default memory allocated by Talend was Xmx1024m (1GB). It was possible to increase Xmx to 10240m, which might have solved the issue, but this "GC overhead limit exceeded" error was related to garbage collection. After researching the issue, I came across a very descriptive article on garbage collection at


Following workaround solved the problem in Talend without increasing the memory limit to higher figures.

Add a new option under Window–>Preferences–>Talend–>Run/Debug: -XX:-UseGCOverheadLimit

Installation of Pentaho 5.0.1 CE with Postgresql db (in Windows 7 / Ubuntu / AWS)

This blog will elaborate about Installation of Pentaho 5.0.1 CE with Postgresql db (in Windows 7 / Ubuntu / AWS)

1. PuTTY configuration
Step-1: Go to the Start menu and click PuTTY, or find PuTTY on the Desktop.
Step-2: Provide the IP address or hostname of the client machine. In the image below, the hostname goes in the red rectangle area.
Putty configuration

Step-3: After that, go to the Connection tab > Data and provide the username of the client machine. You can see and follow the steps in the image below.

Putty configuration 2

Step-4: After that, browse and upload the .ppk file. The image below shows how to get there; provide the .ppk file and click the Open button.
Putty configuration 3

2. Java Configuration
PART I: Java (OpenJDK) Installation

Installation of Java (OpenJDK) on the Ubuntu server.
NOTE: You must have sudo (superuser do) permission before you install Java.
Command: sudo su
On the next line, enter the password that you used to log in using PuTTY.

* Install OpenJDK and the JRE using the commands below.
1) Java installation command:
apt-get install openjdk-7-jdk

2) Check whether Java is installed:
which java

3) Check which version is installed:
java -version
java version "1.7.0_51"
OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.04.2)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)

You need NOT set JAVA_HOME if you see the above message (when you issue java -version); on an Ubuntu server, the Java installation sets JAVA_HOME for you.
This may differ between Ubuntu versions.

By default, OpenJDK installs in the location below:

NOTE-3 (Optional – set JAVA_HOME)

Open the .bashrc file using an editor. If you use vi:
vi ~/.bashrc
and add the following lines to your .bashrc file (pointing JAVA_HOME at your JDK install directory):
export JAVA_HOME
3. Pentaho BI Server Configuration
PART II: Pentaho 5.0.1 CE installation
1) Create a folder under home:
mkdir pentaho
2) Download Pentaho
Issue the below command to download the Pentaho zip file from the SourceForge site.
wget Intelligence Server/5.0.1-stable/

3) Unzip the file

i) Once the download completes you will have the "" zip file under pentaho.
ii) Unzip the file using the below command.
NOTE: If unzip is not installed on the Ubuntu server, issue the below command to install the unzip package:
sudo apt-get install unzip
4) biserver-ce folder
i) Once unzipping completes you will have a folder biserver-ce.
ii) Do the following:
root@———–:/opt/pentaho# cd biserver-ce
root@———-:/opt/pentaho/biserver-ce# ls
data       pentaho-solutions  set-pentaho-env.bat  start-pentaho.bat  stop-pentaho.bat  third-party-tools
import-export.bat  licenses          promptuser.js   start-pentaho-debug.bat   tomcat
NOTE: You need to give the .sh files permission to run.
Issue below command: chmod -R 777

5) Start the bi server using below command

[root@ip-10-80-147-44 biserver-ce]# ./
DEBUG: _PENTAHO_JAVA_HOME=/usr/lib/jvm/jre
DEBUG: _PENTAHO_JAVA==/usr/lib/jvm/jre/bin/java
Using CATALINA_BASE:   /opt/pentaho/biserver-ce/tomcat
Using CATALINA_HOME:   /opt/pentaho/biserver-ce/tomcat
Using CATALINA_TMPDIR: /opt/pentaho/biserver-ce/tomcat/temp
Using JRE_HOME:        /usr/lib/jvm/jre
Using CLASSPATH:       /opt/biserver-ce/tomcat/bin/bootstrap.jar

NOTE: stop the server by issuing this command
root@ —— :/opt/pentaho/biserver-ce# ./
root@ONE-DMZ-SRV161:/home/pentaho/biserver-ce# ./
Using CATALINA_BASE:   /opt/pentaho/biserver-ce/tomcat
Using CATALINA_HOME:   /opt/pentaho/biserver-ce/tomcat
Using CATALINA_TMPDIR: /opt/pentaho/biserver-ce/tomcat/temp
Using JRE_HOME:        /usr/lib/jvm/jre
Using CLASSPATH:       /opt/biserver-ce/tomcat/bin/bootstrap.jar

6) Go to the web browser and give the below URL.
Example :

NOTE: Pentaho by default uses port number 8080.
Note: The process below applies to a Windows machine, not a Linux machine.
If you want to change the default port number, open D:\Software\biserver-ce-5.0.1-stable\biserver-ce\tomcat\conf\server.xml, find port number 8080, replace it according to your requirement, save the file, and exit.
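For reference, the port is defined on the HTTP Connector element in tomcat/conf/server.xml; a typical entry looks like this (the non-port attributes shown are the usual Tomcat defaults):

```xml
<!-- Change port="8080" to the desired port, e.g. port="9090" -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443" />
```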

4.Pentaho Home Page
Credentials you need to supply once you get the home page are:
User name : admin
Password : password


7) After that, click Home, go to the Marketplace, and install plugins according to your requirements, such as Community Dashboard Editor (CDE), Community Dashboard DataAccess (CDA), Community Dashboard Framework (CDF), Saiku Analytics, etc.
8) Then stop the BI server and restart it.

Rohit Verma

Helical IT Solutions

D3 Calendar View Chart in Jaspersoft / iReport

This blog will talk about what is calendar view chart, its usage and about how to create and integrate a D3 Calendar View Chart in Jaspersoft / iReport

All the reports are developed using iReport 5.5 Professional and JasperReports Server 5.5.

As the HTML component of JasperReports Server does not load any scripts, we loaded the script in one of the decorator pages (a JSP page). The page is located at:


* Note: The above problem is specific with Jaspersoft 5.5 and not with the previous versions.

In the page we included the scripts which we want to load. We added the following code in the jsp page at line no 46:

<script type="text/javascript" language="JavaScript" src="${pageContext.request.contextPath}/scripts/d3.v3.min.js"></script>

The script to be added should be kept at location:


In this blog we will be discussing two d3 charts I developed in jaspersoft.


Calendar View:

A calendar view is basically used to show the action or value of a parameter across an entire year. There are 12 blocks representing the months, and every month has sub-blocks standing for the dates in that month. A color gradient (which is also configurable) is applied to the dates, depicting the intensity of the parameter across the entire year.

A calendar view graph in D3 presents information using two sets of SVG elements: (1) a set of dates and (2) a value associated with each date. The color of each date element can be set according to the value on that particular date.



Integration with JasperServer:

The data used for the calendar view can be fetched from any database. The data fetched from the database is stored in a variable, which is then accessed in the HTML component. This approach makes the report dynamic instead of static. Parameters can also be added to the report and used in the query and/or the HTML component.


Generally for these types of charts we pass a variable which contains the required data: a date and a value associated with it. The string is built in JSON format so that, when accessed in the script tag, it can easily be converted to a JSON object using the eval function.


Any variable/parameter can be accessed as shown below:

"<script> var arr =" + $V{variable1} + " </script>"

Parameter in query:

Select * from table_name

where date between $P{parameter1}  and $P{parameter2}


The sample code for the static chart can be found at:

The steps for integrating it with JasperReports Server were discussed in my previous blog (D3 Integration with Jasperserver).

Create Organization using REST Webservice in Jasperserver / Jaspersoft

This blog will talk about how to create organizations using REST webservice in JasperServer.

STEP 1:-

Put the following JAR files in the lib folder:










STEP 2:-

Create a JSP page:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Insert title here</title>

<form action="TestWebService">
<input type="submit" value="submit">
</form>

STEP 3:-

Create the client:


package com.helical.restwebservices;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.List;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.DefaultHttpClient;

public class TestWebService extends HttpServlet {

    private static final long serialVersionUID = 1L;

    public void doGet(HttpServletRequest req, HttpServletResponse res) {
        String jSessionId;
        jSessionId = jWSLogin("", "test", null);
        addOrganization(jSessionId, res);
    }

    public String jWSLogin(String username, String password, String organization) {
        if (username == null || password == null) {
            throw new RuntimeException("Failed: Username or Password can't be null");
        }
        // For a multi-tenant login, the username is suffixed with "%7C"
        // (URL-encoded pipe) followed by the organization id.
        String j_uname = (organization == null || organization.trim().length() == 0)
                ? username : username + "%7C" + organization;
        String j_password = password;
        String jSessionIdCookie = "";
        try {
            // Login URL was omitted in the original post; put your
            // JasperReports Server login endpoint here.
            URL url = new URL("");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            String input = "j_username=" + j_uname + "&j_password=" + j_password;
            OutputStream os = conn.getOutputStream();
            os.write(input.getBytes());
            os.flush();
            if (conn.getResponseCode() != 200) {
                throw new RuntimeException("Failed : HTTP error code : " + conn.getResponseCode());
            }
            BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            // The JSESSIONID cookie identifies the authenticated session.
            List<String> cookies = conn.getHeaderFields().get("Set-Cookie");
            for (String string : cookies) {
                jSessionIdCookie = string;
            }
            conn.disconnect();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return jSessionIdCookie;
    }

    public void addOrganization(String jSessionId, HttpServletResponse res) {
        HttpClient httpClient = new DefaultHttpClient();
        try {
            // Organization URL was omitted in the original post; put your
            // JasperReports Server organizations endpoint here.
            HttpPost request = new HttpPost("");
            request.setHeader("Cookie", jSessionId);
            StringEntity params = new StringEntity("{\"id\":\"helicaltest123\",\"alias\":\"helicaltest123\",\"parentId\":\"organizations\",\"tenantName\":\"helicaltest123\",\"tenantDesc\":\"Audit Department of Finance\",\"theme\":\"default\"}");
            request.addHeader("content-type", "application/json");
            request.setEntity(params);
            HttpResponse response = httpClient.execute(request);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            httpClient.getConnectionManager().shutdown();
        }
    }
}

We have created two methods in TestWebService:

  1. jWSLogin()
  2. addOrganization()

jWSLogin() is used to authenticate and returns the session, so that we can reuse that session for other tasks.

addOrganization() is used to add an organization. The data we want to insert should be in JSON or XML format; in this example I have used JSON.


STEP 4:-

Create web.xml:

<?xml version="1.0" encoding="UTF-8"?>

<web-app xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://java.sun.com/xml/ns/javaee" xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd" id="WebApp_ID" version="2.5">

<servlet>
<servlet-name>TestWebService</servlet-name>
<servlet-class>com.helical.restwebservices.TestWebService</servlet-class>
</servlet>

<servlet-mapping>
<servlet-name>TestWebService</servlet-name>
<url-pattern>/TestWebService</url-pattern>
</servlet-mapping>

</web-app>




For any other help on Jasper Server, please get in touch with us

Helical IT Solutions