Because Talend is a Java code generator, jobs and subjobs can run in multiple threads to reduce the runtime of a job.
There are several techniques for executing Talend jobs in parallel.
Make data easy with Helical Insight.
Helical Insight is the world’s best open source business intelligence tool.
What is parallelization:
Even when a job contains multiple subjobs that do not depend on each other, Talend by default executes them sequentially, i.e., it waits for one subjob to finish before starting the next. Depending on the number of subjobs, this can take a long time. In Talend jobs, the work can instead be split across multiple threads that execute in parallel, which can significantly reduce the runtime of the job.
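The difference between the two modes can be illustrated with plain Java threads. This is only a minimal sketch, not the code Talend generates: the class `SubjobThreads` and its helper methods are hypothetical names, and each "subjob" is simulated with a sleep.

```java
import java.util.ArrayList;
import java.util.List;

public class SubjobThreads {
    // Hypothetical stand-in for an independent subjob: sleeps to simulate work.
    static Runnable subjob(String name, long millis) {
        return () -> {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            System.out.println("subjob " + name + " finished");
        };
    }

    // Runs the subjobs one after another; returns elapsed milliseconds.
    static long runSequentially(List<Runnable> subjobs) {
        long start = System.nanoTime();
        for (Runnable r : subjobs) {
            r.run();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Starts one thread per subjob and waits for all of them; returns elapsed milliseconds.
    static long runInParallel(List<Runnable> subjobs) {
        long start = System.nanoTime();
        List<Thread> threads = new ArrayList<>();
        for (Runnable r : subjobs) {
            Thread t = new Thread(r);
            t.start();
            threads.add(t);
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        List<Runnable> subjobs = List.of(subjob("A", 300), subjob("B", 300), subjob("C", 300));
        System.out.println("sequential ms: " + runSequentially(subjobs));
        System.out.println("parallel ms:   " + runInParallel(subjobs));
    }
}
```

With three independent subjobs of equal length, the sequential run takes roughly the sum of the three durations, while the parallel run takes roughly the duration of a single subjob.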
Parallelization can be achieved in three ways:
- Enable multi-thread execution.
- Use the tParallelize component. (The tParallelize component is only available in the Enterprise Edition of Talend.)
- Use parallel execution in an execution plan (TAC).
This blog discusses parallelization using the multi-thread execution option available in Talend Open Studio.
Enable multi-thread execution
This feature in Talend allows multiple jobs or subjobs to execute in parallel, provided they are not interdependent.
In the Job tab of the job settings, enable the “Multi-thread execution” option to execute the subjobs in parallel.
Below is a simple job that demonstrates how the job behaves with and without the “Multi-thread execution” option enabled.
The sample job has two subjobs. Each one captures the timestamp at which it starts executing and displays it using a tLogRow component. The first subjob captures the timestamp, waits for 3 seconds using a tSleep component, and then displays it. The second subjob captures the timestamp and displays it immediately.
- Without enabling the Multi-thread execution option:
After executing the job, the timestamps generated are:
The timestamp generated by the first subjob is 2018-10-10 13:06:47
The timestamp generated by the second subjob is 2018-10-10 13:06:50
Hence the second subjob started only after the first subjob had completed.
- After enabling the Multi-thread execution option:
After executing the job, the timestamps generated are:
The timestamp generated by the first subjob is 2018-10-10 13:16:54
The timestamp generated by the second subjob is 2018-10-10 13:16:54
Hence the second subjob executed in parallel with the first subjob.
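The behaviour of the sample job can be approximated in plain Java. The sketch below is an assumption-laden imitation, not Talend's generated code: `TimestampDemo`, `firstSubjob`, `secondSubjob`, and the `multiThread` flag are hypothetical names, `Thread.sleep` stands in for tSleep, and `System.out.println` stands in for tLogRow.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class TimestampDemo {
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Displays a captured timestamp (stand-in for tLogRow).
    static void display(String label, Instant ts) {
        String text = FMT.format(LocalDateTime.ofInstant(ts, ZoneId.systemDefault()));
        System.out.println(label + " subjob timestamp: " + text);
    }

    // First subjob: capture the timestamp, wait (stand-in for tSleep), then display it.
    static Instant firstSubjob(long sleepMillis) {
        Instant ts = Instant.now();
        try {
            Thread.sleep(sleepMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        display("first", ts);
        return ts;
    }

    // Second subjob: capture the timestamp and display it immediately.
    static Instant secondSubjob() {
        Instant ts = Instant.now();
        display("second", ts);
        return ts;
    }

    // Runs both subjobs and returns the absolute gap between their timestamps.
    static Duration run(boolean multiThread, long sleepMillis) {
        Instant[] stamps = new Instant[2];
        Thread t1 = new Thread(() -> stamps[0] = firstSubjob(sleepMillis));
        Thread t2 = new Thread(() -> stamps[1] = secondSubjob());
        try {
            t1.start();
            if (!multiThread) {
                t1.join(); // sequential: finish the first subjob before starting the second
            }
            t2.start();
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return Duration.between(stamps[0], stamps[1]).abs();
    }

    public static void main(String[] args) {
        System.out.println("gap without multi-thread: " + run(false, 3000).toMillis() + " ms");
        System.out.println("gap with multi-thread:    " + run(true, 3000).toMillis() + " ms");
    }
}
```

In the sequential run the gap between the two timestamps is about the sleep duration, matching the 3-second difference seen above. In the parallel run both timestamps are captured at nearly the same moment.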
NOTE:
- If you are going to run two subjobs in parallel, you need to consider the dependencies between them. There may also be subsequent subjobs that depend on the completion of both of these subjobs.
- This feature works best when the number of threads does not exceed the number of processors on the machine used for parallel execution. Otherwise, some of the subjobs have to wait until a processor is freed up.
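Both points in the note can be sketched in plain Java. The example below is illustrative only and does not reflect how Talend schedules subjobs internally: the class `DependentSubjobs` and the subjob names A, B, and C are hypothetical. It sizes a thread pool to the number of available processors and starts a dependent subjob C only after the independent subjobs A and B have both completed.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DependentSubjobs {
    // Runs A and B in parallel, then C, and returns the order in which they completed.
    static List<String> run() {
        // Cap the number of worker threads at the number of processors.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<String> log = new CopyOnWriteArrayList<>();
        try {
            // A and B are independent, so they may run at the same time.
            CompletableFuture<Void> a = CompletableFuture.runAsync(() -> log.add("A done"), pool);
            CompletableFuture<Void> b = CompletableFuture.runAsync(() -> log.add("B done"), pool);
            // C depends on both, so it runs only after A and B have completed.
            CompletableFuture.allOf(a, b).thenRun(() -> log.add("C done")).join();
        } finally {
            pool.shutdown();
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The order of "A done" and "B done" in the log may vary between runs, but "C done" always comes last.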
If you have any queries, please contact us at support@helicaltech.com
Thanks
Rajitha
Helical IT Solutions Pvt Ltd