Memory Group By Vs Group By in Pentaho DI
Group By and Memory Group By: Both steps calculate aggregate values over groups of rows, where each group is defined by one or more fields.
Even though both serve the same purpose, they differ in some respects and behave identically in others:
1) Group By works correctly only when the input rows are sorted on the group fields. On unsorted input it aggregates only consecutive rows that share the same key, so the same group can appear several times in the output. The rows should therefore be sorted (for example with a Sort rows step) before they reach the Group By step. Memory Group By, in contrast, also handles unsorted rows: it keeps the groups in memory while it reads the input and then outputs the result for the selected aggregate option.
2) Group By has an Include all rows option. If you enable it, the output contains all the input rows, not just the aggregated rows. To tell the two types of rows apart in the output, you have to provide a Boolean flag field. Memory Group By does not have this option.
3) Group By has a Temporary files directory option, which becomes available when Include all rows is enabled. It selects the directory in which temporary files are stored (needed when the Include all rows option is enabled and the number of grouped rows exceeds 5000); the default is the standard temporary directory of the system. Because this option lets the step store data on disk, Group By is the preferred choice for large amounts of data. Memory Group By cannot handle huge amounts of data, because it processes the data in memory, which causes performance problems. It works well only for small amounts of data. So when the number of rows is too large to fit into memory, use the combination of the Sort rows and Group By steps.
4) Group By has one more option that Memory Group By lacks: Add line number, restart in each group. Enable it to add a line number that restarts at 1 in each group.
5) Always give back a result row: This option is the same in both the Group By and Memory Group By steps. If you enable it, the step always gives back a result row, even if there is no input row. This is useful if you want to count the number of rows: without this option you would never get a count of zero (0).
6) The fields that make up the group: This option also works the same in both steps. After retrieving the fields with the Get Fields button, designate the fields to include in the group.
7) Aggregates: Group By and Memory Group By offer the same types of aggregate functions. Specify the fields that must be aggregated, the method, and the name of the resulting new field.
Here are the available aggregation methods:
- Average (Mean)
- Number of values (N)
- Concatenate strings separated by , (comma)
- First non-null value
- Last non-null value
- First value (including null)
- Last value (including null)
- Cumulative sum (all rows option only!)
- Cumulative average (all rows option only!)
- Standard deviation
- Concatenate strings separated by <Value>: specify the separator in the Value column
- Number of distinct values
- Number of rows (without field argument)
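The difference in point 1 and the zero-count behaviour in point 5 can be illustrated with a small Python analogy. This is only a sketch of the two grouping strategies, not how Pentaho implements the steps; the function names (`sequential_group_sum`, `memory_group_sum`, `count_rows`) and the sample rows are hypothetical.

```python
from itertools import groupby


def sequential_group_sum(rows):
    """Analogy for Group By: a new group starts whenever the key changes,
    so the result is only correct if rows are sorted on the key."""
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(rows, key=lambda row: row[0])
    ]


def memory_group_sum(rows):
    """Analogy for Memory Group By: every group is kept in memory,
    so the order of the input rows does not matter."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return list(totals.items())


def count_rows(rows, always_give_back_row=False):
    """Analogy for 'Always give back a result row': without the flag,
    empty input produces no output row, so a count of 0 never appears."""
    count = sum(1 for _ in rows)
    if count == 0 and not always_give_back_row:
        return []
    return [count]


# Hypothetical sample data: (group key, value), deliberately unsorted.
rows = [("A", 1), ("B", 2), ("A", 3)]

unsorted_result = sequential_group_sum(rows)        # "A" appears twice
sorted_result = sequential_group_sum(sorted(rows))  # sort first, then group
memory_result = memory_group_sum(rows)              # no sort needed

print(unsorted_result)
print(sorted_result)
print(memory_result)
print(count_rows([]), count_rows([], always_give_back_row=True))
```

On the unsorted sample, the sequential version reports group "A" twice, while sorting first or grouping in memory gives one row per group. The in-memory version pays for that convenience by holding every group at once, which is why the sorted, sequential approach scales better for large inputs.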