Helical Insight for ETL Tester (QA)

Helical Insight for ETL Tester (QA)

Let’s take an example that your ETL developer has just wrote an ETL script, which loaded the data into the Data Warehouse, few millions of records got inserted into the Data Warehouse.

For QA, it is super difficult to test all (millions) of the data and also a time consuming process. Moreover, there are many other problems like

  • some might not be well versed with writing complex SQL queries
  • There might be multiple data sources involved in generating the Data Warehouse.
  • Writing multiple queries on multiple data sources and then combining it in one single result set would be a sweat breaking task.

Now think of Helical Insight as your friend which helps you to do most of your tasks, allowing you to focus on more critical thinking rather than just doing stare and test whole day. Wondering how Helical Insight can help you achieve the same?

Here is your answer,

  • Helical Insight is capable of communicating with any of your data source(s) like CSV, Excel, MySql, Oracle, SQL Server, Postgres etc.
  • Helical Insight has a drag and drop feature to make your test cases ready, without even having a prior knowledge of SQL queries. Helical Insight generates the SQL queries including the joins between multiple entities. Helical Insight understands your data structure, hence it generates the query even if you are working with different set of data source(s), DW or normalized database, Helical Insight understands all.
  • Using this (drag and drop) facility you can generate all your test cases in a form of report, which you can call it as your test cases. For example, your report does a count of records for a dimension. You can save this report and check the same with source system by creating a separate report. If its a migration project the count should definitely match otherwise we know something is wrong with the ETL Script.
  • Once you have saved these reports (test cases) you can re-executes them again an again to verify your data. In the above situation you have just identified that Helical Insight is really a very useful tool with some really cool features like generating the SQL query for any database or data source(s), ability to execute the queries on any databases or data source(s), ability to save the report / test cases / SQL queries for future usage.

10 Key Performance Indicators to check for quality data after ETL

10 Key Performance Indicators to check for quality data after ETL

Have you ever tested the data considering all the options or scenarios for data leakage? Imagine you have a source data which is converted/migrated by ETL tool or some process into the target

Have you checked your source and target data

Data testing

data. This conversion or migration process continues on day to day basis for months and years. How many times you tested for the correctness of the source and target data as per the business rules and expectations? There are chances of wide gap between the source and target. This article shows how data can leak and still show that the ETL is executed correctly without any error. Again many factors are also dependent on ETL tool. This article shows 10 key performance indicators to check for quality data after ETL. I try to focus all these aspects independent of ETL tool.

ETL Key Performance Indicators for data testing

 

1)  Source reject – Assume that developer created proper metadata for source. But the source contains data which differs with the metadata. In this case, the data will be outright rejected from the source. It will not come into the mainstream for ETL processing. Nobody will come to know that it was rejected from the source. It should be zero.

2) Target reject – Same as above, developer created a metadata for target. But the processed data contains different data which does not match with the target metadata. The data is rejected before inserting in to the target. If you don’t capture, it is lost. It should be zero.

3) Source is having higher data – When you compare source data and target data after completion of ETL, you may see that source is having higher records than target, reason can be anything. It should be zero when compared with the target.

4) Target is having higher data – This scenario exists when source data is deleted but still it exists in the target. This needs target data updating at the time of regular ETL. It should be zero when compared with the source.

5) Source data mismatches with the Target – This happens during ETL, some transformation is applied on source or during migration from one data source to another, it would automatically convert the data and insert into the target. In short, source is different than target. Data mismatch need to be verified and this count should be zero.

6) Target data mismatches with the Source – This happens when the data update takes place on source, but not changed in the target. Data mismatch need to be verified and this count should be zero.

7) Timestamp issue – This is one of the most promising issues during ETL process. When data comes from different time zones, collected at central place and then brought to the staging for Warehousing. Usually timestamp is used for incremental load, this timestamp creates a problem which skips data for specific time zone

  1. Another issue in time zone is timestamp – All of the databases do not support the entire time stamp format. While converting data from one database to another database, this time conversion needs to be taken care of.

8) Time lag during ETL – Many times ETL job runs for few hours on every day basis. Need to have proper signature for ETL completion. If the signature of ETL completion is not proper and if some data insertion or update happens during the period of ETL, then certainly at consecutive execution the job is going to skip few records. Need special method to track the source and target data.

9) Null handling and metadata conversion – Every time when ETL is executed, proper care for null handling and metadata conversion is to be taken care. Need schema compliance check.

10) Monitoring – After completion of ETL it is always required that proper auditing and logging is implemented. This is required for ensuring that everything is working as expected. All the above points and other items need to be logged during ETL project development. Need reporting bursting on the audited and logged data.

In addition to this there are many activities and practices to follow to ensure that the data movement from source to target is as expected. If it is not ensured then, it directly reflects in the data loss which is a revenue loss for a company. Because, if the source data for the dashboard is wrong, it would be shown as it is in the Dashboard or report. This is false and partial data and not corrects figures. Many times, this is not identified unless it is compared with some other data. This comparison is a far remote chance in today’s fast food and high speed data requirement. In many cases error is not generated during ETL process, which gives false notice that the ETL is working fine.

Are you doing all of these during your ETL project development? If not, ensure that you do all these next time when you start with your ETL project. Whether you are a client or a developer or business owner, if you are moving data from one location to other location, try to ensure all of above points are checked or implemented during ETL activity. Or contact us for quality ETL development.

Vaibhav Waghmare , DW Architect, LinkedIn