Change Data Capture(CDC)
- Change data capture (CDC) is the process of capturing changes made at the data source and applying them throughout the enterprise.
- CDC minimizes the resources required for ETL ( extract, transform, load ) processes because it only deals with data changes.
- The goal of CDC is to ensure data synchronicity.
There might be requirements, where we need to compare a data source against an exact replica of the last known state of the data in the data source while performing an ETL in AWS Glue.
Scenario: We should load the data from source database to destination database, by comparing it with the data that was loaded in the previous ETL. We should be finding the inserts, deletes, and updates.
This can be done through the following steps.
Assuming, we are using AWS Glue, fetch the data using a dynamic frame, convert it to spark data frame and perform the SQL operations.
- The data from the source table is loaded to a source data frame, say SRC
- The data from the previous ETL is loaded to a mirror data frame, say MIR
- The table used should have a primary key
Mirror: This holds the data loaded in the previous ETL. This is used to compare against the source. This is updated at the end of each ETL.
Insert logic:
The number of records inserted can be obtained using left-outer join operation performed on the source data frame and mirror data frame where the primary key is null in the mirror.
INSERT = SRC LeftOuterJoin MIR where MIR.primary key is null
Delete logic:
The number of records deleted can be obtained using right outer join operation performed on the source data frame and mirror data frame where the primary key =null in the source.
DELETE = SRC RightOuterJoin MIR where SRC.primary key is null
Update logic:
The number of updates are obtained from two SQL operations performed on source and mirror data frames.
- Subtract the mirror data frames from source data frame, this results in the delta data frame
- Determine the number of Inserts using LeftOuterJoin operation as mentioned above.
- The updated records can be obtained using left-outer join operation performed on delta and Insert where the primary key of the is null in the Insert
DELTA = SRC - MIR
INSERT = SRC LeftOuterJoin MIR where MIR.primary key is null
UPDATE = DELTA LeftOuterJoin INSERT where INSERT.primary key is null
This helps us to identify the insert, delete and update records
In case if you have any queries please get us at support@helicaltech.com
Thank You
Sai Kavya Sathineni
Helical IT Solutions Pvt Ltd
Best Open Source Business Intelligence Software Helical Insight Here
A Business Intelligence Framework
Best Open Source Business Intelligence Software Helical Insight is Here