Change Data Capture(CDC) Capture Changes Made at Data Source

Posted on by By Sai Kavya Sathineni, in ETL | 0

Change Data Capture(CDC)

  • Change data capture (CDC) is the process of capturing changes made at the data source and applying them throughout the enterprise.
  • CDC minimizes the resources required for ETL ( extract, transform, load ) processes because it only deals with data changes.
  • The goal of CDC is to ensure data synchronicity.

There might be requirements, where we need to compare a data source against an exact replica of the last known state of the data in the data source while performing an ETL in AWS Glue.

Scenario: We should load the data from source database to destination database, by comparing it with the data that was loaded in the previous ETL. We should be finding the inserts, deletes, and updates.

This can be done through the following steps.

Assuming, we are using AWS Glue, fetch the data using a dynamic frame, convert it to spark data frame and perform the SQL operations.

  1. The data from the source table is loaded to a source data frame, say SRC
  2. The data from the previous ETL is loaded to a mirror data frame, say MIR
  3. Mirror:	This holds the data loaded in the previous ETL.
    This is used to compare against the source.
    This is updated at the end of each ETL.
    
  4. The table used should have a primary key

Insert logic:

The number of records inserted can be obtained using left-outer join operation performed on the source data frame and mirror data frame where the primary key is null in the mirror.

INSERT = SRC LeftOuterJoin MIR where MIR.primary key is null

Insert Logic
Delete logic:

The number of records deleted can be obtained using right outer join operation performed on the source data frame and mirror data frame where the primary key =null in the source.

DELETE = SRC RightOuterJoin MIR where SRC.primary key is null

delete logic

Update logic:

The number of updates are obtained from two SQL operations performed on source and mirror data frames.

  1. Subtract the mirror data frames from source data frame, this results in the delta data frame
  2. DELTA = SRC - MIR
    
  3. Determine the number of Inserts using LeftOuterJoin operation as mentioned above.
  4. INSERT = SRC LeftOuterJoin MIR where MIR.primary key is null
    
  5. The updated records can be obtained using left-outer join operation performed on delta and Insert where the primary key of the is null in the Insert
UPDATE = DELTA LeftOuterJoin INSERT where INSERT.primary key is null

updates

This helps us to identify the insert, delete and update records

In case if you have any queries please get us at support@helicaltech.com

Thank You
Sai Kavya Sathineni
Helical IT Solutions Pvt Ltd

logo

Best Open Source Business Intelligence Software Helical Insight Here

logo

A Business Intelligence Framework


logo

Best Open Source Business Intelligence Software Helical Insight is Here

logo

A Business Intelligence Framework

0 0 votes
Article Rating
Subscribe
Notify of
0 Comments
Inline Feedbacks
View all comments