Change Data Capture (CDC) – Methods, Approaches, Alternatives

Change data capture:

Change data capture is the process of tracking data that has changed in a source system so that corresponding action can be taken downstream. In essence, it means capturing the changes made at the data source and applying them at all the other places, which reduces the effort required at the ETL step.

 

Different strategies:

  • Push: the source process detects changes within its own process, creates a snapshot of them, and delivers the rows downstream. The downstream process uses the snapshot, creates its own subset, and delivers it to the next process.
  • Pull: the target immediately downstream from the source prepares a request for data from the source. The downstream target then delivers the snapshot to the next target, as in the push model. (A minimal sketch of the pull model follows this list.)
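
To make the pull model concrete, here is a minimal sketch in Python in which the downstream target periodically polls the source for rows it has not yet seen. The connection details, the orders table, and the id-based high-water mark are all illustrative assumptions, not part of any specific product.

    # A minimal sketch of the pull model: the downstream target polls the
    # source for rows it has not pulled yet. Host, credentials, table and
    # column names are illustrative assumptions.
    import time
    import mysql.connector

    source = mysql.connector.connect(
        host="source-host", user="etl", password="secret", database="sales"
    )
    cur = source.cursor()
    last_id = 0  # high-water mark: the largest id already pulled

    while True:
        cur.execute(
            "SELECT id, status FROM orders WHERE id > %s ORDER BY id",
            (last_id,),
        )
        for row in cur.fetchall():
            print("pulled:", row)  # deliver this snapshot to the next target
            last_id = row[0]
        time.sleep(30)  # poll interval; the pull frequency is up to the target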

 

Usage of Change Data Capture:

Change data capture is often used in data warehousing, since warehousing involves the extraction and transformation of data from one or more databases into a data warehouse for analysis.

Prior to CDC, change data was captured using other methodologies, e.g. table differencing or table comparison. These amount to taking a copy of the source, comparing it with the target, and spotting the differences. For a limited number of records this approach might work, but for larger record counts it is highly inefficient and results in immense overhead.

 

Table differencing: This involves copying the entire table from the source to a staging area. Since the entire table is shifted, there is additional overhead and the method is not efficient. Also, if the same row changes multiple times between comparisons, the intermediate changes cannot be captured using this methodology.

 

Change value selection: This involves capturing changed values from the source database based on the value of a specific column, such as a last-modified timestamp. However, implementing this requires firing queries to find the changed data, which is an additional overhead. This method also cannot find or capture intermediate values: changes that happen and are overwritten between two query runs are not captured.
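
To illustrate, here is a minimal sketch of change value selection in Python, assuming a MariaDB orders table with a last_modified timestamp column and a one-row cdc_watermark table that remembers the previous run. The connection details and all table and column names are illustrative assumptions.

    # A minimal sketch of change value selection. The watermark table and
    # all names below are illustrative assumptions.
    import mysql.connector

    conn = mysql.connector.connect(
        host="localhost", user="etl", password="secret", database="sales"
    )
    cur = conn.cursor()

    # Read the high-water mark left by the previous extraction run.
    cur.execute("SELECT last_run FROM cdc_watermark")
    (last_run,) = cur.fetchone()

    # Fetch only rows changed since then. Note the stated weakness: rows
    # changed several times since last_run show only their final state.
    cur.execute(
        "SELECT id, status, last_modified FROM orders "
        "WHERE last_modified > %s",
        (last_run,),
    )
    for row in cur.fetchall():
        print(row)  # hand the changed row to the ETL step here

    # Advance the watermark for the next run.
    cur.execute("UPDATE cdc_watermark SET last_run = NOW()")
    conn.commit()
    conn.close()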

 

A slightly better approach than the ones mentioned above is to use database triggers. These triggers fire after any Insert/Update/Delete statement, so whenever there is a change, the changed information is written to a change table. The only disadvantage of this solution is that it is database specific, and migrating such a solution becomes complex precisely because it is DB specific.
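
As a sketch of the idea, the following Python script creates a hypothetical change table and an INSERT trigger on a MariaDB orders table. The table, trigger, and column names are illustrative, and a real setup would add UPDATE and DELETE triggers in the same style.

    # A minimal sketch of trigger-based capture on MariaDB. All names are
    # illustrative; UPDATE and DELETE triggers would be added similarly.
    import mysql.connector

    conn = mysql.connector.connect(
        host="localhost", user="admin", password="secret", database="sales"
    )
    cur = conn.cursor()

    # Change table that records one row per captured event.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS orders_changes (
            change_id  INT AUTO_INCREMENT PRIMARY KEY,
            order_id   INT NOT NULL,
            operation  CHAR(1) NOT NULL,           -- 'I', 'U' or 'D'
            changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)

    # The database itself records every insert, so nothing is missed and
    # no polling query is required; the syntax, however, is DB specific.
    cur.execute("""
        CREATE TRIGGER orders_after_insert
        AFTER INSERT ON orders
        FOR EACH ROW
        INSERT INTO orders_changes (order_id, operation)
        VALUES (NEW.id, 'I')
    """)

    conn.commit()
    conn.close()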

 

Another approach is the use of log readers. Many enterprise DBMSs can write transaction logs (the feature can be switched on and off), and the amount and detail of information stored in the log can be configured as well. By reading these logs we can learn of the changes happening in the system. The problems with this approach are that a log-reading component may have to be purchased, migration might be an issue, and some databases might not have the logging features at all. Also, logs are sometimes written even for changes that are never committed, which can be an issue as well.
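
As one concrete illustration (my choice of tool, not one named in this post), the open source python-mysql-replication package can stream row events from a MariaDB/MySQL binary log, assuming the server runs with binlog_format=ROW and the user has replication privileges:

    # A minimal log-reader sketch using the python-mysql-replication
    # package. Assumes binlog_format=ROW on the server and a user with
    # REPLICATION SLAVE / REPLICATION CLIENT privileges.
    from pymysqlreplication import BinLogStreamReader
    from pymysqlreplication.row_event import (
        DeleteRowsEvent,
        UpdateRowsEvent,
        WriteRowsEvent,
    )

    stream = BinLogStreamReader(
        connection_settings={
            "host": "127.0.0.1", "port": 3306,
            "user": "repl", "passwd": "secret",
        },
        server_id=100,          # must be unique among replicas
        blocking=True,          # wait for new events instead of exiting
        resume_stream=True,
        only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    )

    try:
        for event in stream:    # each event is one logged row change
            for row in event.rows:
                print(event.schema, event.table, type(event).__name__, row)
    finally:
        stream.close()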

 

Capturing data with CDC:

CDC can capture data in two modes: synchronous and asynchronous. In synchronous mode, any change to data in the source system is propagated to the target immediately, as part of the DML (data manipulation language) operation itself, whereas that is not the case with asynchronous mode. Synchronous capture uses triggers and results in no latency or delay, though it carries more overhead compared to asynchronous capture.

Benefits of CDC

– CDC can improve performance and reduce redundancy by copying only the changed data.

– Log-based CDC also reduces the load on operational systems, since it reads only the logs rather than querying the database.

– Little to no latency, so end business users see the changes reflected in their BI almost immediately.

– CDC can easily handle high-volume transactions.

 

We can implement CDC using the Talend Integration Suite; more information about that will be covered in another blog.

 

For any query, please reach out to me on [email protected]

MariaDB to MariaDB Data Replication Strategies

Data Replication Strategies

 

This blog explains the different approaches that can be used for data replication from a master MariaDB database to a slave MariaDB database.

Replication Benefits

Replication has a number of benefits, some of which include:

  • Scalability: By having one or more slave servers, reads can be spread over multiple servers, reducing the load on the master. The most common scenario for a high-read, low-write environment is to have one master, where all the writes occur, replicating to multiple slaves, which handle most of the reads.
  • Lower hardware requirements: Having a slave database provides the freedom to play with that data without affecting the master database. All the reporting and similar workloads can be run on this particular database, so a very high hardware requirement on the master database can be avoided.
  • Data analysis: Analyzing data may have too much of an impact on a master server, and this can similarly be handled on a slave server, while the master continues unaffected by the extra load.
  • Backup assistance: Backups can more easily be run if a server is not actively changing the data. A common scenario is to replicate the data to a slave, which is then disconnected from the master with the data in a stable state. The backup is then performed from this server.
  • Distribution of data: Instead of being connected to a remote master, it’s possible to replicate the data locally and work from this data instead.

 

 

Approach 1: MariaDB to MariaDB replication using a master-slave configuration

 

Data Replication

Replication is a feature allowing the contents of one or more master servers to be mirrored on one or more slave servers.

We can exert control over which data to replicate: all databases, one or more specific databases, or selected tables within a database can be replicated.

The main mechanism used in replication is the binary log. If binary logging is enabled, all updates to the database (data manipulation and data definition) are written into the binary log as binlog events. Slaves read the binary log from each master in order to access the data to replicate.

In this particular approach, masters and slaves do not need to be in constant communication with each other. We can take the servers offline or disconnect them from the network, and when they come back, replication will continue where it left off. The slave in this case runs in read-only mode, and data copying from master to slave happens in near real time with minimal lag. Hence, on the BI and reporting front, we gain near-real-time BI capabilities.
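
As a sketch, the following Python snippet shows the slave-side setup for this approach, assuming binary logging (log-bin) is already enabled on the master and a replication user exists there. Host names, credentials, and binlog coordinates are illustrative assumptions.

    # A minimal sketch of pointing a MariaDB slave at its master. The
    # binlog file and position come from SHOW MASTER STATUS on the master;
    # all values below are illustrative.
    import mysql.connector

    slave = mysql.connector.connect(
        host="slave-host", user="admin", password="secret"
    )
    cur = slave.cursor()

    # Tell the slave where the master is and where in the binary log
    # to start replicating from.
    cur.execute("""
        CHANGE MASTER TO
            MASTER_HOST='master-host',
            MASTER_USER='repl',
            MASTER_PASSWORD='replpass',
            MASTER_LOG_FILE='mariadb-bin.000001',
            MASTER_LOG_POS=4
    """)
    cur.execute("START SLAVE")

    # Verify that the replication IO and SQL threads are running.
    cur.execute("SHOW SLAVE STATUS")
    print(cur.fetchone())

    slave.close()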

 

 

Approach 2: Usage of third-party tools

The second approach which can be used for data replication from the master DB to the slave DB is the use of third-party DB replication tools. One example which could be cited here is Tungsten Replicator, an open source third-party data replication tool. It works not only with MySQL but also with MariaDB. Tungsten can help solve problems like promoting masters easily from pools of slaves, replicating data between different database versions, replicating efficiently across sites, building complex topologies, and parallelizing data flow between servers. Tungsten Replicator runs equally well in cloud and locally hosted environments, and its users range from tiny start-ups to the largest web properties on the planet.

 

 

Approach 3: Using ETL (a free CDC implementation)

The third approach which could be followed is to use ETL. Open source ETL options include Talend Open Studio and Kettle (Pentaho Data Integration).

To implement the master-slave database configuration this way, we would implement CDC (change data capture). With CDC we can identify, capture, and deliver only the changes made to the master database, and the frequency of this can be configured. Change Data Capture (CDC) is based on the publisher/subscriber principle: the publisher captures the change data and makes it available to the subscribers. Implementing CDC also helps reduce the workload on the ETL process, since only the changed data is passed along. (A minimal sketch of the subscriber side follows below.)
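
For illustration, here is a minimal sketch of the subscriber side of such a publisher/subscriber setup, assuming the master publishes changes into a change table like the orders_changes table sketched in the trigger example earlier, extended with a consumed flag. Every name and credential here is an illustrative assumption.

    # A minimal subscriber sketch: read unconsumed change records from the
    # master's change table and apply them to the slave. All names are
    # illustrative; orders_changes is assumed to carry a `consumed` flag.
    import mysql.connector

    master = mysql.connector.connect(
        host="master-host", user="etl", password="secret", database="sales"
    )
    slave = mysql.connector.connect(
        host="slave-host", user="etl", password="secret", database="sales"
    )
    m, s = master.cursor(), slave.cursor()

    m.execute(
        "SELECT change_id, order_id, operation FROM orders_changes "
        "WHERE consumed = 0 ORDER BY change_id"
    )
    for change_id, order_id, operation in m.fetchall():
        if operation in ("I", "U"):
            # Re-read the current row on the master, upsert it on the slave.
            lookup = master.cursor()
            lookup.execute("SELECT id, status FROM orders WHERE id = %s", (order_id,))
            row = lookup.fetchone()
            if row:
                s.execute("REPLACE INTO orders (id, status) VALUES (%s, %s)", row)
        else:  # 'D': the row was deleted on the master
            s.execute("DELETE FROM orders WHERE id = %s", (order_id,))
        # Mark the change as delivered so the next run skips it.
        m.execute(
            "UPDATE orders_changes SET consumed = 1 WHERE change_id = %s",
            (change_id,),
        )

    slave.commit()
    master.commit()
    master.close()
    slave.close()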

The advantage of implementing replication via ETL is flexibility: security measures such as data encryption can be applied; while copying data we have options such as selecting which tables' data to replicate and how much data to replicate; and rules can be applied while replicating.