CDC Data to Azure Synapse: 2 Easy Methods

• May 22nd, 2023

CDC Data to Azure Synapse FI

“If information is power, then immediate access to information is turbo power.”[1]

Real-time is the way ahead. This is why change data capture has become such a basic need for the modern data stack.

Change Data Capture bridges the data-driven value chain by driving data ingestion from multiple sources, then transforming it on the go and migrating it to a cloud data platform.

In this article, we’ll go over a couple of handy methods to help you CDC data to Azure Synapse with ease. Let’s dive in!

Change Data Capture Overview

Change Data Capture is a design pattern that lets users detect changes at the data source and then apply them across the board. Actionable data is derived from a vast multitude of sources like machine logs, social media, log files, etc. With Change Data Capture, you won’t miss any opportunities for real-time insights as it consistently captures data changes and updates from transactional sources.

It also creates a smooth flow and improves the system’s reliability. This is particularly pivotal for data warehouse/cloud architectures with a constant flow of data. You can dynamically modify the data stream to align with schema changes, which comes in handy when dealing with different types of data coming in from live data sources.

In a modern data architecture, people can continuously ingest CDC data into a data repository of their choice through an automated data pipeline. It can transfer data to message queues to analyze the streaming log data.

To maximize the impact of CDC, you can couple it with streaming ingestion to address real-time analytics use cases. For instance, real-time inventory analysis and fraud detection.

Change Data Capture Techniques

Here are a couple of nifty Change Data Capture techniques that you can use to extract maximum value from your data:

  • Trigger-based: You can build CDC at the application level by defining triggers and creating your change log within shadow tables. Implementing triggers will be dependent on the database where the implementation is to be carried out. However, trigger-based Change Data Capture has a couple of shortcomings too. It might negatively affect the application performance, increase the processing overhead, and grind source production operations to a screeching halt.
  • Log-based: Generally, every DBMS has a transaction log file that records all the database changes made by every transaction. Every DML operation is captured in a log file in a database, alongside the timestamp or a database-specific unique identifier that shows when every operation was executed. This technique uses this log information to identify the changes and execute CDC operations. 
  • Timestamp-based: This technique depends on a timestamp field in the source to identify and pull the modified data sets.
  • Snapshot-based: You can create a complete extract of data from the source table in the target staging area by leveraging the snapshot-based change data capture technique. So, the next time the incremental data needs to be loaded, a second snapshot or version of the source table is matched to the original one to pick any changes. 

Why should you CDC data to Azure Synapse?

  • Virtually Limitless Scalability: Azure Synapse is a cloud-based platform that can organize and query relational and non-relational data significantly faster than an on-prem system. Unlimited concurrency and workload isolation for critical workloads optimizes performance.
  • Robust Data Collaboration Capabilities: It natively integrates with Azure Data services, allowing your team to share the data in your data lakes/ data warehouses. This eliminates the need to replicate large volumes of data from multiple enterprise databases since all the data is available in a single robust platform. 
  • Native Integration with Power BI: Azure Synapse like Power BI is a part of the Microsoft suite. This means that you get better visualization and business intelligence capabilities with your data. You’ll find it easier to report key metrics, and analytics allows you to extract maximum value from your data.
  • Intuitive UI: Your decision makers will have a field day with Azure Synapse’s visual user interface as it gives them complete visibility into the work of data practitioners, across warehousing, BI, analytics, and machine learning. Since there is little to no coding involved here, the learning curve is pretty gentle for even non-technical users. 

Methods to CDC Data to Azure Synapse

In this section, we’ll go over two simple methods to CDC data to Azure Synapse — using Azure Data Factory and using a third-party tool, Hevo.

Using Azure Data Factory to CDC Data to Azure Synapse

You can start using Azure Data Factory to CDC data through the factory-level Change Data Capture resource.

  • Step 1: From the main pipeline designer, click on ‘New’ under the Factory Resources to start a new Change Data Capture process.
  • Step 2: The next step in this configuration walk-through experience is choosing your sources and destination (Azure Synapse for this scenario). Apply the transformations you want for your data, and then click ‘Start’ to begin your data capture. 
  • Step 3: In this method, you don’t have to worry about designing data flow activities or creating pipelines. You’re only charged for 4 cores of General Purpose Data Flows. Next, you can set up a preferred data latency, which ADF will use to periodically look for any changes in your data.

You can automatically detect and pull the modified data through the ADF mapping data flow from the source databases.

You don’t need any ID or timestamp columns to identify the changes. ADF will simply chain a source and a sink transform to a database dataset within a mapping data flow. This’ll allow you to see the changes in the source database. These’ll be automatically applied to the target database so that you can easily synchronize data between the two tables.

Here are a few best practices to keep in mind when pulling data from databases using ADF:

  • If you can’t find your database in the ADF connector list, you can use the auto incremental extract. Here, you’ll only need to input an incremental column to capture the changes. ADF handles everything else. It’ll create a dynamic query for delta managing and loading the checkpoint in every activity run.
  • The simplest way to capture changes in your sources is through native change data capture. This’ll be less burdensome for your source database when ADF extracts data to process it.     

Drawbacks of using Azure Data Factory to CDC Data to Azure Synapse

  • Azure Data Factory CDC has limited data transformation capabilities when compared to other tools in the market. You might need other tools to supplement or execute your data transformations.
  • Debugging and monitoring data pipelines can be tricky with Azure Data Factory as your go-to tool for CDC data to Azure Synapse. The issues become more prominent when errors occur. With Azure Data Factory, it can be difficult to zero in on what caused the issue. This might lead to delays in resolution and troubleshooting.
  • Even though Azure Data Factory offers a bunch of pricing options, the cost can shoot up when facing complex integrations or huge volumes of data. So, if you want to execute more complex integrations, you might be better off choosing a third-party data integration like Hevo.

Using Hevo Data to CDC Data to Azure Synapse

Hevo’s point-and-click interface allows developers to implement such a real-time data load with the lowest time to production possible. You can use Hevo to CDC data to Azure Synapse in three ridiculously simple steps:

  • Step 1: You can use CDC data in Hevo only when the source is a relational database. To query data from SaaS sources, Hevo uses the API of the respective source. The specifications of the Source API would affect how data gets queried for different objects in the source. Pick a source from the ‘Select Source Type’ window. 
CDC Data to Azure Synapse: Select Source in Hevo
  • Step 2: For relational databases that Hevo supports, you can choose different modes to query for the data to be pulled. For this scenario, we’ll choose ‘Change Data Capture’ as our query mode. For every object you ingest using Hevo, you can choose the query mode:
CDC Data to Azure Synapse: Query Modes
Image Source
  • Step 3: Next, you need to configure Azure Synapse Analytics as a destination in Hevo. On the ‘Configure Destination’ page, select Azure Synapse Analytics. In the Configure your Azure Synapse Analytics Destination page, fill in the following fields:
CDC Data to Azure Synapse: Configure Azure Synapse Analytics as a Destination in Hevo

And voila! You’ve set up a pipeline in Hevo to CDC data to Azure Synapse within a few minutes.

Build Your Data Pipeline for Free

Points to Remember:

  • In Azure Synapse Analytics, the tables in dedicated SQL pools are created with clustered columnstore index by default. This is because it decreases data storage costs and improves query performance. You need to keep in mind that any variable-length column in these tables cannot contain more than 8000 bytes of data. Therefore, any event containing data that exceeds this limit will be sidelined by Hevo. You can transform the data to reduce the data size and replay them.
  • A limitation here is that Hevo doesn’t support the replication of data to serverless SQL pools in Azure Synapse Analytics.

FAQs

What is the new Azure Data Factory Change Data Capture capability?

Change Data Capture is now offered as one of the factory resources on ADF. This means you can now easily configure change data capture from a set of clickthrough screens without needing to design a dataflow or pipeline. This’ll support JSON, Avro, Parquet, XML, SQL Server, Azure SQL Database, etc. as sources. For more information, you can read Microsoft’s documentation on ADF change data capture.

What is the difference between Change Data Capture and Change Tracking?

The primary difference between change data capture and change tracking is that change data capture will maintain a history of row changes, including the actual data that was changed. 

Change tracking, on the other hand, will only store the most recent changes to every row. It won’t maintain any records of change history.

Change data capture usually happens asynchronously. It reads the database’s transaction log to pinpoint when a change has occurred. 

However, change tracking works synchronously.

This means that change information is refreshed in real time, lining up with the addition of that change to the database.

What is the difference between Change Data Capture and Triggers?

The difference between CDC and triggers lies in their impact on database performance. Change Data Capture uses the changes applied to the logs instead of the actual database, therefore, having no impact on the database performance.

Triggers will impact database performance through locks, resource sharing, etc.

Conclusion

A crucial component of modernizing and mastering your data is understanding the role CDC can play in extracting actionable insights from your data. CDC speeds up your decision-making process, boosts the time to value while reducing the total cost of ownership, and reduces the impact on production databases.

With Hevo at your helm for CDC, companies can ingest and replicate a large volume of their data from 150+ sources to a destination of their choice for processing and analytics. Forget the hassle of building and maintaining data pipelines from scratch, as Hevo takes care of it so that you can focus on serving your customers.

For the rare times things do go wrong, Hevo ensures zero data loss. Add 24*7 customer support to the list, and you get a reliable tool that puts you at the wheel with greater visibility. Start modernizing your data stack by scheduling a demo now!

References:

No-Code Data Pipeline for Your Data Warehouse