In today’s world, data in operational systems changes quite frequently, while traditional Data Warehouses do not hold up-to-date data: changes accumulate in the operational systems and are loaded into the Warehouse only at a set frequency. Snowflake is a modern Data Warehouse that has become an industry-leading Cloud-Based SaaS (Software-as-a-Service) Data Platform. Snowflake Change Data Capture (CDC) makes it easy to detect new changes/data in a table for efficient processing.

This article is all about Snowflake Change Data Capture and how it keeps you updated with the current data changes. But before getting started with Snowflake Change Data Capture, let’s discuss Snowflake and its salient features in brief.

What is Snowflake?


Snowflake is a modern Cloud Data Warehousing solution provided as a SaaS offering. It is built on Amazon Web Services, Microsoft Azure, or Google Cloud infrastructure and provides a near-unlimited platform for storing and retrieving data. Snowflake Data Warehouse makes use of a proprietary SQL Database Engine with a Cloud-specific architecture.

Snowflake requires no hardware or software to install, configure, or manage, making it suitable for enterprises that do not want to devote resources specifically to in-house server setup, maintenance, or support. Snowflake’s security and sharing features make it easy for businesses to securely exchange and share data in real-time using any ETL solution. Snowflake’s architecture allows for Big Data flexibility. When compared to other Data Warehouses in the market, Snowflake is noted for its scalability and relative ease of use.

Key Features of Snowflake

Below are the key features responsible for the immense popularity of Snowflake.

  • Caching Paradigm: Snowflake employs a Caching Paradigm to quickly deliver results from the cache. It leverages persisted query results to avoid re-generating the same report.
  • Standard and Extended SQL Support: Snowflake supports most of the DDL and DML commands of SQL for querying data. In addition to that, it also supports advanced DML transactions, stored procedures, lateral views, etc.
  • Scalability: It is a well-known fact that Snowflake’s architecture separates its “Compute” and “Storage” units, thereby allowing them to scale differently. As a result, customers can pay only for the services used.
  • Security: Snowflake supports a variety of improved mechanisms for authentication. This includes Two-Factor Authentication and SSO via Federated Authentication.
  • Semi-Structured Data Support: Snowflake’s architecture supports the VARIANT data type and a schema-on-read approach, which enables Structured and Semi-Structured data to be stored in the same place.
Supercharge Snowflake Migration with Hevo

Migrating your data to Snowflake doesn’t have to be complex. Relax and go for a seamless migration using Hevo’s no-code platform. With Hevo, you can:

  1. Effortlessly extract data from 150+ connectors
  2. Tailor your data to Snowflake’s needs with features like drag-and-drop and custom Python scripts.
  3. Achieve lightning-fast data loading into Snowflake, making your data analysis ready.

Try Hevo and see why customers like Slice and Harmoney have upgraded to a powerful data and analytics stack by incorporating it!

Get Started with Hevo for Free

What is Change Data Capture (CDC)?


Change Data Capture (CDC) is an ideal solution for capturing the near real-time movement of data in Databases. CDC refers to a set of software design patterns used to detect and track data changes in a Database. Whenever data changes, CDC triggers an event so that a specific downstream action can be executed. All companies require access to real-time Data Streams to perform efficient Data Analytics, and CDC provides near-real-time movement of data by processing it as soon as new database events occur.

Events are captured and streamed in real-time using CDC and it helps to achieve reliable, low-latency, and scalable data replication in high-velocity data environments. It eliminates the process of bulk data loading by implementing incremental data loading. This way, Data Warehouses or Databases remain active in order to execute specific actions as soon as a Change Data Capture event occurs. Furthermore, companies can send fresh data updates to BI (Business Intelligence) tools and team members in near real-time with CDC, keeping them up to date.

What are Snowflake Streams?

In today’s data-driven economy, data in your systems changes quite frequently, and it would be a complex task to fully load the data into Snowflake every time. It would cost you both time and money. This is where Snowflake Change Data Capture (CDC) comes in. You can implement CDC in Snowflake effectively with just a few commands, and this is made possible by the concept of Streams in Snowflake.

A Snowflake Stream object basically tracks all DML changes made to rows in a source table and stores the metadata of each change. This metadata between two transactional points of time in a table is used later in order to retrieve the changed data.

Snowflake Streams capture an initial snapshot of all the rows present in the source table as the current version of the table with respect to an initial point in time. The stream then enables Change Data Capture every time you insert, update, or delete data in your source table. The stream exposes additional metadata columns for every committed DML change, so by capturing these CDC events you can easily merge just the changes from source to target using the MERGE statement.

Columns in Snowflake Stream

Snowflake streams contain the following additional columns (a query sketch illustrating them follows the list):

  • METADATA$ACTION: Indicates the DML operation (INSERT, DELETE) recorded.
  • METADATA$ISUPDATE: Indicates whether the operation was part of an UPDATE statement.
  • METADATA$ROW_ID: Specifies the unique and immutable ID for the row, which can be used to track changes to specific rows over time.
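
For illustration, here is a minimal query sketch showing how these metadata columns appear alongside the regular table columns. The stream name my_stream and its id column are hypothetical placeholders:

-- assumes a stream named my_stream already exists on a source table with an id column
select
    id,                 -- regular column from the source table
    metadata$action,    -- 'INSERT' or 'DELETE'
    metadata$isupdate,  -- TRUE when the change is part of an UPDATE
    metadata$row_id     -- unique, immutable row identifier
from my_stream;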

Types of Snowflake Streams

There are three types of Snowflake streams available, based on the metadata recorded by each (example CREATE STREAM statements for all three are sketched after this section):

1. Standard

  • Supported for streams on standard tables, dynamic tables, Snowflake-managed Iceberg tables, directory tables, or views.
  • A standard (i.e., delta) stream tracks all DML changes to the source object, including inserts, updates, and deletes (including table truncates).
  • Standard streams cannot retrieve change data for geospatial data.

2. Append-Only

  • Supports streams on standard tables, dynamic tables, Snowflake-managed Iceberg tables, or views. 
  • An append-only stream exclusively tracks row inserts. Update, delete, and truncate operations are not captured by append-only streams. 
  • You can’t create an append-only stream in a target account using a secondary object as the source.

3. Insert-Only

  • Supported for streams on dynamic tables, Iceberg tables, or external tables.
  • An insert-only stream tracks row inserts only; it does not record delete operations that remove rows from an inserted set (i.e., no-ops).
  • Overwritten or appended files are essentially handled as new files.
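
The statements below sketch how each stream type can be created. The table and stream names are placeholders; APPEND_ONLY and INSERT_ONLY are the CREATE STREAM parameters Snowflake uses to select the stream type:

-- standard (delta) stream: records inserts, updates, and deletes
create or replace stream my_standard_stream on table my_table;

-- append-only stream: records inserts only
create or replace stream my_append_stream on table my_table append_only = true;

-- insert-only stream on an external table: records inserts only
create or replace stream my_insert_stream on external table my_ext_table insert_only = true;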

How to Set Up Snowflake Change Data Capture with Streams?

To get started with Snowflake Change Data Capture, log in to Snowflake Web-Based UI or SnowSQL. Then, follow the below-mentioned steps.

Step 1: Run the following commands to create a database named CDC_STREAM and set it as the database in use.

create or replace database CDC_STREAM;
Use CDC_STREAM;

Step 2: You’ll need a source and a destination table to operate upon. Run the following command to create a source table named “employees_source”.

create or replace table employees_source (
  id int,
  first_name varchar(200),
  last_name varchar(200)
);

Step 3: Run the following command to create a destination table named “employees_destination”.

create or replace table employees_destination ( id int, first_name varchar(200), last_name varchar(200) );

Step 4: Run the following command to create a stream on top of the source table. This stream will track any data changes made to the employees_source table.

create or replace stream employee_stream on table employees_source;

Note: Until a DML statement consumes the stream, the stream reports only the net change since its offset. So, data changes made to rows inserted after the stream’s offset are still reported as INSERTs rather than UPDATEs.

For the purpose of this demonstration, let’s add a couple of records to the employees_source table. Run the following command to do so.

insert into employees_source values (1,'Ron','Higgins');
insert into employees_source values (2,'Joseph','Geller');
insert into employees_source values (3,'Charlie','Bricks');
insert into employees_source values (4,'Mathew','Smith');
insert into employees_source values (5,'Daisy','Whelsh');
insert into employees_source values (6,'Scott','Hall');
insert into employees_source values (7,'Jose','Martinez');
insert into employees_source values (8,'Bryan','Son');
insert into employees_source values (9,'Nisha','Grant');
insert into employees_source values (10,'Hamilton','Gell');
insert into employees_source values (11,'Larry','Armstrong');
insert into employees_source values (12,'Emilia','Rodrigez');
insert into employees_source values (13,'Kate','Becket');

Step 5: Let’s view the change log in the stream before proceeding.

select * from employee_stream;

As you can see, the stream contains the inserted records along with the metadata fields. The three additional metadata columns (METADATA$ROW_ID, METADATA$ISUPDATE, and METADATA$ACTION) make it very easy to detect whether a row has been inserted, updated, or deleted. Since this is the initial load, every record appears as an INSERT.

Now, there are no records in the destination table as they’re still in the stream.
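
If you want to verify this, a quick pair of counts works; this is a simple check, not part of the CDC flow itself:

-- the destination table is still empty at this point
select count(*) from employees_destination;

-- the stream holds the 13 inserted rows (querying a stream does not consume it)
select count(*) from employee_stream;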

Step 6: Let’s move the records to the destination using the MERGE statement.

use CDC_STREAM;
merge into employees_destination as T
using (select * from employee_stream) as S
on T.id = S.id
when matched and S.metadata$action = 'INSERT' and S.metadata$isupdate
  then update set T.first_name = S.first_name, T.last_name = S.last_name
when matched and S.metadata$action = 'DELETE'
  then delete
when not matched and S.metadata$action = 'INSERT'
  then insert (id, first_name, last_name) values (S.id, S.first_name, S.last_name);

Now, the data has moved to the destination and there is nothing left in the stream: once the MERGE statement has consumed the stream, the stream object becomes empty.

The cycle continues, and the stream will continue to record all the changes that happen in the employees_source table.
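
If you want to confirm that the stream has been consumed, a quick check such as the one below works. SYSTEM$STREAM_HAS_DATA is a built-in Snowflake function that reports whether a stream currently contains change records:

-- returns no rows once the MERGE has consumed the stream
select * from employee_stream;

-- returns FALSE once the stream offset has advanced past all recorded changes
select system$stream_has_data('EMPLOYEE_STREAM');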

Step 7: Let’s view the destination table now. Run the following command to do so.

select * from employees_destination;

As you can see, the destination table is updated with the records.

Step 8: Now, it’s time to demonstrate the Snowflake Change Data Capture. For that purpose, go ahead and make a change in the source table. Let’s update a record and observe the stream.

update employees_source
set last_name = 'Harper'
where id = 1;

This basically sets the last_name as “Harper” for id=1.
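
To see how this update is represented, you can query just the metadata columns of the stream; this sketch assumes the update above has been committed:

-- the stream shows both a DELETE row and an INSERT row for the updated record
select id, last_name, metadata$action, metadata$isupdate
from employee_stream;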

  • While observing the stream, you’ll find 2 records, an INSERT and a DELETE. This is because the update is recorded as a DELETE of the original last_name (i.e., Higgins) for id = 1 plus an INSERT of the new last_name (i.e., Harper), both with METADATA$ISUPDATE set to TRUE.
  • Use the MERGE command to update the destination table as shown below.
use CDC_STREAM;
merge into employees_destination as T
using (select *
       from employee_stream
       where not (metadata$action = 'DELETE' and metadata$isupdate = TRUE)) as S
on T.id = S.id
when matched and S.metadata$action = 'INSERT' and S.metadata$isupdate
  then update set T.first_name = S.first_name, T.last_name = S.last_name
when matched and S.metadata$action = 'DELETE'
  then delete
when not matched and S.metadata$action = 'INSERT'
  then insert (id, first_name, last_name) values (S.id, S.first_name, S.last_name);

Step 9: Run the following command to take a look at the destination table.

select * from employees_destination;

Here, you will clearly see that the changes made to the source table are now reflected in the destination table. That’s it; that’s how easy Snowflake Change Data Capture is.

For a better understanding, you can try out Snowflake Change Data Capture by deleting or updating other records in the source table. And, don’t forget to use the MERGE command for the changes to reflect in the destination table.
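
For example, the delete below is one change you could try; after running it, re-run the MERGE statement from Step 8 so the deletion propagates to employees_destination:

-- remove one employee from the source table
delete from employees_source where id = 2;

-- the stream now shows a DELETE record with METADATA$ISUPDATE = FALSE
select id, metadata$action, metadata$isupdate from employee_stream;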

Permissions Required to Query a Stream

To query a stream, you need a role that has been granted at least the following privileges:

  • Database: USAGE
  • Schema: USAGE
  • Stream: SELECT
  • Table: SELECT
  • View: SELECT
  • External stage: USAGE
  • Internal stage: READ

What are the Limitations of Snowflake Streams?

There are a few limitations to using Snowflake Streams for CDC:

  1. You can’t use standard or append-only streams on Iceberg tables that use an external catalog. (Insert-only streams are supported.)
  2. You can’t track changes on a view defined with a UNION ALL of a table and itself or a clone of itself.
  3. You can’t track changes on a view that uses GROUP BY clauses.
  4. If a column is added or modified to NOT NULL, queries on the stream may fail because the stream might still output rows containing NULL values that are no longer permissible.

Do’s and Don’ts of Snowflake CDC

Do’s for Snowflake Change Data Capture:

  1. Use Streams for CDC: Use Snowflake’s Streams to capture the inserts, updates, and deletes happening on your data tables transparently.
  2. Schedule CDC Queries at Regular Intervals: Schedule CDC queries to run at regular intervals so they do not lag but capture changes continuously.
  3. Use Task Scheduling: Automate the execution of CDC processes by scheduling Snowflake Tasks that trigger pipelines based on timed or event-based conditions (a minimal task sketch follows this list).
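
As an illustration of the task-based approach, here is a minimal sketch. The warehouse name my_wh and the task name are assumptions, and the MERGE body is abbreviated (it handles inserts and deletes only):

-- run every 5 minutes, but only when the stream actually contains new changes
create or replace task employee_cdc_task
  warehouse = my_wh
  schedule = '5 minute'
  when system$stream_has_data('EMPLOYEE_STREAM')
as
  merge into employees_destination as T
  using (select * from employee_stream) as S
  on T.id = S.id
  when matched and S.metadata$action = 'DELETE' then delete
  when not matched and S.metadata$action = 'INSERT' then
    insert (id, first_name, last_name) values (S.id, S.first_name, S.last_name);

-- tasks are created in a suspended state and must be resumed before they run
alter task employee_cdc_task resume;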

Don’ts of Snowflake Change Data Capture:

  1. Don’t Ignore Data Validation: Validate CDC data regularly to avoid errors caused by incomplete or corrupt data updates.
  2. Don’t Overload CDC Streams: Avoid running multiple streams on the same table, which can lead to inefficiencies and affect performance.
  3. Don’t Forget to Manage Stream Offsets: Ensure that stream offsets are correctly managed to avoid reprocessing or missing changes (see the monitoring sketch after this list).
  4. Don’t Overlook Data Archiving: To prevent performance degradation, regularly archive old data, especially in large transactional systems.
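
For the stream-offset point, a lightweight way to keep an eye on streams is Snowflake’s SHOW STREAMS and DESCRIBE STREAM commands, whose output reports whether a stream has gone stale:

-- lists streams in the current schema, including the STALE and STALE_AFTER columns
show streams;

-- detailed properties of a single stream, such as its source table and mode
describe stream employee_stream;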


Conclusion

Snowflake Change Data Capture has largely replaced the old ways of implementing CDC. Snowflake is already an industry-leading Cloud-Based Data Platform known for its speed and flexible warehousing options, and Snowflake CDC makes it even more worthwhile to use.

Snowflake Change Data Capture proves to be very helpful in cases where millions of records get transacted on a daily basis but you only want to update the modified ones. Doing a full load will eat up your resources and time, so just leverage the Snowflake Change Data Capture and use the MERGE command to update the destination.

Easily share and access live, governed data with Snowflake Data Exchange to boost collaboration. Learn more by exploring Snowflake Data Exchange.

However, hand-coding the CDC infrastructure can prove to be a tedious task without dedicated engineering resources. It is a lot more efficient to invest in an out-of-the-box tool like Hevo. Sign up for Hevo’s 14-day free trial to learn more.

Hevo Data, with its strong integration with 150+ sources, allows you to not only export data from sources and load it into destinations like Snowflake in real time, but also transform and enrich your data and make it analysis-ready, so that you can focus on your key business needs and perform insightful analysis using BI tools.

Frequently Asked Questions

1. Does Snowflake have change data capture?

Yes, Snowflake supports Change Data Capture (CDC) using features like Streams and Tasks, which track changes in tables and process them.

2. How does change data capture work?

Change Data Capture (CDC) identifies and tracks changes (inserts, updates, deletes) in a data source and transfers those changes to another system, ensuring data consistency across systems.

3. What is the CDC process in Snowflake?

In Snowflake, CDC is implemented using Streams to capture changes in tables and Tasks to process or move the changed data. Streams track changes in a table (since the last query) and allow for further actions based on these changes.

Raj Verma
Business Analyst, Hevo Data

Raj, a data analyst with a knack for storytelling, empowers businesses with actionable insights. His experience, from Research Analyst at Hevo to Senior Executive at Disney+ Hotstar, translates complex marketing data into strategies that drive growth. Raj's Master's degree in Design Engineering fuels his problem-solving approach to data analysis.