Snowflake Testing for Data Warehouses Simplified 101

September 30th, 2021

A Data Warehouse has become the need of the hour today because most companies are shifting to data-driven solutions. Businesses use multiple online tools and services to reach out to customers, optimize workflow, and manage other business activities. All the data generated by these platforms need to be stored in an analysis ready-form so that BI tools can consume it to generate insights. 

But how can you directly load data from data sources to Data Warehouses such as Snowflake when it’s inconsistent? Snowflake Testing is a method to test data inside the Data Warehouse for integrity, accuracy, reliability, and consistency before feeding it to other BI tools. According to the statistics by Experian Data Quality, 75% of businesses are wasting 14% of revenue due to poor data quality, and Data quality costs companies an estimated $14.2 million annually. 

Inconsistent data can generate false reports and negatively affect business decisions. To avoid all these errors, Snowflake Testing plays a vital role in delivering high-quality data. In this article, you will learn about Snowflake Data Warehouse, ETL Testing, and different methods that one can use for Snowflake Testing.

Introduction to Snowflake

Snowflake is a SaaS (Software as a Service) Data Warehouse platform, originally built on AWS (Amazon Web Services) infrastructure and now also available on Microsoft Azure and Google Cloud. It is a fully managed Cloud Data Warehouse that helps companies and organizations store data and run analytics. Users don't have to maintain hardware, apply software upgrades, or pay for hardware failures, because they only rent storage space and computational power in the Data Warehouse.

Snowflake differs from traditional Data Warehouses in that it separates storage from compute for faster query performance, and it allows users to independently scale storage and compute up or down as requirements change. Snowflake is highly scalable and can handle terabytes of data thanks to an architecture of virtual compute instances and efficient object storage that runs entirely on the Cloud.

Snowflake secures access to data staged in cloud storage through Amazon S3 policy controls, Azure SAS tokens, and Google Cloud Storage access permissions, and supports SSO for user authentication. Internally, it stores table data in the object storage of the cloud platform it runs on (Amazon S3, Azure Blob Storage, or Google Cloud Storage) and can hold structured and semi-structured data, including data migrated from on-premise systems.

Key Features of Snowflake

Snowflake provides Data Warehouse services that make enterprise data available in analysis-ready form. It is highly scalable and well suited to running Business Analytics. A few key features of Snowflake are listed below.

  • Integrations: Snowflake supports integration with many 3rd-party tools and services that companies rely on every day. This makes it easy to sync data between a source and the Data Warehouse and feed it to BI (Business Intelligence) tools.
  • Security: Communication between the client and the server is always protected using TLS (Transport Layer Security). Snowflake also allows users to select the geolocation where their data is stored.
  • Standard SQL: Snowflake supports most standard SQL DDL and DML, along with advanced analytical extensions; a small example of one such extension is sketched below.
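
As an illustration of such an analytical extension, here is a minimal, hedged sketch that uses Snowflake's QUALIFY clause with a window function to keep only the most recent record per customer; the table and column names (CUSTOMER_ORDERS, CUSTOMER_ID, ORDER_TS) are hypothetical.

Snowflake:
-- Keep only the latest order per customer; QUALIFY filters on the window function result
SELECT *
FROM CUSTOMER_ORDERS
QUALIFY ROW_NUMBER() OVER (PARTITION BY CUSTOMER_ID ORDER BY ORDER_TS DESC) = 1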

To learn more about Snowflake, click here.

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 100+ data sources and takes just 3 steps: select the data source, provide valid credentials, and choose the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without your having to write a single line of code.

Get Started with Hevo for Free

Its completely automated pipeline delivers data in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that data is handled securely and consistently with zero data loss, and it supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  1. Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  2. Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  3. Minimal Learning: Hevo, with its simple and interactive UI, is extremely easy for new customers to work with and perform operations on.
  4. Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  5. Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  6. Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
  7. Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

Introduction to ETL Testing

ETL is the process of Extracting data from data sources, Transforming it as per business requirements, and then Loading it into the Data Warehouse that serves as the destination. While transferring or migrating data from a source to a destination such as Snowflake or Google BigQuery, errors commonly occur due to inconsistencies in the data. ETL Testing is a process that enables users to test or validate the ETL process by comparing data in the source and the destination. It is a kind of black-box testing that ensures all the data from the source is migrated to the Data Warehouse without data loss or data duplication. ETL Testing consists of different phases, as shown in the image below.

Snowflake Testing - Different Phases of ETL Testing

Unlike many other testing processes, ETL Testing requires knowledge of the business requirements; this information is used to build the test strategy and estimate the time needed for the entire ETL Testing effort. Finally, a detailed report is created after all phases of ETL Testing have been executed.

Introduction to Snowflake Testing

Snowflake stores data from multiple data sources, and inconsistencies in that data are inevitable, which can result in data loss during loading. An enterprise cannot afford to ignore data loss because the data is valuable to it. Snowflake Testing is the process companies follow to avoid data loss and maintain data integrity.

These huge chunks of data have to be accurate before they are fed to BI (Business Intelligence) tools to obtain error-free analytics. ETL processes are often complex and error-prone, and transferring or migrating data to Snowflake in such an environment without testing is time-consuming, costly, and ineffective because data integrity cannot be maintained.

Snowflake Testing Process

The shift-left approach to Snowflake Testing is based on constant testing between each of the ETL data transfer steps. There are many tools available in the market for automated Snowflake Testing, designed to provide detailed test coverage during the process. Snowflake ETL Testing compares the data in a data source with the data loaded into Snowflake, including row-by-row comparisons to make sure all the data loads correctly. In this section, you will learn about different methods for Snowflake Testing, using Teradata as the example data source and Snowflake as the destination. The methods are listed below:

1) Testing the Database Schema

The first step is to transfer database objects from the data source to the Snowflake Data Warehouse. This moves Schemas, Tables, Views, Sequences, and other objects supported by Snowflake.

After a successful schema transfer, you can run tests to detect any missing columns or incorrectly mapped data types between the data source and Snowflake.

Schema Compare Test

In this Snowflake Testing phase, you need to compare Schema names, Table names, View names, Column names, and the corresponding data types. While creating these tests, make sure the data type mapping between the data source and Snowflake is correct.

You can query the system tables of the data source and of Snowflake to compare the names of Schemas, Columns, Views, and so on. Depending on how you mapped data into the Data Warehouse during the ETL process, you may need to adjust the data source queries in this Snowflake Testing step. Code for comparing schemas is given below.

Teradata:
SELECT
UPPER(TableName),
UPPER(ColumnName),
CASE
WHEN ColumnType = 'CF' AND CharType = 1 THEN 'TEXT'
WHEN ColumnType = 'CV' THEN 'TEXT'
WHEN ColumnType = 'CO' THEN 'TEXT'
WHEN ColumnType = 'I1' THEN 'NUMBER'
WHEN ColumnType = 'I2' THEN 'NUMBER'
WHEN ColumnType = 'I8' THEN 'NUMBER'
WHEN ColumnType = 'I' THEN 'NUMBER'
WHEN ColumnType = 'D' THEN 'NUMBER'
WHEN ColumnType = 'F' THEN 'FLOAT'
WHEN ColumnType = 'N' THEN 'NUMBER'
WHEN ColumnType = 'DA' THEN 'DATE'
WHEN ColumnType = 'AT' THEN 'TIME'
WHEN ColumnType = 'SZ' THEN 'TIMESTAMP WITH TIME ZONE'
WHEN ColumnType = 'TS' THEN 'TIMESTAMP_NTZ'
WHEN ColumnType = 'TZ' THEN 'TIME WITH TIME ZONE'
WHEN ColumnType = 'JN' THEN 'VARIANT'
WHEN ColumnType = 'A1' THEN 'ARRAY'
WHEN ColumnType = 'AN' THEN 'ARRAY'
WHEN ColumnType = 'CF' AND CharType = 4 THEN 'BINARY'
WHEN ColumnType = 'BO' THEN 'BINARY'
WHEN ColumnType = 'BF' THEN 'BINARY'
WHEN ColumnType = 'BV' THEN 'BINARY'
END AS equivalent_snowflake_datatype
FROM DBC.ColumnsV
Snowflake:
SELECT
UPPER(t.table_name) table_name,
UPPER(c.column_name),
c.data_type
FROM
information_schema.tables t
INNER JOIN information_schema.columns c ON (t.table_name = c.table_name)
WHERE t.table_name = 'SAMPLE_DATATYPE_TABLE' AND t.table_schema = 'PUBLIC'
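
Once both queries have run, one simple way to compare their results is to export the Teradata output (for example, as a CSV) and load it into a small Snowflake table, then use set operators to find mismatches. The sketch below assumes a hypothetical staging table TD_SCHEMA_EXPORT(TABLE_NAME, COLUMN_NAME, DATA_TYPE) holding the Teradata result; swapping the two operands finds columns present in Snowflake but missing from the source.

Snowflake:
-- Columns that exist in the source export but are missing or typed differently in Snowflake
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE FROM TD_SCHEMA_EXPORT
MINUS
SELECT UPPER(t.table_name), UPPER(c.column_name), c.data_type
FROM information_schema.tables t
INNER JOIN information_schema.columns c ON (t.table_name = c.table_name)
WHERE t.table_schema = 'PUBLIC'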

2) Testing Initial Data Loads

This Snowflake Testing step involves a one-time load of historical data from data sources into Amazon S3, Azure Blob Storage, or Google Cloud Storage, and then into Snowflake. In this phase, you need to detect any data truncation, transformation, or other data issues that occur while loading. Historical data contains many rows, which makes the dataset huge, so this Snowflake Testing stage requires different tests to balance timely execution and thorough coverage.

Row Count Test

This test compares the number of rows available in the data source and in Snowflake. It is a basic test, but it provides information about missing data. Code for the Row Count Test is given below.

Teradata:
SELECT CAST(COUNT(*) AS BIGINT) AS SOURCE_COUNT FROM TABLE
Snowflake:
SELECT COUNT(*) AS TARGET_COUNT FROM TABLE

Data Compare Test

This phase is an essential part of Snowflake Testing because it compares data row by row and column by column between the data source and Snowflake. It detects the exact data issues that occurred during the ETL phase, and users can fix those issues before resuming data loading; a minimal comparison sketch is given after the list of issues below.

It helps to identify issues such as the following:

  • Data Truncation
  • Floating-point issues
  • Incorrect DateTime
  • Special character conversion issues
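
A minimal sketch of such a row-by-row comparison is given below. It assumes the source rows have already been staged into Snowflake in a hypothetical table SOURCE_STAGING_TABLE with the same columns as the hypothetical target table TARGET_TABLE; if both queries return zero rows, the two tables match row for row.

Snowflake:
-- Rows present in the staged source copy but missing (or different) in the target
SELECT * FROM SOURCE_STAGING_TABLE
EXCEPT
SELECT * FROM TARGET_TABLE;
-- Rows present in the target but missing (or different) in the staged source copy
SELECT * FROM TARGET_TABLE
EXCEPT
SELECT * FROM SOURCE_STAGING_TABLE;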

Data Aggregation Test

This Snowflake Testing phase is suitable for numeric data. Instead of comparing data row by row, you can run aggregate tests by summing up a column in both the data source and Snowflake and comparing the values. If the test fails, you then fall back to a row-by-row comparison for the faulty columns. Code for performing Data Aggregation Tests in Teradata and Snowflake is given below.

Teradata:
SELECT
1 DUMMY_ID,
SUM(CAST(INT_COLUMN AS DECIMAL(38,0))) AS INT_COLUMN,
SUM(CAST(FLOAT_COLUMN AS DECIMAL(38,0))) AS FLOAT_COLUMN,
SUM(CAST(NUMERIC_COLUMN AS DECIMAL(38,0))) AS NUMERIC_COLUMN
...
FROM TABLE
Snowflake:
SELECT
1 DUMMY_ID,
SUM(INT_COLUMN) AS INT_COLUMN,
SUM(FLOAT_COLUMN) AS FLOAT_COLUMN,
SUM(NUMERIC_COLUMN) AS NUMERIC_COLUMN
...
FROM TABLE

3) Testing Delta Loads

Delta Loads are the incremental (for example, daily) transactions that flow from the data source to the Data Warehouse. It is important to treat these loads separately, and you need to ensure that the delta loads stay in sync while doing Snowflake Testing.

Data Compare Test

In this phase of Snowflake Testing, you create tests that compare the delta loads and detect data issues continuously until the original data source is switched off; a hedged sketch of such a comparison is given below.
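
As a minimal sketch, the queries below compare the size of the latest delta on both sides, assuming a hypothetical watermark column LOAD_DATE exists in both systems; the same pattern can be extended with the aggregate tests shown earlier.

Teradata:
SELECT CAST(COUNT(*) AS BIGINT) AS SOURCE_DELTA_COUNT
FROM TABLE
WHERE LOAD_DATE = CURRENT_DATE
Snowflake:
SELECT COUNT(*) AS TARGET_DELTA_COUNT
FROM TABLE
WHERE LOAD_DATE = CURRENT_DATE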

Data Validation Test

Real-time data loads often don't have primary key constraints, so you also have to check for duplicate rows in the data. This Snowflake Testing step helps you quickly identify issues in the upstream processes. In the query below, replace COLUMN_NAME with the business key column(s) that should uniquely identify a row.

Snowflake:
SELECT
COLUMN_NAME,
COUNT(*) DUPLICATE_ROWS
FROM TABLE
GROUP BY COLUMN_NAME
HAVING COUNT(*) > 1

4) Testing Downstream Systems

It is essential to maintain a uniform data flow to the ETL tools and BI tools that consume data from both the original data source and Snowflake.

Data Compare Test

This phase of Snowflake Testing ensures that downstream applications deliver the same output whether the data comes from the original data source or from Snowflake, so you need to run Data Compare Tests on their outputs.

Teradata:
SELECT * FROM TABLE
Snowflake:
SELECT * FROM TABLE

The above queries return all the rows of the table, and you can then perform a row-by-row comparison manually; a hedged, more automated alternative is sketched below.
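
If comparing full outputs manually is impractical, one alternative is to compare aggregate hashes. Snowflake provides the HASH_AGG function, which returns a single hash value computed over all input rows; the sketch below assumes both downstream outputs have been materialized as hypothetical Snowflake tables OUTPUT_FROM_SOURCE and OUTPUT_FROM_SNOWFLAKE, and equal hash values strongly indicate the two outputs contain the same rows.

Snowflake:
-- Compare an aggregate hash of each downstream output; matching hashes indicate matching row sets
SELECT
(SELECT HASH_AGG(*) FROM OUTPUT_FROM_SOURCE) AS source_output_hash,
(SELECT HASH_AGG(*) FROM OUTPUT_FROM_SNOWFLAKE) AS snowflake_output_hash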

Challenges of Snowflake Testing

Snowflake Testing is a time-consuming process, and a single error can cost companies significant data loss. Users face many challenges in Snowflake Testing; a few of them are listed below:

  • ETL Testing is a type of black-box testing that validates behavior without looking at the internal structure, so there are no specific rules, and there is usually no graphical user interface for an easy workflow.
  • Comparing large datasets manually is time-consuming and highly error-prone; even comparing small chunks of data can introduce human errors.
  • A company typically uses many data sources, and reaching out to all of them manually and comparing their data with Snowflake is a tedious process.

Conclusion

In this article, you learnt about ETL Testing and why it is needed in the first place to obtain high-quality data. You also read about different Snowflake Testing techniques and how Snowflake Testing maintains data integrity. Performing Snowflake Testing manually is time-consuming, but there are many ETL Testing tools available in the market that automate the process for you. And while Snowflake can load both structured and semi-structured data natively, that alone does not eliminate the need for Snowflake Testing.

Visit our Website to Explore Hevo

Companies have business data spread across multiple sources, and loading it manually from those sources into Snowflake is a tedious process. Hevo Data is a No-code Data Pipeline that can help you transfer data from any data source to Snowflake. It fully automates the process of loading and transforming data from 100+ sources to a destination of your choice without writing a single line of code.

Want to take Hevo for a spin? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Share your experience of learning about Snowflake Testing in the comments section below!
