Snowflake is a modern, easy-to-use analytics Data Warehouse designed for the cloud. Built on top of cloud infrastructure such as AWS, Snowflake’s unique architecture uses a SQL Database engine, which makes it fast and flexible. One of the biggest challenges in setting up the Snowflake Data Warehouse is bringing real-time data from all the different applications into Snowflake.
Snowflake is a widely used Data Warehouse in which companies store business data to analyze it and generate insights. Because the raw data in different sources is heterogeneous, a Snowflake ETL process needs to be followed to load the data into the Data Warehouse in a common format.
The Snowflake ETL process needs to be planned and executed with some essential points in mind to complete the whole process efficiently. This article covers vital Snowflake ETL best practices to follow while migrating data to the Snowflake Cloud Data Warehouse.
What is Snowflake?
Snowflake is a Cloud Data Warehouse that comes with a fast, reliable, and robust engine for storing and querying data. It is built on AWS (Amazon Web Services), Microsoft Azure, and Google infrastructure. It is a fully managed Cloud Data Warehouse that helps companies and organizations store data and run Analytics.
Apart from fast query performance, it also offers data security using Amazon S3 Policy Controls, SSO, Azure SAS Tokens, and Google Cloud Storage access permissions. It is highly scalable and can easily handle terabytes of data. Users can easily load data from multiple sources into the Warehouse using the Snowflake ETL process.
Key Features of Snowflake
Some of the main features of Snowflake are listed below:
- Standard SQL Support: Snowflake allows its users to use the standard SQL language to query data from the Warehouse. Also, it comes with advanced DML support for the analytical extension.
- Fault-Tolerant: Snowflake ensures data protection by keeping the backup of data to avoid any data loss during the system failure.
- Integrations: Snowflake supports many integrations with 3rd party tools and services for easy data access.
What is Snowflake ETL?
Snowflake ETL is the process of Extracting data from data sources, Transforming it into compatible formats, and Loading it into the Data Warehouse. With the help of the Snowflake ETL process, companies can easily transfer their business data from SaaS applications, Databases, and other data sources into Snowflake.
Hevo Data, a Fully-managed Snowflake ETL/ELT solution, can help you automate, simplify & enrich your Analytics application process within a matter of a few clicks. With Hevo’s support for a wide range of connectors and quick Data Pipelines, you can extract, load, and transform data from 150+ Data Sources (including 40+ free sources) straight into your Data Warehouse, Database, or any destination of your choice. To further streamline and prepare your data for analysis, you can leverage your Snowflake ETL solution without writing a single line of code!
Get Started with Hevo for Free
“With Hevo in place, you can reduce your Data Extraction, Transformation, Load, and Enrichment time & effort by many folds! In addition, Hevo’s pre-built integrations with various Business Intelligence & Analytics Tools such as Power BI, Tableau, and Looker will enhance your reporting & dashboarding experience, allow you to perform Predictive Analytics, and help you gain actionable insights with ease!”
Experience an entirely automated hassle-free No-code Snowflake ETL Solution. Try our 14-day full access free trial today!
Top 7 Snowflake ETL Best Practices
Here are some of the best practices that you can follow while performing the Snowflake ETL process:
Data Warehouse Considerations
Before you jump into optimizing for Snowflake ETL, you have to understand the levers that matter. Performing the Snowflake ETL process manually is time-consuming and cumbersome because Snowflake doesn’t provide exclusive options for transformation. Automated Data Pipelines provide an easy way to transform your data, and No-Code Data Pipelining tools such as Hevo can do this job efficiently.
1. Understanding Credit Charges on Snowflake
Snowflake offers pay-per-second billing across warehouses of different sizes (Large, X-Large, 2X-Large, etc.). Credit charges are calculated based on:
- The number of Clusters (if using a multi-cluster warehouse)
- The number of servers per cluster (a derivative of Data Warehouse size)
- The time duration that each server in each cluster runs for
Understanding this gives you the flexibility to run as many clusters as you want and suspend them when they are not in use. This, in turn, helps you act smartly and save costs while performing the Snowflake ETL process. Try out combinations of different query types and warehouse sizes to arrive at the right combination for your workloads.
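As a concrete sketch of suspending idle compute, the statement below creates a warehouse that pauses itself when not in use. The warehouse name `load_wh` and the size are placeholders you would adapt to your workload.

```sql
-- Illustrative warehouse definition; load_wh and the size are placeholders.
CREATE WAREHOUSE IF NOT EXISTS load_wh
  WAREHOUSE_SIZE = 'MEDIUM'  -- resize per workload to balance speed vs. credits
  AUTO_SUSPEND   = 60        -- suspend after 60 seconds idle to stop credit usage
  AUTO_RESUME    = TRUE;     -- resume automatically when the next query arrives
```

With `AUTO_SUSPEND` and `AUTO_RESUME` set, you pay only for the seconds the warehouse actually runs.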
2. Impact of Query Composition on Snowflake
As in most Data Warehouses, the size and complexity of a query determine the number of servers needed to process it. For complex queries, the number of rows has less impact than the overall data size. The number of table joins, filtering using predicates, and similar factors all affect query processing, so use them carefully.
3. Impact of Caching on Queries
Caching is standard practice in Data Warehousing because it improves Warehouse performance: subsequent queries can read from the cache instead of the source tables.
While you consider suspending a warehouse to save credits, remember that you would be letting go of the data in the cache too. This will have an impact on query performance.
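The trade-off above can be exercised directly. The sketch below (the warehouse name `load_wh` is a placeholder) suspends a warehouse to stop credit consumption; note that resuming it brings back compute but with a cold cache, so the first queries afterward may run slower.

```sql
-- Suspending stops credit consumption but drops the warehouse cache.
ALTER WAREHOUSE load_wh SUSPEND;
-- After resuming, queries run against a cold cache until it warms up again.
ALTER WAREHOUSE load_wh RESUME;
```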
Table Design Considerations
While designing your tables in Snowflake, you can take care of the following pointers for efficiency:
1. Referential Integrity Constraints
It is recommended to implement referential integrity constraints in Snowflake. They provide valuable metadata that users can use to understand the schema and the relationships defined between tables.
In Snowflake, referential integrity constraints are not enforced; when created, they are disabled. NOT NULL is an exception and is enforced by default.
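The behavior described above can be sketched as follows; the table and column names are illustrative only.

```sql
-- Illustrative tables: the FOREIGN KEY is recorded as metadata (useful for
-- documenting the schema and for BI tools) but is NOT enforced by Snowflake.
-- NOT NULL, by contrast, IS enforced.
CREATE TABLE customers (
  customer_id NUMBER NOT NULL PRIMARY KEY  -- PRIMARY KEY also metadata-only
);

CREATE TABLE orders (
  order_id    NUMBER NOT NULL,             -- enforced: NULLs are rejected
  customer_id NUMBER,
  CONSTRAINT fk_customer FOREIGN KEY (customer_id)
    REFERENCES customers (customer_id)     -- metadata only, not enforced
);
```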
2. Utilizing Clustering Keys
For Big Data sets, clustering is a good practice and helps improve query performance.
Snowflake provides automatic tuning and micro-partitioning. Most often, users load data into the Data Warehouse using the Snowflake ETL process, where it is organized into micro-partitions by timestamp or date, and then query it along the same dimension.
Clustering Keys can be Helpful in the Following Scenarios
- Consider an example where the Snowflake ETL process loads data into the Warehouse by timestamp, but the data is queried by ID. In cases like this, where data is queried along a dimension different from the one it was loaded on, clustering will be helpful.
- If you find through Query Profiling that a significant amount of the total query time is spent just scanning the table, clustering can help. This often applies to queries that filter on many columns.
Here are a Few Things to Note when Clustering Data
- One will have to manually run a DML statement to recluster a table.
- The existing data will be re-written in a different order upon reclustering. Snowflake saves the previous order for 7 days in order to provide Fail-safe protection, one of Snowflake’s Cloud Data Warehouse’s most lauded features.
- Reclustering a table on Snowflake costs additional dollars. This is directly proportional to the size of the data set that is re-ordered.
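A minimal sketch of defining and inspecting a clustering key is shown below; the table name `events` and the column `user_id` are hypothetical stand-ins for a column your queries frequently filter on.

```sql
-- Define a clustering key on a column commonly used in query filters.
ALTER TABLE events CLUSTER BY (user_id);

-- Inspect how well the table's micro-partitions are clustered on that key;
-- returns a JSON report including average depth and overlap statistics.
SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(user_id)');
```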
Data Storage Considerations
Snowflake provides an array of features for data that is stored. Continuous Data Protection (CDP), which includes Fail-safe and Time Travel, is given to all Snowflake accounts at no additional cost. That does not mean CDP has no impact on your storage costs; on the contrary, it does.
Note that your account is charged for all the data stored in schemas, tables, and databases created in your Snowflake Architecture. Based on the data stored and the duration for which it is stored, CDP has an impact on storage costs.
Until data leaves the Fail-safe state, storage costs are incurred. This means you pay for data storage whether it is in the Active, Time Travel, or Fail-safe state. Hence, it is important to make the right storage considerations.
1. Staged File Storage While Loading Data
To assist in loading bulk data into tables during the Snowflake ETL process, Snowflake has a feature called Stages, where files containing the data to be loaded are staged. Snowflake allows both internal (within Snowflake) and external (S3, Azure) stages. No additional cost is charged for Time Travel and Fail-safe features for data stored in internal stages within Snowflake; however, standard data storage costs apply. For bulk data, you can utilize this feature.
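As a sketch, a named internal stage can be created and populated like this; the stage name `my_stage` and the local file path are placeholders, and `PUT` must be run from a client such as SnowSQL rather than the web UI.

```sql
-- Create a named internal stage for bulk loading (name is a placeholder).
CREATE STAGE IF NOT EXISTS my_stage;

-- Upload a local file into the stage; AUTO_COMPRESS gzips it in transit.
PUT file:///tmp/data.csv @my_stage AUTO_COMPRESS = TRUE;
```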
2. Cloning Tables, Schemas, and Databases
Snowflake has a zero-copy cloning feature that gives an easy way to take a “snapshot” of any schema, table, or database. This feature creates a derived copy of that object which initially shares the underlying storage. This can come in handy when creating instant backups. This does not incur any additional costs as long as you do not need to make any changes to the cloned object.
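The cloning described above is a single statement. In this sketch, `sales` and `analytics` are hypothetical object names.

```sql
-- Zero-copy clone: the clone shares underlying storage with the source,
-- so no extra storage cost accrues until either side is modified.
CREATE TABLE sales_backup CLONE sales;

-- Schemas (and whole databases) can be cloned the same way, e.g. to spin
-- up an instant dev copy for testing.
CREATE SCHEMA analytics_dev CLONE analytics;
```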
Data Loading Considerations
The Snowflake ETL process consists of transforming raw data into the format of the Data Warehouse. Snowflake Architecture can make loading data time-consuming for users due to data-mapping complexity; Hevo provides auto-mapping features for a seamless Snowflake ETL process.
1. Preparing Your Data Files
The first stage of the Snowflake ETL process is preparing the data files that you want to load into the Data Warehouse.
2. General File Sizing Recommendations
While performing Snowflake ETL, it is important to optimize the number of parallel loads into Snowflake. It is recommended to create compressed data files that are roughly 10 MB to 100 MB in size.
Aggregate the smaller files to reduce processing overhead. Split the large files into a number of smaller files for faster load during Snowflake ETL. This allows you to distribute the load between servers in the active Snowflake warehouse.
3. Data Size Limitations for Semi-Structured Data
The VARIANT data type has a 16 MB (compressed) size limit on the individual rows.
JSON and Avro are the most commonly used data formats, and both are a concatenation of many documents. The source software that produces JSON or Avro output often delivers it as a single huge array containing multiple records. Both line breaks and commas are supported for document separation.
For efficiency, it is recommended to enable the STRIP_OUTER_ARRAY file format option when executing the COPY INTO <table> command. This removes the outer array structure and loads the records into separate table rows. Below is an example:
COPY INTO <table_name>
FROM @<stage_name>
file_format = (type = 'JSON' strip_outer_array = true);
4. Data Size Limitations of Parquet files
It is recommended to split Parquet files greater than 3 GB in size into smaller files of 1 GB or less for smooth Snowflake ETL. This ensures that loading does not time out.
5. Preparing Delimited Text Files
The following points must be considered while preparing CSV/Delimited text files for Snowflake ETL:
- Files should use a supported character encoding; the default character set is UTF-8. Additional encodings can be specified using the ENCODING file format option.
- Within the files, records and fields should be delimited by different characters. Note that each delimiter should be a single character, and the two must not be the same. Pipe (|), caret (^), comma (,), and tilde (~) are common field delimiters. Often the line feed (\n) is used as the row delimiter.
- Fields that contain delimiters should be enclosed in single or double quotes. If the data being loaded contains quotes, those must be escaped.
- Fields that contain carriage returns (\r\n) should be enclosed in single or double quotes too. On Windows systems, a carriage return is commonly introduced along with a line feed character to mark the end of a line.
- Each row should have the same number of columns.
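The points above can be captured in a reusable named file format, as in this sketch; the format name and the specific delimiter choices are examples, not requirements.

```sql
-- Example file format encoding the CSV recommendations above; all values
-- here are illustrative choices, not Snowflake defaults.
CREATE FILE FORMAT IF NOT EXISTS my_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'                 -- single character, distinct from...
  RECORD_DELIMITER = '\n'               -- ...the record delimiter
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'    -- protects embedded delimiters/newlines
  ESCAPE = '\\'                         -- escape character for embedded quotes
  ENCODING = 'UTF8';                    -- the default, stated explicitly
```

A COPY statement can then reference it with `FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')`.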
Setting up a manual Snowflake ETL solution is no simple task, and having an efficient automated solution is the need of the hour! This problem and many more can be solved seamlessly using Hevo’s No-Code Data Pipeline.
This Snowflake ETL Tool lets you connect to a Data Source of your choice and leverage Hevo’s blazing-fast Data Pipelines to seamlessly extract, transform, and load data, building your Snowflake into a Single Source of Truth in a matter of minutes. All this without writing any code!
Check out why Hevo is the Best:
- Integrations: Hevo’s fault-tolerant Data Pipeline offers you a secure option to unify data from 150+ sources (including 40+ free sources) and store it in Snowflake or any other Data Warehouse of your choice. This way you can focus more on your key business activities and let Hevo take full charge of the Data Transfer process.
- Auto Schema Mapping: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Quick Setup: Hevo with its automated features, can be set up in minimal time. Moreover, with its simple and interactive UI, it is extremely easy for new customers to work on and perform operations.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Try our 14-day free trial & set up your Snowflake ETL with ease!
Load Data to Snowflake with Hevo for Free
Planning a Data Load
Optimizing how large data sets are loaded goes a long way in optimizing the overall Snowflake ETL process. When dealing with large data sets, it is advised to dedicate separate Snowflake clusters to loading and querying operations, which helps optimize the performance of both.
A standard Virtual Warehouse is enough for Snowflake ETL as loading requires fewer resources. Based on the speed at which you want to load data, you can choose the size of the Data Warehouse. Remember to split large data files for faster loading during the Snowflake ETL process.
Data Staging Considerations
Both Snowflake and your data source (Azure/S3) allow stage references via paths. While performing Snowflake ETL, it is a good practice to stage regular data sets by partitioning them into logical paths. This could include details such as source identifiers or geographical location, etc., along with the date when the data was written.
This exercise will provide you the flexibility to copy files by path using a single command. This will allow you to take advantage of Snowflake’s parallel operations by letting you execute concurrent COPY statements that match a subset of files.
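For example, with files staged under logical paths, a single COPY can target just one slice of the data. The stage, path, and table names below are hypothetical.

```sql
-- Files were staged under paths like @my_stage/us-east/2024/06/...
-- One COPY statement loads just that region/month; other paths are untouched.
COPY INTO daily_sales
FROM @my_stage/us-east/2024/06/
FILE_FORMAT = (TYPE = 'CSV');
```

Several such COPY statements over disjoint paths can run concurrently, exploiting Snowflake’s parallelism.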
Data is loaded into Snowflake using the COPY INTO <table> command. Understanding the levers of this command can go a long way in helping you optimize the Snowflake ETL process. As an alternative to this time-consuming process, you can also use the No Code Data Pipeline tool such as Hevo to automate the whole process.
The COPY command loads data from staged files into an existing table. The possible staging locations are as follows:
- An internal stage (table/user stage); the PUT command can be used to stage files.
- A named external stage that references an AWS S3 bucket or Microsoft Azure container.
- An external location (AWS S3 bucket or Microsoft Azure container) referenced directly in the COPY statement.
Options for Selecting Staged Data Files:
The COPY command accepts several options for Snowflake ETL loading data from a stage. This will enable you to bring only a fraction of data that is staged into Snowflake. Here are some options:
- Specify internal stages by path
- Use a prefix to load data from an Amazon S3 bucket
- Specify particular files to load
- Load data from specific files using pattern matching
For details on the implementation of each of these, refer to Snowflake documentation.
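As a sketch of the last two options, the statements below load an explicit file list and a regular-expression match, respectively; `events`, `my_stage`, and the file names are placeholders.

```sql
-- Load only the named files from the stage.
COPY INTO events
FROM @my_stage
FILES = ('day1.csv', 'day2.csv');

-- Or select files whose names match a regular-expression pattern.
COPY INTO events
FROM @my_stage
PATTERN = '.*sales_.*[.]csv';
```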
Note that, as a general practice while performing data loads for Snowflake ETL, it is important to partition the data in your Snowflake internal stages or external locations (such as S3 buckets or Azure containers) using logical, granular paths. By creating partitions from details like location or application, along with the date the data was written, you are optimizing for later data-loading activity.
When you load the data using Snowflake ETL, you can simply copy any fraction of the partitioned data into Snowflake with a single command. You can copy data into Snowflake by the hour, day, month, or even year when you initially populate tables.
In this article, you read about Snowflake ETL and how it helps users load data from data sources into a Data Warehouse, along with some of the best practices to follow while performing the Snowflake ETL process. The Snowflake ETL process can be cumbersome when it comes to manually transforming data and monitoring data loads.
Hevo offers an entirely Automated Hassle-Free Data Integration with 150+ Data Sources (including 40+ free sources) to deliver your data in real-time without any loss from source to Snowflake. It empowers you to leverage & mine your extracted data to run Predictive Analytics and gain actionable insights with ease!
In addition to transferring data, you can also build aggregates, joins, and data models on Snowflake to create materialized views that enable faster query processing. Hevo integrates with a variety of data sources ranging from SQL, NoSQL, SaaS, File Storage, Webhooks, etc., with the click of a button. Hevo can help you seamlessly Extract, Load, and Transform data from 150+ data sources to Snowflake with a No-code, easy-to-setup interface. Try our 14-day full access free trial!
Get Started with Hevo for Free
Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.
Share your experience of understanding Snowflake ETL Best Practices in the comment section below! We would love to hear your thoughts.