Snowflake is a modern, easy-to-use analytics Data Warehouse designed for the cloud. Built on top of AWS, Microsoft Azure, and Google Cloud, Snowflake’s unique architecture uses a SQL database engine, making it fast and flexible. One of the biggest challenges in setting up the Snowflake Data Warehouse is bringing real-time data from all the different applications into Snowflake.

Snowflake is a widely used data warehouse where companies store business data to analyze and generate insights. The raw data from different sources is heterogeneous, and the Snowflake ETL process needs to be followed to load data into the Data Warehouse in a standard format.

The Snowflake ETL process needs to be planned and executed, and some essential points need to be considered to complete it efficiently. This article covers the best practices for migrating data to the Snowflake Cloud Data Warehouse.

What is Snowflake ETL?


Snowflake is a Cloud Data Warehouse with a fast, reliable, and robust engine for storing and querying data. It is built on AWS (Amazon Web Services), Microsoft Azure, and Google Cloud infrastructure. It is a fully managed Cloud Data Warehouse that helps companies and organizations store data and run analytics.

Snowflake ETL is the process of Extracting data from data sources, Transforming it into compatible formats, and Loading it into the Data Warehouse. With the help of these processes, companies can easily transfer their business data from SaaS applications, Databases, and other data sources into Snowflake.

Enhancing Snowflake ETL Performance with Hevo

Hevo is designed to empower businesses by automating data workflows. With its integration capabilities for Snowflake, Hevo not only simplifies data loading but also enhances overall ETL performance, making data management hassle-free. Here is why Hevo is the best choice:

  • Cost-Effective Pricing: Transparent pricing with no hidden fees, helping you budget effectively while scaling your data integration needs.
  • Minimal Learning Curve: Hevo’s simple, interactive UI makes it easy for new users to get started and perform operations.
  • Schema Management: Hevo eliminates the tedious task of schema management by automatically detecting and mapping incoming data to the destination schema.

Why Do You Need a Dedicated ETL Tool with Snowflake?

While Snowflake offers robust data storage and analytics capabilities, utilizing a dedicated ETL (Extract, Transform, Load) tool can significantly enhance your data integration and management processes. Here are several key reasons why you should consider using a separate ETL tool with Snowflake:

  1. Streamlined Data Integration: ETL tools simplify how data is extracted from different sources, transformed into a usable format, and loaded into Snowflake. This saves a tremendous amount of manual effort and makes the integration process faster and more efficient.
  2. Complex Transformations: Dedicated ETL tools offer strong transformation capabilities. They help you clean, enrich, and shape data before loading it into Snowflake, ensuring the data is ready for analytics.
  3. Automation and Scheduling: ETL tools provide automation features for scheduling routine data loads, so you do not need manual intervention to keep your Snowflake environment updated with the latest data.
  4. Error Handling and Logging: A dedicated ETL tool provides robust error handling and logging, which helps you quickly identify and resolve problems during data extraction and loading. This reduces downtime and ensures data accuracy.
  5. Support for Diverse Data Sources: ETL tools can connect to virtually any data source, including databases, APIs, cloud storage, and more. This makes it easy to bring diverse datasets into Snowflake and gives you a holistic view of your data.
  6. Scalability: As your data volumes grow, a dedicated ETL tool scales with your needs, handling larger datasets and more complex transformations without compromising performance.
  7. User-Friendly Interfaces: Many ETL tools offer intuitive, visual interfaces for building workflows, which allows non-technical team members to participate in ETL processes.
  8. Cost Efficiency: Snowflake charges based on usage, so an optimized ETL process helps contain costs by minimizing data transfers and storage. ETL tools help you design efficient data workflows that reduce overall costs.

Top 7 Snowflake ETL Best Practices

Here are some of the best practices that you can follow:

Data Warehouse Considerations

Before you jump into optimizing for Snowflake ETL, you need to understand the levers that matter. Performing these processes manually is time-consuming and cumbersome because Snowflake does not provide dedicated transformation options. Automated Data Pipelines provide an easy way to transform your data, and No-Code Data Pipelining tools such as Hevo can do this job efficiently.

1. Understanding Credit Charges on Snowflake

Snowflake offers pay-per-second billing, and warehouses come in different sizes (Large, X-Large, 2X-Large, etc.). Credit charges are calculated based on:

  • The number of Clusters (if using a multi-cluster warehouse)
  • The number of servers per cluster (a derivative of Data Warehouse size)
  • The length of time each server in each cluster runs for

Understanding this gives you the flexibility to run as many clusters as you want and to suspend them when they are not needed, which helps you act smartly and save costs while performing the Snowflake ETL process. Experiment with different query types and warehouse sizes to arrive at the right combination for your workloads.
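As an illustration of how these levers translate into SQL, below is a minimal sketch using a hypothetical warehouse named etl_wh (the name and settings are assumptions, not from the original article). It creates a small warehouse that suspends itself after five minutes of inactivity, resizes it for a heavier load, and suspends it explicitly when the work is done:

CREATE WAREHOUSE IF NOT EXISTS etl_wh        -- hypothetical warehouse name
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 300                         -- suspend after 5 minutes idle to stop credit usage
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'MEDIUM';   -- resize for a heavier ETL run
ALTER WAREHOUSE etl_wh SUSPEND;                         -- suspend explicitly once the load finishes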

2. Impact of Query Composition on Snowflake

Like in most Data Warehouses, the size and complexity of the query determine the number of servers needed to process it. For complex queries, the number of rows has less impact than the overall data size. The number of table joins, filtering using predicates, and similar factors also affect query processing, so use them carefully.

3. Impact of Caching on Queries

Caching is standard practice in Data Warehousing: it improves warehouse performance by letting subsequent queries read from the cache instead of the source tables.

When you consider suspending a warehouse to save credits, remember that you will also let go of the data in its cache, which will have an impact on query performance.
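For example, rather than suspending a reporting warehouse immediately, you might lengthen its auto-suspend window so that the warehouse cache stays warm between related queries. A minimal sketch, assuming a hypothetical warehouse named reporting_wh:

ALTER WAREHOUSE reporting_wh SET AUTO_SUSPEND = 600;   -- keep the warehouse (and its cache) alive through 10 minutes of inactivity

The right value is a trade-off between credit savings and cache reuse for your particular workload.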

Table Design Considerations

While designing your tables in Snowflake, you can take care of the following pointers for efficiency:

1. Referential Integrity Constraints

It is recommended to implement referential integrity constraints in Snowflake. They provide valuable metadata that users can use to understand the schema and the relationships defined between tables.

In Snowflake, referential integrity constraints are not enforced by default. When created, they are disabled. NOT NULL is an exception and is enforced by default.
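Below is a minimal sketch of how constraints can be declared (the table and column names are hypothetical). The PRIMARY KEY and FOREIGN KEY constraints are recorded as metadata only, while NOT NULL is enforced:

CREATE TABLE customers (
  customer_id NUMBER NOT NULL,                          -- NOT NULL is enforced by Snowflake
  email       VARCHAR,
  CONSTRAINT pk_customers PRIMARY KEY (customer_id)     -- metadata only, not enforced
);

CREATE TABLE orders (
  order_id    NUMBER NOT NULL,
  customer_id NUMBER,
  CONSTRAINT fk_orders_customers FOREIGN KEY (customer_id)
    REFERENCES customers (customer_id)                  -- metadata only, documents the relationship
);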

2. Utilizing Clustering Keys

For Big Data sets, clustering is a good practice and helps improve query performance.

In Snowflake, automatic tuning and micro-partitioning are built in. Most often, users load data into the Data Warehouse through the Snowflake ETL process, the data is organized into micro-partitions by timestamp or date, and it is queried along the same dimension. Clustering keys can be helpful in the following scenarios (see the example after this list):

  1. Consider an example where the Snowflake ETL process loads data into the Warehouse by timestamp, but the data is queried by ID. When data is queried along a dimension different from the one it was loaded on, clustering will be helpful.
  2. If Query Profiling shows that a significant portion of the total query time is spent only on scanning the table, clustering can help. This often applies to queries that filter on many columns.
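As an illustration, here is a minimal sketch of defining a clustering key on a hypothetical events table that is loaded by timestamp but queried by ID:

-- add a clustering key to an existing table
ALTER TABLE events CLUSTER BY (id);

-- or define the clustering key at creation time
CREATE OR REPLACE TABLE events_clustered (
  id       NUMBER,
  event_ts TIMESTAMP_NTZ,
  payload  VARIANT
)
CLUSTER BY (id);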

Here are a Few Things to Note when Clustering Data

  1. One will have to manually run a DML statement to recluster a table.
  2. The existing data is re-written in a different order upon reclustering. Snowflake retains the previous order for 7 days to provide Fail-safe protection, one of Snowflake’s most lauded features.
  3. Reclustering a table in Snowflake incurs additional cost, directly proportional to the size of the data set that is re-ordered.

Data Storage Considerations

Snowflake provides an array of features for stored data. Continuous Data Protection (CDP), which includes Fail-safe and Time Travel, is given to all Snowflake accounts at no additional cost. That does not mean CDP has no impact on your storage costs; on the contrary, it does.

Note that your account will be charged for all the data stored in schemas, tables, and databases created in your Snowflake architecture. CDP affects storage costs based on how much data is stored and how long it is stored.

Storage costs are incurred until the data leaves the Fail-safe state. This means you pay for data storage whether it is in the Active, Time Travel, or Fail-safe state. Hence, it is important to make the right storage considerations.
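One common way to keep these costs in check is to shorten the Time Travel retention on large staging tables, or to use transient tables (which have no Fail-safe period) for data you can easily reload. A minimal sketch with hypothetical table names:

ALTER TABLE staging_events SET DATA_RETENTION_TIME_IN_DAYS = 1;   -- shorten Time Travel retention

CREATE TRANSIENT TABLE scratch_results (   -- transient tables skip Fail-safe entirely
  run_id NUMBER,
  metric VARCHAR
);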

1. Staged File Storage While Loading Data

To assist in bulk-loading data into tables during the Snowflake ETL process, Snowflake provides a feature called Stages, where the files containing the data to be loaded are staged. Snowflake allows both internal stages (within Snowflake) and external stages (S3, Azure). No additional cost is charged for the Time Travel and Fail-safe features for data stored in internal stages within Snowflake; however, standard data storage costs apply. For bulk data, you can utilize this feature.
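Below is a minimal sketch of staging a local file into a named internal stage (the stage name and file path are hypothetical; the PUT command runs from a client such as SnowSQL, not from the web UI):

CREATE STAGE IF NOT EXISTS etl_stage;                 -- named internal stage
PUT file:///tmp/orders_2024_01.csv.gz @etl_stage;     -- upload a local data file to the stage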

2. Cloning Tables, Schemas, and Databases

Snowflake has a zero-copy cloning feature that gives an easy way to take a “snapshot” of any schema, table, or database. This feature creates a derived copy of that object which initially shares the underlying storage. This can come in handy when creating instant backups. This does not incur any additional costs as long as you do not need to make any changes to the cloned object.
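A minimal sketch of zero-copy cloning, using hypothetical object names:

CREATE TABLE orders_backup CLONE orders;         -- instant "snapshot" of a table
CREATE DATABASE analytics_dev CLONE analytics;   -- clone an entire database, e.g. for a dev environment

Storage is only consumed for data that changes in the clone after it is created.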

Data Loading Considerations

This stage consists of transforming raw data into the format of the Data Warehouse. Because of data mapping complexity, loading data into the Data Warehouse can be time-consuming. Hevo provides auto-mapping features for a seamless Snowflake ETL process.

1. Preparing Your Data Files

The first stage of the Snowflake ETL process is preparing the data files that you want to load into the Data Warehouse.

2. General File Sizing Recommendations

It is important to optimize the number of parallel loads into Snowflake. It is recommended to create compressed data files that are roughly 10 MB to 100 MB in size.

Aggregate the smaller files to reduce processing overhead. Split the large files into a number of smaller files for faster load. This allows you to distribute the load between servers in the active Snowflake warehouse.

3. Data Size Limitations for Semi-Structured Data

The VARIANT data type has a 16 MB (compressed) size limit on the individual rows.

JSON and Avro are the most commonly used semi-structured data formats, and both are often a concatenation of many documents. The source software that produces JSON or Avro output typically provides it as a single huge array containing multiple records. Both line breaks and commas are supported for separating documents.

To enhance efficiency, it is recommended to enable the STRIP_OUTER_ARRAY file format option when executing the COPY INTO <table> command. Snowflake will then remove the outer array structure and load the records into separate table rows. Below is an example:

COPY INTO <table_name>
FROM @~/<file_name>.json
file_format = (type = 'JSON' strip_outer_array = true);

4. Data Size Limitations of Parquet files

It is recommended to split Parquet files greater than 3 GB in size into smaller files of 1 GB or less for smooth transfer. This ensures that the load does not time out.
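For reference, here is a COPY sketch for Parquet files staged under a hypothetical path; the stage and table names are illustrative assumptions:

COPY INTO sales
FROM @etl_stage/parquet/
FILE_FORMAT = (TYPE = 'PARQUET')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;   -- map Parquet columns to table columns by name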

5. Preparing Delimited Text Files

The following points must be considered while preparing CSV/Delimited text files for Snowflake ETL:

  • Files must contain plain-text (character) data. UTF-8 is the default character set; however, additional encodings can be specified using the ENCODING file format option.
  • Within the files, records and fields should be delimited by different characters. Both delimiters must be single characters, and they cannot be the same character. Pipe (|), caret (^), comma (,), and tilde (~) are common field delimiters. The line feed (\n) is often used as the row delimiter.
  • Fields that contain delimiter characters should be enclosed in single or double quotes. If the data being loaded contains quotes, those must be escaped.
  • Fields that contain carriage returns (\r\n) should also be enclosed in single or double quotes. On Windows systems, a carriage return is commonly introduced together with a line feed character to mark the end of a line.
  • Each row should have the same number of columns. (A sample file format definition covering these options follows this list.)
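A minimal sketch of a named file format that captures the points above (the format name and option values are illustrative assumptions):

CREATE FILE FORMAT pipe_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'                 -- single-character field delimiter
  RECORD_DELIMITER = '\n'               -- line feed as the row delimiter
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'    -- quote fields that contain delimiters or line breaks
  ENCODING = 'UTF8'
  SKIP_HEADER = 1;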

Planning a Data Load

Optimizing the data load of large data sets goes a long way in optimizing the Snowflake ETL process. When dealing with large data sets, it is advised to dedicate individual Snowflake Clusters for loading and querying operations. This helps in optimizing the overall performance.

A standard Virtual Warehouse is enough as loading requires fewer resources. Based on the speed at which you want to load data, you can choose the size of the Data Warehouse. Remember to split large data files for faster loading.


Data Staging Considerations

Both Snowflake and your data source (Azure/S3) allow stage references via paths. It is a good practice to stage regular data sets by partitioning them into logical paths. This could include details such as source identifiers or geographical location, etc., along with the date when the data was written.

This exercise will give you the flexibility to copy files by path using a single command, and lets you take advantage of Snowflake’s parallel operations by executing concurrent COPY statements that match a subset of files.
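For instance, if files are staged under source/region/date paths, a single COPY statement can target just one partition. A minimal sketch, where the stage name, path, and file format are hypothetical:

COPY INTO page_views
FROM @etl_stage/web/us-east/2024/01/15/            -- only the files under this logical path are loaded
FILE_FORMAT = (FORMAT_NAME = 'pipe_csv_format');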

Loading Data

Data is loaded into Snowflake using the COPY INTO <table> command. Understanding the levers of this command can go a long way in helping you optimize the Snowflake ETL process. As an alternative to this time-consuming process, you can also use a No-Code Data Pipeline tool such as Hevo to automate the whole process.

The COPY command loads data from staged files into an existing table. The possible staging locations are as follows:

  • An internal stage (table, user, or named stage); the PUT command can be used to stage files.
  • A named external stage that references an AWS S3 bucket or Microsoft Azure container.
  • An external location (an AWS S3 bucket or Microsoft Azure container referenced directly).

Options for Selecting Staged Data Files:

The COPY command accepts several options for loading data from a stage. These let you bring only a fraction of the staged data into Snowflake. Here are some options, illustrated in the sketch after this list:

  • Specify internal stages by path
  • Use a prefix to load data from an Amazon S3 bucket
  • List specific files to load
  • Load specific files using pattern matching

For details on the implementation of each of these, refer to Snowflake documentation.
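As a quick illustration of these options (stage, table, and file names are hypothetical):

-- load only two specific files from the stage
COPY INTO page_views
FROM @etl_stage
FILES = ('day1.csv.gz', 'day2.csv.gz')
FILE_FORMAT = (FORMAT_NAME = 'pipe_csv_format');

-- load files whose names match a regular expression
COPY INTO page_views
FROM @etl_stage
PATTERN = '.*sales_.*[.]csv[.]gz'
FILE_FORMAT = (FORMAT_NAME = 'pipe_csv_format');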

Note that, as a general practice while performing data loads for Snowflake ETL, it is important to partition the data in your Snowflake stages or external locations, such as S3 buckets or Azure containers, using logical, granular paths. By creating partitions based on details like location or application, along with the date the data was written, you optimize for later data loading activity.

When you load the data using Snowflake ETL, you can simply copy any fraction of the partitioned data into Snowflake with a single command. You can copy data into Snowflake by the hour, day, month, or even year when you initially populate tables.

Conclusion

In this article, you read about Snowflake ETL and how it helps users load data from data sources into the Data Warehouse. You also read about some of the best practices to follow while performing the Snowflake ETL process. Manually transforming data and monitoring data loads can make this a cumbersome process.

Enhance your system with Snowflake performance tuning tips from our detailed guide on optimizing Snowflake queries and processes.

Hevo offers an entirely Automated, Hassle-Free Data Integration with 150+ Data Sources (including 60+ free sources) to deliver your data in real time without any loss from source to Snowflake. It empowers you to leverage & mine your extracted data to run Predictive Analytics and gain actionable insights with ease!

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite firsthand.

Share your experience of understanding Snowflake ETL Best Practices in the comment section below! We would love to hear your thoughts.

FAQs

1. What is Snowflake tasks for ETL?

Snowflake Tasks are automated jobs that run SQL statements on a scheduled basis. They help streamline ETL processes by allowing you to schedule regular data transformations and loading operations without manual intervention.
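A minimal sketch of a scheduled Task, using hypothetical warehouse and table names:

CREATE TASK nightly_load
  WAREHOUSE = etl_wh
  SCHEDULE = 'USING CRON 0 2 * * * UTC'    -- run every day at 02:00 UTC
AS
  INSERT INTO daily_summary
  SELECT order_date, SUM(amount)
  FROM orders
  GROUP BY order_date;

ALTER TASK nightly_load RESUME;            -- tasks are created in a suspended state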

2. How to build an ETL pipeline in Snowflake?

To build an ETL pipeline in Snowflake, first, extract data from various sources, then transform it using SQL queries or other tools, and finally load it into Snowflake tables. You can use Snowflake Tasks and Streams to automate and manage this process efficiently.
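A minimal sketch of a Stream-plus-Task pipeline, with hypothetical names, that processes only newly inserted rows:

CREATE STREAM orders_stream ON TABLE orders;        -- tracks changes to the source table

CREATE TASK load_new_orders
  WAREHOUSE = etl_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')      -- skip runs when there is nothing new
AS
  INSERT INTO orders_clean
  SELECT order_id, customer_id, amount
  FROM orders_stream
  WHERE METADATA$ACTION = 'INSERT';

ALTER TASK load_new_orders RESUME;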

3. What is Snowflake used for?

Snowflake is a cloud-based data platform used for data warehousing, analytics, and data sharing. It allows organizations to store, manage, and analyze large volumes of data efficiently, making it easier to gain insights and make data-driven decisions.

Software Engineer, Hevo Data

With around a decade of experience, Sarad has designed and developed fundamental components of Hevo. His expertise lies in building lean solutions for various software problems, mentoring fellow engineers, and exploring new technologies.