Connecting Amazon S3 to Snowflake: 2 Easy Methods

on Data Warehouse, Engineering, ETL, Hevo • February 24th, 2020 • Write for Hevo

S3 to Snowflake

This article talks about a specific Data Engineering scenario where data gets moved from the popular Amazon S3 to Snowflake, a well-known cloud Data Warehousing Software. Before we dive deeper into understanding the steps, let us first understand these individual systems.  

Table of Contents

Prerequisites

You will have a much easier time understanding the ways for setting up the Amazon S3 to Snowflake Integration if you have gone through the following aspects:

  • An active account on Amazon Web Services.
  • An active account on Snowflake.
  • Working knowledge of Databases and Data Warehouses.
  • Clear idea regarding the type of data to be transferred.

Introduction to Amazon S3

Amazon S3 Logo
Image Source

Amazon Simple Storage Service or Amazon S3 is file storage fully managed by Amazon and available as part of its suite of data services called Amazon Web Services (AWS). It is highly flexible in terms of user requirements, hence, you can either use the basic, minimal storage options for small data pipelines, or you can scale up to tens of terabytes of S3 data storage for large-scale Data Engineering scenarios. It also offers competitive, pay-as-you-go pricing for its users.

To know more about Amazon S3, visit this link.

Introduction to Snowflake

Snowflake Logo
Image Source

Snowflake is a popular cloud Data Warehousing solution that provides a whole host of services to work seamlessly with Amazon S3. Snowflake is a relational Data Warehouse that supports some semi-structured data formats like JSON, Parquet, etc. as well.

To know more about Snowflake, visit this link.

Methods to Set up Amazon S3 to Snowflake Integration

Method 1: Manual ETL Process to Set up Amazon S3 to Snowflake Integration

This approach would need you to invest in heavy engineering resources. The broad steps in this approach would need you to understand the S3 Data Source, write code to extract data from S3, prepare the data, and finally copy it into Snowflake. The details and challenges of each step are described in the next sections.

Method 2: Using Hevo Data to Set up Amazon S3 to Snowflake Integration

Hevo Data is an automated Data Pipeline platform that can move your data from Amazon S3 to Snowflake very quickly without writing a single line of code. It is simple, hassle-free, and reliable.

Moreover, Hevo offers a fully-managed solution to set up data integration from 100+ data sources (including 30+ free data sources) and will let you directly load data to a Data Warehouse such as Snowflake, Amazon Redshift, Google BigQuery, etc. or the destination of your choice. It will automate your data flow in minutes without writing any line of code. Its Fault-Tolerant architecture makes sure that your data is secure and consistent. Hevo provides you with a truly efficient and fully automated solution to manage data in real-time and always have analysis-ready data.

Get Started with Hevo for Free

Methods to Set up Amazon S3 to Snowflake Integration

This article delves into both the manual and using Hevo methods in depth. You will also see some of the pros and cons of these approaches and would be able to pick the best method based on your use case. Below are the two methods:

Download the Cheatsheet on How to Set Up ETL to Snowflake
Download the Cheatsheet on How to Set Up ETL to Snowflake
Download the Cheatsheet on How to Set Up ETL to Snowflake
Learn the best practices and considerations for setting up high-performance ETL to Snowflake

Method 1: Manual ETL Process to Set up Amazon S3 to Snowflake Integration

You can follow the below-mentioned steps to manually set up Amazon S3 to Snowflake Integration:

Step 1: Configuring an S3 Bucket for Access

To authenticate access control to an S3 bucket during a data load/unload operation, Amazon Web Services provides an option to create Identity Access Management (IAM) users with the necessary permissions. An IAM user creation is a one-time process that creates a set of credentials enabling a user to access the S3 bucket(s).

In case there are a larger number of users, another option is to create an IAM role and assign this role to a set of users. The IAM role will be created with the necessary access permissions to an S3 bucket, and any user having this role can run data load/unload operations without providing any set of credentials.

Step 2: Data Preparation

There are a couple of things to be kept in mind in terms of preparing the data. They are:

  • Compression: Compression of files stored in the S3 bucket is highly recommended, especially for bigger data sets, to help with the smooth and faster transfer of data. Any of the following compression methods can be used:
    • gzip
    • bzip2
    • brotli
    • zstandard
    • deflate
    • raw deflate
  • File Format: Ensure the file format of the data files to be loaded matches with a file format from the table below – 
StructuredSemi-structured
Delimited (CSV, TSV, etc.)JSON
 Avro
 ORC
 Parquet
 XML

Step 3: Copying Data from S3 Buckets to the Appropriate Snowflake Tables

Data copy from S3 is done using a ‘COPY INTO’ command that looks similar to a copy command used in a command prompt or any scripting language. It has a ‘source’, a ‘destination’, and a set of parameters to further define the specific copy operation.
The two common ways to copy data from S3 to Snowflake are using the file format option and the pattern matching option: File format -Here’s an example :

copy into abc_table   
from s3://snowflakebucket/data/abc_files 
credentials=(aws_key_id='$KEY_ID' aws_secret_key='$SECRET_KEY')   file_format = (type = csv field_delimiter = ',');

Pattern Matching –

copy into abc_table   
from s3://snowflakebucket/data/abc_files 
credentials=(aws_key_id='$KEY_ID' aws_secret_key='$SECRET_KEY')   pattern='*test*.csv';

Step 4: Set up Automatic Data Loading using Snowpipe

As running COPY commands every time a data set needs to be loaded into a table is infeasible, Snowflake provides an option to automatically detect and ingest staged files when they become available in the S3 buckets. This feature is called automatic data loading using Snowpipe. Here are the main features of a Snowpipe –

  1. Snowpipe can be set up in a few different ways to look for newly staged files and load them based on a pre-defined COPY command. An example here is to create a Simple-Queue-Service notification that can trigger the Snowpipe data load.
  2. In the case of multiple files, Snowpipe appends these files into a loading queue. Generally, the older files are loaded first, however, this is not guaranteed to happen.
  3. Snowpipe keeps a log of all the S3 files that have already been loaded – this helps it identify a duplicate data load and ignore such a load when it is attempted.

Step 5: Managing Data Transformations During the Data Load from S3 to Snowflake

One of the cool features available on Snowflake is its ability to transform the data during the data load. In traditional ETL, data is extracted from one source and loaded into a stage table in a one-to-one fashion. Later on, transformations are done during the data load process between the stage and the destination table. However, with Snowflake, the intermediate stage table can be ignored, and the following data transformations can be performed during the data load –

  1. Reordering of columns: The order of columns in the data file doesn’t need to match the order of the columns in the destination table.
  2. Column omissions: The data file can have fewer columns than the destination table, and the data load will still go through successfully.
  3. Circumvent column length discrepancy: Snowflake provides for options to truncate the string length of data sets in the data file to align with the field lengths in the destination table. The above transformations are done through the use of select statements while performing the COPY command. These select statements are similar to how select SQL queries are written to query database tables, the only difference here is, the select statements pull certain data from a staged data file (instead of a database table) in an S3 bucket. Here is an example of a COPY command using a select statement to reorder the columns of a data file before going ahead with the actual data load –
copy into abc_table(ID, name, category, price)    
from (select x.$1, x.$3, x.$4, x.$2  from @s3snowflakestage x)    file_format = (format_name = csvtest);

In the above example, the order of columns in the data file is different from that of the abc_table, hence, the select statement calls out specific columns using the $*number* syntax to match with the order of the abc_table.

Limitations of Manual ETL Process to Set up Amazon S3 to Snowflake Integration

After reading this blog, it may appear as if writing custom ETL scripts to achieve the above steps to move data from S3 to Snowflake is not that complicated. However, in reality, there is a lot that goes into building these tiny things in a coherent, robust way to ensure that your data pipelines are going to function reliably and efficiently. Some specific challenges to that end are – 

  • Other than one-of-use of the COPY command for specific, ad-hoc tasks, as far as data engineering and ETL goes, this whole chain of events will have to be automated so that real-time data is available as soon as possible for analysis. Setting up Snowpipe or any similar solution to achieve that reliably is no trivial task.
  • On top of setting up and automating these tasks, the next thing a growing data infrastructure is going to face is scaling. Depending on the growth, things can scale up really quickly and if you don’t have a dependable, data engineering backbone that can handle this scale, it can become a problem. 
  • With functionalities to perform data transformations, a lot can be done in the data load phase, however, again with scale, there are going to be so many data files and so many database tables, at which point, you’ll need to have a solution already deployed that can keep track of these updates and stay on top of it.

Method 2: Using Hevo Data to Set up Amazon S3 to Snowflake Integration

Hevo Banner
Image Source

Hevo Data, a No-code Data Pipeline, helps you directly transfer data from Amazon S3 and 100+ other data sources to Data Warehouses such as Snowflake, Databases, BI tools, or a destination of your choice in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

Hevo Data takes care of all your data preprocessing needs and lets you focus on key business activities and draw a much more powerful insight on how to generate more leads, retain customers, and take your business to new heights of profitability. It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination.

Loading data into Snowflake using Hevo is easier, reliable, and fast. Hevo is a no-code automated data pipeline platform that solves all the challenges described above.

For any information on Amazon S3 Logs, you can visit the former link.

Sign up here for a 14-Day Free Trial!

You can move data from Amazon S3 to Snowflake by following 3 simple steps without writing any piece of code. 

  • Connect to Amazon S3 source by providing connection settings.
Configuring the S3 as Source in Hevo Data
Image Source
  • Select the file format (JSON/CSV/AVRO) and create schema folders.
  • Configure Snowflake Warehouse.
Configuring the Snowflake as Destination in Hevo Data
Image Source

Hevo will take all groundwork of moving data from Amazon S3 to Snowflake in a Secure, Consistent, and Reliable fashion. 

Here are more reasons to try Hevo:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.

Conclusion

Amazon S3 to Snowflake is a very common data engineering use case in the tech industry. As mentioned in the custom ETL method section, you can set things up on your own by following a sequence of steps, however, as mentioned in the challenges section, things can get quite complicated and a good number of resources may need to be allocated to these tasks to ensure consistent, day-to-day operations. 

Visit our Website to Explore Hevo

Businesses can use automated platforms like Hevo Data to set this integration and handle the ETL process. It helps you directly transfer data from a source of your choice to a Data Warehouse, Business Intelligence tools, or any other desired destination in a fully automated and secure manner without having to write any code and will provide you a hassle-free experience.

Want to try Hevo? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. Have a look at our unbeatable pricing, which will help you choose the right plan for you.

Share your experience of setting up Amazon S3 to Snowflake Integration in the comments section below!

No-code Data Pipeline for Snowflake