This article talks about a specific Data Engineering scenario where data gets moved from the popular Amazon S3 to Snowflake, a well-known cloud Data Warehousing Software. Before we dive deeper into understanding the steps, let us first understand these individual systems.
You will have a much easier time understanding the ways for setting up the Amazon S3 to Snowflake Integration if you have gone through the following aspects:
- An active account on Amazon Web Services.
- An active account on Snowflake.
- Working knowledge of Databases and Data Warehouses.
- Clear idea regarding the type of data to be transferred.
How to Set up Amazon S3 to Snowflake Integration?
This article delves into both the manual and using Hevo methods in depth. You will also see some of the pros and cons of these approaches and would be able to pick the best method based on your use case.
Let us now go through the two methods:
Method 1: Manual ETL Process to Set up Amazon S3 to Snowflake Integration
You can follow the below-mentioned steps to manually set up Amazon S3 to Snowflake Integration:
Step 1: Configuring an S3 Bucket for Access
To authenticate access control to an S3 bucket during a data load/unload operation, Amazon Web Services provides an option to create Identity Access Management (IAM) users with the necessary permissions.
An IAM user creation is a one-time process that creates a set of credentials enabling a user to access the S3 bucket(s).
In case there are a larger number of users, another option is to create an IAM role and assign this role to a set of users. The IAM role will be created with the necessary access permissions to an S3 bucket, and any user having this role can run data load/unload operations without providing any set of credentials.
Step 2: Data Preparation
There are a couple of things to be kept in mind in terms of preparing the data. They are:
- Compression: Compression of files stored in the S3 bucket is highly recommended, especially for bigger data sets, to help with the smooth and faster transfer of data. Any of the following compression methods can be used:
- raw deflate
- File Format: Ensure the file format of the data files to be loaded matches with the file format from the table below –
|Delimited (CSV, TSV, etc.)||JSON|
Step 3: Copying Data from S3 Buckets to the Appropriate Snowflake Tables
Data copy from S3 is done using a ‘COPY INTO’ command that looks similar to a copy command used in a command prompt or any scripting language. It has a ‘source’, a ‘destination’, and a set of parameters to further define the specific copy operation.
The two common ways to copy data from S3 to Snowflake are using the file format option and the pattern matching option: File format -Here’s an example :
copy into abc_table
credentials=(aws_key_id='$KEY_ID' aws_secret_key='$SECRET_KEY') file_format = (type = csv field_delimiter = ',');
Pattern Matching –
copy into abc_table
credentials=(aws_key_id='$KEY_ID' aws_secret_key='$SECRET_KEY') pattern='*test*.csv';
Step 4: Set up Automatic Data Loading using Snowpipe
As running COPY commands every time a data set needs to be loaded into a table is infeasible, Snowflake provides an option to automatically detect and ingest staged files when they become available in the S3 buckets. This feature is called automatic data loading using Snowpipe.
Here are the main features of a Snowpipe –
- Snowpipe can be set up in a few different ways to look for newly staged files and load them based on a pre-defined COPY command. An example here is to create a Simple-Queue-Service notification that can trigger the Snowpipe data load.
- In the case of multiple files, Snowpipe appends these files into a loading queue. Generally, the older files are loaded first, however, this is not guaranteed to happen.
- Snowpipe keeps a log of all the S3 files that have already been loaded – this helps it identify a duplicate data load and ignore such a load when it is attempted.
Step 5: Managing Data Transformations During the Data Load from S3 to Snowflake
One of the cool features available on Snowflake is its ability to transform the data during the data load. In traditional ETL, data is extracted from one source and loaded into a stage table in a one-to-one fashion. Later on, transformations are done during the data load process between the stage and the destination table. However, with Snowflake, the intermediate stage table can be ignored, and the following data transformations can be performed during the data load –
- Reordering of columns: The order of columns in the data file doesn’t need to match the order of the columns in the destination table.
- Column omissions: The data file can have fewer columns than the destination table, and the data load will still go through successfully.
- Circumvent column length discrepancy: Snowflake provides for options to truncate the string length of data sets in the data file to align with the field lengths in the destination table. The above transformations are done through the use of select statements while performing the COPY command. These select statements are similar to how select SQL queries are written to query database tables, the only difference here is, the select statements pull certain data from a staged data file (instead of a database table) in an S3 bucket. Here is an example of a COPY command using a select statement to reorder the columns of a data file before going ahead with the actual data load –
copy into abc_table(ID, name, category, price)
from (select x.$1, x.$3, x.$4, x.$2 from @s3snowflakestage x) file_format = (format_name = csvtest);
In the above example, the order of columns in the data file is different from that of the abc_table, hence, the select statement calls out specific columns using the $*number* syntax to match with the order of the abc_table.
Limitations of Manual ETL Process to Set up Amazon S3 to Snowflake Integration
After reading this blog, it may appear as if writing custom ETL scripts to achieve the above steps to move data from S3 to Snowflake is not that complicated. However, in reality, there is a lot that goes into building these tiny things in a coherent, robust way to ensure that your data pipelines are going to function reliably and efficiently. Some specific challenges to that end are –
- Other than one-of-use of the COPY command for specific, ad-hoc tasks, as far as data engineering and ETL goes, this whole chain of events will have to be automated so that real-time data is available as soon as possible for analysis. Setting up Snowpipe or any similar solution to achieve that reliably is no trivial task.
- On top of setting up and automating these tasks, the next thing a growing data infrastructure is going to face is scaling. Depending on the growth, things can scale up really quickly and if you don’t have a dependable, data engineering backbone that can handle this scale, it can become a problem.
- With functionalities to perform data transformations, a lot can be done in the data load phase, however, again with scale, there are going to be so many data files and so many database tables, at which point, you’ll need to have a solution already deployed that can keep track of these updates and stay on top of it.
Method 2: Using Hevo Data to Set up Amazon S3 to Snowflake Integration
Hevo Data, a No-code Data Pipeline, helps you directly transfer data from Amazon S3 and 150+ other data sources to Data Warehouses such as Snowflake, Databases, BI tools, or a destination of your choice in a completely hassle-free & automated manner.
Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.
Hevo Data takes care of all your data preprocessing needs and lets you focus on key business activities and draw a much more powerful insight on how to generate more leads, retain customers, and take your business to new heights of profitability. It provides a consistent & reliable solution to manage data in real-time and always has analysis-ready data in your desired destination.
Loading data into Snowflake using Hevo is easier, more reliable, and fast. Hevo is a no-code automated data pipeline platform that solves all the challenges described above.
For any information on Amazon S3 Logs, you can visit the former link.
Sign up here for a 14-Day Free Trial!
You can move data from Amazon S3 to Snowflake by following 3 simple steps without writing any piece of code.
- Connect to Amazon S3 source by providing connection settings.
- Select the file format (JSON/CSV/AVRO) and create schema folders.
- Configure Snowflake Warehouse.
Hevo will take all groundwork of moving data from Amazon S3 to Snowflake in a Secure, Consistent, and Reliable fashion.
Here are more reasons to try Hevo:
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
- Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Download the Cheatsheet on How to Set Up ETL to Snowflake
Learn the best practices and considerations for setting up high-performance ETL to Snowflake
Amazon S3 to Snowflake is a very common data engineering use case in the tech industry. As mentioned in the custom ETL method section, you can set things up on your own by following a sequence of steps, however, as mentioned in the challenges section, things can get quite complicated and a good number of resources may need to be allocated to these tasks to ensure consistent, day-to-day operations.
Visit our Website to Explore Hevo
Businesses can use automated platforms like Hevo Data to set this integration and handle the ETL process. It helps you directly transfer data from a source of your choice to a Data Warehouse, Business Intelligence tools, or any other desired destination in a fully automated and secure manner without having to write any code and will provide you a hassle-free experience.
Want to try Hevo?
Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. Have a look at our unbeatable pricing, which will help you choose the right plan for you.
Share your experience of setting up Amazon S3 to Snowflake Integration in the comments section below!