This article walks through a common data engineering scenario: moving data from Amazon S3, the popular object storage service, into Snowflake, a widely used cloud data warehouse. Before diving into the steps, let us first look at each of these systems individually.
Overview of Amazon S3 and Snowflake
Amazon S3 (Simple Storage Service) is Amazon Web Services' object storage service, widely used as a data lake and staging area for analytics workloads. Snowflake is a popular cloud data warehousing solution that provides a whole host of services to work seamlessly with Amazon S3. Snowflake is a relational data warehouse that also supports semi-structured data formats such as JSON and Parquet.
Moving Data from Amazon S3 to Snowflake
- Building Custom ETL Code: This approach requires a significant investment of engineering resources. Broadly, you need to understand the S3 data source, write code to extract data from S3, prepare the data, and finally copy it into Snowflake. The details and challenges of each step are described in the next sections.
- Using a Ready Data Integration Platform like Hevo (Official Snowflake ETL Partner): Hevo is a cloud data integration platform that enables the smooth transfer of data from S3 to Snowflake without writing any code. With Hevo, you can focus entirely on generating insights from your data while Hevo ensures that the data is available and ready for analysis.
Moving data from Amazon S3 to Snowflake via custom ETL method
- Configuring S3 access for loading data into Snowflake.
- Preparing the data.
- Copying data from S3 buckets to the appropriate Snowflake tables.
- Setting up automatic data loading using Snowpipe.
- Managing data transformations during the data load from S3 to Snowflake.
- Configuring an S3 Bucket for Access
- To control access to an S3 bucket during a data load/unload operation, Amazon Web Services provides the option to create Identity and Access Management (IAM) users with the necessary permissions. IAM user creation is a one-time process that produces a set of credentials enabling the user to access the S3 bucket(s).
- For a larger number of users, another option is to create an IAM role and assign it to a set of users. The IAM role is created with the necessary access permissions to an S3 bucket, and any user with this role can run data load/unload operations without providing credentials.
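On the Snowflake side, such an IAM role can be referenced through a storage integration, which an external stage is then built on. Here is a minimal sketch; the integration name, stage name, role ARN, and bucket path are placeholders for illustration:

```sql
-- Storage integration that delegates authentication to an IAM role
CREATE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake_load_role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://snowflakebucket/data/');

-- External stage pointing at the S3 bucket via the integration
CREATE STAGE s3snowflakestage
  URL = 's3://snowflakebucket/data/'
  STORAGE_INTEGRATION = s3_int;
```

With a storage integration in place, COPY commands can reference the stage without embedding AWS credentials in every statement.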
- Data Preparation
There are two main things to keep in mind when preparing the data:
- Compression – Compressing files stored in the S3 bucket is highly recommended, especially for larger data sets, as it speeds up the transfer. Supported compression methods include gzip, bzip2, brotli, zstandard, deflate, and raw deflate.
- File Format – Ensure the file format of the data files to be loaded is one that Snowflake supports: delimited files (CSV, TSV, etc.), JSON, Avro, ORC, Parquet, or XML.
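These settings can be captured once in a named file format and reused across COPY commands. A sketch, assuming comma-delimited, gzip-compressed files with a header row (the name csvtest matches the format referenced later in this article):

```sql
-- Named file format for comma-delimited, gzip-compressed files
CREATE FILE FORMAT csvtest
  TYPE = CSV
  FIELD_DELIMITER = ','
  COMPRESSION = GZIP
  SKIP_HEADER = 1;
```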
- Copying Data from S3 Buckets
Data is copied from S3 using the COPY INTO command, which looks similar to a copy command in a command prompt or scripting language: it has a source, a destination, and a set of parameters that further define the specific copy operation.
The two common ways to copy data from S3 to Snowflake are using the file format option and the pattern matching option.

File Format – Here's an example:
copy into abc_table from s3://snowflakebucket/data/abc_files credentials=(aws_key_id='$KEY_ID' aws_secret_key='$SECRET_KEY') file_format = (type = csv field_delimiter = ',');
Pattern Matching –
copy into abc_table from s3://snowflakebucket/data/abc_files credentials=(aws_key_id='$KEY_ID' aws_secret_key='$SECRET_KEY') pattern='*test*.csv';
- Set up Automatic Data Loading
Since running a COPY command manually every time a data set needs to be loaded is infeasible, Snowflake provides an option to automatically detect and ingest staged files as they become available in the S3 buckets. This feature is called automatic data loading using Snowpipe. Here are the main features of Snowpipe:
- Snowpipe can be set up in a few different ways to look for newly staged files and load them based on a pre-defined COPY command. For example, an Amazon Simple Queue Service (SQS) notification can trigger the Snowpipe data load.
- When multiple files are staged, Snowpipe appends them to a loading queue. Generally, older files are loaded first, but this is not guaranteed.
- Snowpipe keeps a log of all the S3 files it has already loaded, which lets it identify and ignore duplicate load attempts.
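As a sketch, a Snowpipe with auto-ingest enabled wraps a pre-defined COPY command; the pipe, table, and stage names below are assumptions for illustration:

```sql
-- Pipe that auto-ingests new files staged in the S3 bucket.
-- AUTO_INGEST = TRUE relies on S3 event notifications (e.g. via SQS)
-- being configured to point at this pipe's notification channel.
CREATE PIPE abc_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO abc_table
  FROM @s3snowflakestage
  FILE_FORMAT = (FORMAT_NAME = csvtest);
```

Once the pipe is created and the bucket's event notifications are wired up, newly staged files are loaded without any manual COPY commands.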
- Managing Data Transformations During the Data Load
One of the more powerful features of Snowflake is its ability to transform data during the load. In traditional ETL, data is extracted from a source and loaded into a stage table in a one-to-one fashion; transformations are then performed between the stage table and the destination table. With Snowflake, the intermediate stage table can be skipped, and the following data transformations can be performed during the data load:
- Reordering of columns – The order of columns in the data file doesn’t need to match the order of the columns in the destination table.
- Column omissions – The data file can have fewer columns than the destination table, and the data load will still go through successfully.
- Circumventing column length discrepancies – Snowflake provides options to truncate strings in the data file to fit the field lengths of the destination table.

The above transformations are expressed as SELECT statements within the COPY command. These are similar to SELECT queries against database tables; the only difference is that they pull data from a staged data file in an S3 bucket instead of a database table. Here is an example of a COPY command that uses a SELECT statement to reorder the columns of a data file during the load:
copy into abc_table(ID, name, category, price) from (select x.$1, x.$3, x.$4, x.$2 from @s3snowflakestage x) file_format = (format_name = csvtest);
In the above example, the order of columns in the data file differs from that of abc_table, so the SELECT statement references specific columns using the $<number> syntax to match the column order of abc_table.
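Column length discrepancies, by contrast, can be handled with a copy option rather than a SELECT statement. A sketch, assuming the same hypothetical table, stage, and file format names:

```sql
-- Truncate strings that exceed the destination column length
-- instead of failing the load.
COPY INTO abc_table
FROM @s3snowflakestage
FILE_FORMAT = (FORMAT_NAME = csvtest)
TRUNCATECOLUMNS = TRUE;
```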
Amazon S3 to Snowflake: Limitations and Challenges of Custom Code Approach
After reading this article, writing custom ETL scripts to move data from S3 to Snowflake may not seem that complicated. In reality, however, a lot goes into assembling these pieces in a coherent, robust way so that your data pipelines function reliably and efficiently. Some specific challenges are:
- Beyond one-off uses of the COPY command for specific, ad-hoc tasks, this whole chain of events has to be automated so that near-real-time data is available for analysis as soon as possible. Setting up Snowpipe, or any similar solution, in a reliable fashion is no trivial task.
- On top of setting up and automating these tasks, a growing data infrastructure will eventually face scaling. Depending on growth, volumes can increase quickly, and without a dependable data engineering backbone that can handle that scale, this becomes a problem.
- Data transformations during the load phase can accomplish a lot; however, at scale there will be many data files and many database tables, at which point you need a solution already in place that can keep track of all these updates.
Hevo – A Hassle-free Alternative to Move Data from Amazon S3 to Snowflake
Hevo, a fully managed, simple-to-use data integration platform, reliably and effortlessly moves your data from Amazon S3 to Snowflake. Hevo's real-time streaming architecture ensures that you always have the latest data in Snowflake. Move your data from S3 to Snowflake in three simple steps:
- Connect to Amazon S3 source by providing connection settings
- Select the file format (JSON/CSV/AVRO) and create schema folders
- Configure Snowflake Warehouse
Hevo comes with pre-built integrations with a wide spectrum of data sources – Database (MySQL, PostgreSQL, MongoDB and more), Cloud Storage (FTP/SFTP, Google Drive, and more), Cloud Applications (Salesforce, Google Ads, Facebook Ads, and more). This makes Hevo the right partner to cater to your growing data needs.
In a Nutshell
Moving data from Amazon S3 to Snowflake is a very common data engineering use case in the tech industry. As the custom ETL section shows, you can set things up on your own by following a sequence of steps; however, as the challenges section shows, things can get quite complicated, and a good number of resources may need to be allocated to keep day-to-day operations running consistently.
The other way to be sure of having set up a reliable ETL pipeline in a dynamic data environment is to use the services of Hevo. Hevo comes with a risk-free 14-day free trial. Sign up here to explore Hevo and start moving your data from S3 to Snowflake instantly.