AWS Glue is a Cloud-optimized ETL (Extract, Transform, and Load) service from Amazon Web Services (AWS). It allows you to organize, locate, move, and transform all the datasets across your business so that you can put them to use. While AWS Glue is powerful on its own, it delivers the most value when paired with a reliable Data Warehouse, and AWS Glue Snowflake Integration is one of the most popular such pairings.
AWS Glue Snowflake Integration ensures that your data is securely stored in a centralized location, making it easier to fetch your data when needed and to extract meaningful insights from it.
This article will introduce you to AWS Glue Snowflake Integration and provide a comprehensive guide that you can follow while setting it up. The guide covers all the steps, including the AWS Glue and Snowflake configuration.
Table of Contents
- Introduction to AWS Glue
- Introduction to Snowflake
- AWS Glue Configuration for AWS Glue Snowflake Integration
- Snowflake Configuration for AWS Glue Snowflake Integration
- Steps to Set Up AWS Glue Snowflake Integration
- Step 1: Creating a Connection between AWS Glue and Snowflake
- Step 2: Creating a Connection from Snowflake to S3 ETL Job
- Step 3: Creating a Connection from S3 to Snowflake S3 Job
Here are the prerequisites for setting up AWS Glue Snowflake Integration:
- A Snowflake account (you can sign up for a 30-day free trial, which will be sufficient for this tutorial).
- An AWS account (Note that AWS Glue ETL jobs are not covered under the free tier, so you will incur a nominal charge while following this tutorial. I’ll provide instructions for keeping this cost as low as possible; if everything works on your first attempt, your total cost should be under $0.10).
- Familiarity with AWS IAM roles and AWS S3 will help.
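To see why the cost stays this low: Glue bills per DPU-hour while a job runs. Here is a rough back-of-envelope sketch, assuming the common on-demand rate of $0.44 per DPU-hour and per-second billing with a 1-minute minimum (both are assumptions; check the Glue pricing page for your region):

```python
# Back-of-envelope estimate of the cost of one Glue ETL job run.
# Assumes $0.44 per DPU-hour (region-dependent) and the 1-minute
# minimum billing that applies to recent Glue versions.

def estimate_glue_job_cost(num_dpus: int, runtime_seconds: float,
                           rate_per_dpu_hour: float = 0.44,
                           min_billed_seconds: int = 60) -> float:
    """Return the approximate cost in USD of a single job run."""
    billed = max(runtime_seconds, min_billed_seconds)
    return num_dpus * (billed / 3600) * rate_per_dpu_hour

# The jobs in this tutorial use 2 workers and finish in about a minute,
# so even a handful of retries stays well under $0.10:
cost = estimate_glue_job_cost(num_dpus=2, runtime_seconds=60)
print(f"~${cost:.4f} per run")
```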
Introduction to AWS Glue
The demand for cross-platform data exchange is growing by the day. You may have data stored in one place, in a certain format, and want to process it somewhere else, in a different format. In situations like these, managed ETL (Extract, Transform, Load) services like AWS Glue come into the picture.
AWS Glue is serverless, which means that you don’t need to take on the headache of provisioning and managing servers, and you get charged only for the time taken by your ETL jobs. It even comes with a visual interface that allows you to define your ETL jobs without writing a single line of code.
For more information on AWS Glue, click here.
Introduction to Snowflake
Snowflake is a Data Warehousing solution provided as a SaaS (Software-as-a-Service) offering. It runs on AWS, Azure, or Google Cloud in the background (you need to select one of these Cloud Service Providers, and a region, while signing up). It decouples compute and storage, allowing you to pay for the two separately. Like other Data Warehouses, it uses Columnar Storage for parallel query execution.
For more information on Snowflake, click here.
Load Data to Snowflake Seamlessly Using Hevo’s No Code Data Pipeline
Hevo Data, an Automated No Code Data Pipeline empowers you to transfer data to Snowflake in a hassle-free manner. You can easily ingest data from numerous sources using Hevo’s Data Pipelines and replicate it to your Snowflake account without writing a single line of code. Hevo’s end-to-end Data Management service automates the process of not only loading your data but also transforming and enriching it into an analysis-ready form when it reaches Snowflake. Once you assign Hevo the required Role Permissions in Snowflake, you can rely on its fully managed Data Pipeline to safely transport your data in a lossless manner.
Hevo supports direct integrations with 100+ sources (including 40 free sources) and its Data Mapping feature works continuously to replicate your data to Snowflake and builds a single source of truth for your business. With Hevo you can choose exactly which data you wish to replicate to Snowflake and at what frequency. Hevo’s fault-tolerant architecture will enrich and transform your data in a secure and consistent manner and load it to Snowflake without any assistance from your side. Since Hevo is an official Snowflake Data Pipeline Partner, you can entrust us with your data transfer process and enjoy a hassle-free experience. This way, you can focus more on Data Analysis, instead of data consolidation.
Check out why Hevo is the Best:
- Secure: Hevo has a fault-tolerant architecture that ensures that your data is handled in a secure, consistent manner with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to your Snowflake schema.
- Quick Setup: Hevo with its automated features, can be set up in minimal time. Moreover, with its simple and interactive UI, it is extremely easy for new customers to work on and perform operations.
- Transformations: Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the Data Pipelines you set up. You need to edit the event object’s properties received in the transform method as a parameter to carry out the transformation. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use for aggregation.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
- Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
With continuous Real-Time data movement, Hevo allows you to assemble data from multiple sources and seamlessly load it to Snowflake with a no-code, easy-to-set-up interface. Sign up here for a 14-day free trial and experience the feature-rich Hevo suite firsthand!
AWS Glue Configuration for AWS Glue Snowflake Integration
AWS Glue needs configuration before setting up AWS Glue Snowflake Integration. Follow the steps below to configure AWS Glue for AWS Glue Snowflake Integration:
Step 1: Creating the IAM Role
If you don’t have a suitable IAM Role already, you will need to create one. Follow the steps below to create a suitable IAM Role:
- On the AWS Console, go to Services > IAM.
- Click on “Roles” from the left menu, and then click on “Create role”.
- In the subsequent screen, choose “AWS service”, and for the use case, select “Glue” from the list below, and click on “Next: Permissions” as shown in the image below.
- On the permissions screen, search for Glue, and add all the permissions. Then search for S3 and add the “AmazonS3FullAccess” permission as shown in the image below.
- Strictly speaking, the “AmazonS3FullAccess” permission is overkill: an IAM Role scoped to just the relevant bucket(s) works as well. But for the sake of simplicity, let’s give our IAM role full access to S3. You can skip the “Tags” screen.
- In the final “Review” screen, provide a suitable name for your role (let’s call it “glue_s3_role”), and click on “Create Role” as shown in the image below.
Step 2: Adding JDBC Connector to S3
You will need to download the Snowflake JDBC Connector jar file. You can download it from here. For instance, you can download version 3.9.2 as shown in the image below.
In case you are unfamiliar with it, JDBC stands for Java Database Connectivity; in very simple terms, it is an API that allows Java programs to access a database. You can read more about it here.
Once downloaded, upload the jar file to a bucket in S3. If you don’t have an S3 bucket, create one (in the same region where you wish to run your Glue jobs). After the upload is done, note down the S3 URI of the jar file; it will be used in the integration part as shown in the image below.
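If you prefer to script this step, the S3 URI simply has the form s3://&lt;bucket&gt;/&lt;key&gt;. A minimal sketch follows; the bucket and key names are placeholders, so substitute your own:

```python
# Build the s3:// URI that Glue Studio expects for the connector jar.
# Bucket and key names below are placeholders -- substitute your own.

def s3_uri(bucket: str, key: str) -> str:
    return f"s3://{bucket}/{key}"

bucket = "my-glue-assets-bucket"       # must be in the same region as your Glue jobs
key = "jars/snowflake-jdbc-3.9.2.jar"
print(s3_uri(bucket, key))             # note this URI down for the connector setup

# To script the upload itself (requires boto3 and AWS credentials):
# import boto3
# boto3.client("s3").upload_file("snowflake-jdbc-3.9.2.jar", bucket, key)
```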
That’s it. You have successfully configured AWS Glue for AWS Glue Snowflake Integration.
Snowflake Configuration for AWS Glue Snowflake Integration
Similar to AWS Glue, Snowflake also needs to be configured for AWS Glue Snowflake Integration. Once you create a Snowflake account, follow the steps below to configure it for AWS Glue Snowflake Integration:
Step 1: Creating a New Data Warehouse
Follow the steps below to create a new Data Warehouse:
- Within Snowflake, go to the “Warehouses” tab, and click on “Create”. Give it a suitable name (here let’s call it “glue_integration_test”).
- Leave all other setup parameters at their default values, except for the Size parameter, where you can opt for the X-Small size to save credits.
- Click on “Finish” once done. Your Data Warehouse will be created as shown in the image below.
Step 2: Creating a New Database
Go to the “Databases” tab, and click on ‘Create’. Give it a suitable name. I’ll call it “glue_integration_test_db” and click on “Finish” as shown in the image below.
Step 3: Creating a New Table
Follow the steps below to create a new table:
- Click on the Database you just created, then click on “Create”. The table creation popup opens up.
- Give the table a suitable name (here, let’s call it “glue_integration_test_table”). You can leave the schema as public.
- Now, you need to define columns. In this article, you will be copying data from the SNOWFLAKE_SAMPLE_DATA.WEATHER.DAILY_14_TOTAL table in Snowflake into a file in S3, and then using that S3 file to populate this table. If you look at the “daily_14_total” table, it has 2 columns: V and T. Thus, define 2 columns with the same names here, leaving their types as STRING as shown in the image below.
- Click on “Finish”. Your table will be created. This ends the Snowflake setup required for this article.
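If you prefer SQL to the UI, the warehouse, database, and table above can equivalently be created with three statements (run them in a Snowflake worksheet, or via snowflake-connector-python). This is a sketch; the names mirror the ones chosen in this section:

```python
# Equivalent Snowflake SQL for the warehouse/database/table created above.
# Run these in a Snowflake worksheet or through snowflake-connector-python.
setup_statements = [
    "CREATE WAREHOUSE IF NOT EXISTS glue_integration_test "
    "WITH WAREHOUSE_SIZE = 'XSMALL'",
    "CREATE DATABASE IF NOT EXISTS glue_integration_test_db",
    "CREATE TABLE IF NOT EXISTS glue_integration_test_db.public."
    "glue_integration_test_table (V STRING, T STRING)",
]

for stmt in setup_statements:
    print(stmt + ";")
```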
That’s it. You have successfully configured Snowflake for AWS Glue Snowflake Integration.
Steps to Set Up AWS Glue Snowflake Integration
The setup of AWS Glue Snowflake Integration demands technical proficiency, as all the tasks need to be performed manually. It majorly has 3 steps, listed below:
- Step 1: Creating a Connection between AWS Glue and Snowflake
- Step 2: Creating a Connection from Snowflake to S3 ETL Job
- Step 3: Creating a Connection from S3 to Snowflake S3 Job
Let’s walk through these steps in detail.
Step 1: Creating a Connection between AWS Glue and Snowflake
The 1st step for setting up AWS Glue Snowflake Integration is rather easy. Follow the steps below to create a connection between AWS Glue and Snowflake:
1) Creating a Connector
Now, in order to interface with Snowflake, you first need to create a Connector. Go to AWS Glue, and click on “AWS Glue Studio” from the left menu. Once in Glue Studio, click on “View Connectors” and then click on “Create Custom Connector” as shown in the image below.
In the form that opens up, enter the following:
- For the connector S3 URL, enter the S3 URI of the JDBC Connector noted in the AWS Setup section.
- Give the connector a name. I’ll call it “snowflake_jdbc_connector”.
- Set ‘JDBC’ as the connector type.
- For the class name, enter “net.snowflake.client.jdbc.SnowflakeDriver”.
- For the JDBC URL base, enter jdbc:snowflake://<your_snowflake_account_info>
- In place of <your_snowflake_account_info>, enter the actual Snowflake account URL (it will have the format <account_identifier>.snowflakecomputing.com, with some variations. For instance, here the Snowflake account is hosted on AWS in the us-east-2 region (Ohio), so the account info looks like abcdefg.us-east-2.aws.snowflakecomputing.com, with abcdefg representing the account identifier).
- For the URL parameter delimiter, enter ‘&’.
- You can add an optional description as shown in the image below.
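As a sanity check, the URL base can be assembled programmatically. A minimal sketch, assuming the standard &lt;account_identifier&gt;.&lt;region&gt;.&lt;cloud&gt;.snowflakecomputing.com host format (the account values here are made up; your account’s exact host may vary):

```python
# Assemble the JDBC URL base for the Glue custom connector from a Snowflake
# account identifier, region, and cloud. Follows Snowflake's standard JDBC
# URL scheme; the example account values are made up.

def jdbc_url_base(account: str, region: str = "", cloud: str = "") -> str:
    host_parts = [p for p in (account, region, cloud) if p]
    return "jdbc:snowflake://" + ".".join(host_parts) + ".snowflakecomputing.com"

# An AWS-hosted account "abcdefg" in us-east-2 (Ohio), as in the example above:
print(jdbc_url_base("abcdefg", "us-east-2", "aws"))
# jdbc:snowflake://abcdefg.us-east-2.aws.snowflakecomputing.com
```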
2) Creating a Connection
Once your Connector is created, follow the steps below to create a connection:
- Click on the connector you created in the last step, and click on “Create Connection”.
- In the form that opens up, provide a name for the connection and an optional description (here, let’s name the connection “snowflake_jdbc_connection”).
- Next, in the “Connection Options” tab, choose the credentials type as default and then enter values for the Snowflake username, password, and Warehouse (enter the name of the Warehouse created in the Snowflake Configuration section above) as shown in the image below.
Note that the better method is to use the Secrets Manager to store and retrieve your credentials. However, that is beyond the scope of this article. You can learn more about the Secrets Manager here.
- Once your credentials and Warehouse values are entered, click on “Save Changes”. Your connection will successfully be created.
Step 2: Creating a Connection from Snowflake to S3 ETL Job
Follow the steps below to create a connection between Snowflake and S3 ETL Job for AWS Glue Snowflake Integration:
1) Setting Up Job Details
- Once the connection is created, click on the connection, and click on “Create Job”.
- Go to the “Job Details” tab, and fill in the form as follows:
- Give it a name (in this article, let’s name it “testing_snowflake_integration”).
- For the IAM Role, select the role created in the AWS Setup step.
- Type, Glue version, Language, and Worker type can be left as default.
- The number of workers can be set to 2 instead of 10 to help save cost. Each worker (DPU) contributes to the cost, and a minimum of 2 is required.
- Job bookmark can be set to “Disable”, since we don’t care about state information here.
- The number of retries can be set to 0, again to save cost.
- Job timeout can be reduced to a lower value (say, 5 minutes), again to make sure that you don’t end up spending a lot in case something goes wrong.
- Click on “Save” (top-right) as shown in the image below.
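For reference, the same settings can also be expressed programmatically through Glue’s CreateJob API. The sketch below shows the argument shape; the role name and script location are placeholders (the visual editor normally generates and stores the script for you):

```python
# The job settings above, expressed as arguments to Glue's CreateJob API.
# Role and script locations are placeholders -- substitute your own.

job_args = {
    "Name": "testing_snowflake_integration",
    "Role": "glue_s3_role",                    # IAM role from the AWS setup step
    "Command": {"Name": "glueetl",
                "ScriptLocation": "s3://my-glue-assets-bucket/scripts/job.py",
                "PythonVersion": "3"},
    "WorkerType": "G.1X",
    "NumberOfWorkers": 2,                      # minimum allowed; keeps cost low
    "MaxRetries": 0,
    "Timeout": 5,                              # minutes
    "DefaultArguments": {"--job-bookmark-option": "job-bookmark-disable"},
}

# To actually create the job (requires AWS credentials):
# import boto3
# boto3.client("glue").create_job(**job_args)
```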
2) Creating the Job Visually
Follow the steps below to create the Job visually:
- Go to the “Visual” tab. You will see that a node for the Snowflake JDBC connector has already been created.
- Click on it and click on the “Data source properties – Connector” tab.
- Select “Write a query” and copy the following in the query code:
SELECT * from SNOWFLAKE_SAMPLE_DATA.WEATHER.DAILY_14_TOTAL LIMIT 10
- You are basically querying 10 rows from the daily_14_total table in the weather schema of the snowflake_sample_data DB. This DB is available to you whenever you create your free Snowflake account, as shown in the image below.
- Next, click on “Target” and select ‘Amazon S3’. Click on it, and within “Node Properties”, select the above data source within the “Node Parents” dropdown. You will see an arrow linking the source to the target.
- Next, within “Data Target Properties – S3”, set the format to JSON, and set the S3 Target Location to a bucket or a directory within a bucket in S3 where you wish to store the data fetched from Snowflake as shown in the image below. Make sure your IAM Role has access to that bucket.
- Now, click on “Save” in the top right, and then click on “Run”. You can observe the status in the “Runs” tab. If all goes well, you should see the “Succeeded” status, with an execution time in the ballpark of 1 minute as shown in the image below.
You should also be able to see the file created in S3 as shown in the image below.
If you download the file and open it in an editor, you will see that it is actually the JSON version of the table entries.
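Glue’s JSON target writes one JSON object per line (JSON Lines), one per table row. Here is a quick way to sanity-check the downloaded file in Python; the two sample records below are made up, shaped like the V/T columns exported above:

```python
# Parse a Glue JSON (JSON Lines) output file: one JSON object per line.
import json

# Two made-up records in the shape of the exported V/T columns:
sample = (
    '{"V": "{...}", "T": "2022-01-01"}\n'
    '{"V": "{...}", "T": "2022-01-02"}\n'
)

rows = [json.loads(line) for line in sample.splitlines() if line.strip()]
print(len(rows), "rows; columns:", sorted(rows[0]))
```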
Step 3: Creating a Connection from S3 to Snowflake S3 Job
This is the final step to set up AWS Glue Snowflake Integration. Follow the steps below to create a connection from S3 to Snowflake S3 Job for AWS Glue Snowflake Integration:
1) Modification of the Connector
This step requires a slight modification of the Connector. Go to the connector, click on “Edit”, and change the JDBC URL Base so that it passes the Database name as an additional field (the db_name parameter). Similarly, edit the connection, and provide the name of the Database created in the “Snowflake Setup” section as the value of the db_name parameter.
2) Creating the Job Visually
Now, create a new Job. The job details will be similar to what you filled in Step 2: Creating a Connection from Snowflake to S3 ETL Job. Follow the steps below to create the Job visually:
- In the “Visual” tab, remove the snowflake_jdbc_connector node that appears by default.
- Click on the “Source” tab and select “Amazon S3”.
- Click on the node, and in the “Data Source Properties – S3” tab, set the source type as “S3 Location”, S3 URL as the URL of the JSON saved in the previous step, and data format as “JSON” as shown in the image below.
- Next, from the “Target” tab, select the “JDBC Connector”.
- Click on the node that appears and, within the node properties, select the “Amazon S3” node from the “Node parents” dropdown. This should establish a connection between the source and the target.
- Within the data target properties, choose the connection defined earlier from the dropdown, and in the “Table name” field, mention the name of the table created in the “Snowflake Setup” section.
- Now, save this job, and click on “Run”. Monitor the status in the “Runs” tab. If all goes well and the job status is “Succeeded”, you should be able to see the data populated in your Snowflake table.
Note: If this table is not empty, then data will be appended to it. This means that if you run the job twice, the table will contain 20 rows. Also, note that the column names are important. If instead of V and T, you had left the column names to the default C1 and C2, you would have gotten an error like the one below:
An error occurred while calling o87.pyWriteDynamicFrame. Glue ETL Marketplace: JDBC Data Sink: Column “V” not found in schema Some(StructType(StructField(C1,StringType,true), StructField(C2,StringType,true)))
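This kind of mismatch can be caught early with a small pre-flight check that compares the keys of one record in the S3 file against the target table’s columns. A minimal sketch (the sample record and column list mirror this article’s example):

```python
# Pre-flight check: do the JSON record's keys cover the target table's
# column names? Sample record and column set mirror the example above.
import json

target_columns = {"V", "T"}                       # columns of the Snowflake table
first_line = '{"V": "{...}", "T": "2022-01-01"}'  # made-up sample record

missing = target_columns - set(json.loads(first_line))
if missing:
    print("Missing columns:", sorted(missing))
else:
    print("Schema looks compatible")
```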
That’s it. You have successfully set up AWS Glue Snowflake Integration.
This article introduced you to AWS Glue Snowflake Integration and provided a comprehensive guide that you can use while manually setting it up, covering all the steps including the AWS Glue and Snowflake configuration. Although the above method to set up AWS Glue Snowflake Integration works well, it demands technical proficiency. Furthermore, you will have to build an in-house solution from scratch if you wish to transfer your data from AWS Glue to Snowflake or another Data Warehouse for analysis.
Hevo Data provides an Automated No-code Data Pipeline that empowers you to overcome the above-mentioned limitations. You can leverage Hevo to seamlessly transfer data from a multitude of sources to Snowflake in real-time without writing a single line of code. Hevo’s Data Pipeline enriches your data and manages the transfer process in a fully automated and secure manner. Hevo caters to 100+ data sources (including 40+ free sources) and can directly transfer data from these sources to Data Warehouses, Business Intelligence tools, or any other destination of your choice. It will make your life easier and make data migration hassle-free. Visit our Website to Explore Hevo
Want to take Hevo for a spin? Sign up for a 14-day free trial and experience the feature-rich Hevo suite firsthand.
Share your experience of setting up AWS Glue Snowflake Integration in the comments section below!