Amazon S3 to Redshift: 3 Easy Methods

on Tutorial, Amazon Redshift, AWS S3, Data Integration, Data Warehouse, ETL • February 22nd, 2020 • Write for Hevo

AWS S3 is a completely managed general-purpose storage mechanism offered by Amazon based on a software as a service business model.  S3 can be used to serve any storage requirement ranging from a simple backup service to archiving a full data warehouse. It can be used for any requirement of up to 5 TB of data. In the enterprise data pipelines, it is typical to use S3 as a staging location or a temporary data dumping location before loading data into a Data Warehouse for offline Data Analysis. For customers staying within the AWS ecosystem, a Redshift is a great option as a completely managed Data Warehouse service.

This post will introduce you to Amazon S3 and Redshift. Moreover, it will explain 3 step-by-step methods which will help you to connect Amazon S3 to Redshift easily. Read along to understand more about the steps, benefits, and limitations of these methods.

Table of Contents

Prerequisites

  • An S3 account.
  • A Redshift account.
  • Working knowledge of Redhsift commands.

Introduction to Amazon S3

Amazon S3 Logo
Image Source

As mentioned above AWS S3 is a completely managed object storage service accessed entirely through web APIs and AWS-provided CLI utilities. Like any completely managed service offered by Amazon, all operational activities related to pre-provisioning, capacity scaling, etc are abstracted away from users. The customers are required to pay for the amount of space that they use. Access controls are comprehensive enough to meet typical compliance requirements. It offers granular access controls to meet all kinds of organizational and business compliance requirements. 

AWS Athena and AWS redshift spectrum allow users to run analytical queries on data stored in S3 buckets. S3 offers high availability. This comes from the fact that it stores data across a cluster of distributed servers. This approach means there is a related propagation delay and S3 can only guarantee eventual consistency. S3 writes are atomic though. 

Structurally, S3 is envisioned as buckets and objects. A bucket is a container for storing all kinds of objects. It also represents the highest level of namespace. Buckets contain objects which represent the basic storage entity. An object is a fusion of the stored object as well as its metadata. A unique key and version identify an object uniquely.

To learn more about Amazon S3, visit here.

Introduction to AWS Redshift

Redshift Logo
Image Source

AWS Redshift is capable of executing complex queries over millions of runs and return instant results through a Postgres compatible querying layer. A massively parallel architecture made using a cluster of processing nodes is responsible for this capability. 

One of these nodes acts as the leader and handles activities related to client communication, query execution plans, and work assignments to other nodes. Redshift provides the customers with the flexibility to choose from different types of instances that suit their budget and nature of use cases.

Redshift can scale up to 2 PB of data and this is done by adding more nodes, upgrading nodes, or both. Redshift offers a unique feature called the concurrency scaling feature which makes scaling as seamless as it can without going over budget and resource limits set by customers. 

To learn more about Redshift, visit here.

Methods to Connect Amazon S3 to Redshift

Method 1: Using the COPY Command to Connect Amazon S3 to Redshift

Redshift’s native COPY command allows you to use AWS S3 as your data source and perform a direct bulk data load process. The allowed data source formats are CSV, JSON, or AVRO. 

Method 2: Using AWS Services to Connect Amazon S3 to Redshift

AWS offers a number of services that can be used to perform data load operations to Redshift Data Warehouse. AWS Glue and AWS Data pipeline are 2 such services that enable you to transfer data from Amazon S3 to Redshift.

Method 3: Using Hevo Data to Connect Amazon S3 to Redshift

Hevo Data provides a hassle-free solution and helps you directly transfer data from S3 to Redshift and numerous other Databases/Data Warehouses or destinations of your choice without any intervention in an effortless manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Hevo’s pre-built integration with Redshift along with 100+ other data sources (including 30+ free data sources) will take full charge of the data transfer process, allowing you to focus on key business activities for free.

Get Started with Hevo for Free

Methods to Connect Amazon S3 to Redshift

Loading data from S3 to Redshift can be accomplished in the following 3 ways:

Method 1: Using to COPY Command Connect Amazon S3 to Redshift

Redshift’s COPY command can use AWS S3 as a source and perform a bulk data load. The data source format can be CSV, JSON, or AVRO. 

Assuming the target table is already created, the simplest COPY command to load a CSV file from S3 to Redshift will be as below. The S3 data location here is the product_details.csv. Please ensure Redshift tables are created already.

copy product_tgt1

from 's3://productdata/product_tgt/product_details.csv

iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'

region 'us-east-2';

The above approach uses a single CSV file to load the data. A better approach in the case of large files will be to split the file to multiple smaller ones so that the COPY operation can exploit the parallel processing capability that is inherent to Redshift. This can be done using a manifest file that has the list of locations from which COPY operation should take its input files.

By default, the COPY operation tries to convert the source data types to Redshift data types. This implicit conversion can lead to unanticipated results if done without proper planning. To mitigate this, Redshift provides configuration options for explicit data type conversions. A configuration file can also be used to set up the source and target column name mapping. 

The COPY command is explored in detail here.

Method 2: Using AWS Services to Connect Amazon S3 to Redshift

AWS provides a number of alternatives to perform data load operations to Redshift. AWS Glue and AWS Data pipeline are two such services that can fit this requirement:

1) Using AWS Data Pipeline to Connect Amazon S3 to Redshift

AWS data pipeline hides away the complex details of setting up an  ETL pipeline behind a simple web UI. It does this by offering template activities that users can customize based on their requirements. The template activity which we will use here is the RedshiftCopyActivity. This activity supports S3 as a source type.

Different insert modes are possible in RedshiftCopyActivity – KEEP EXISTING, OVERWRITE EXISTING, TRUNCATE, APPEND. 

KEEP EXISTING and OVERWRITE EXISTING are here to enable the users to define if the rows with the same primary key are to be overwritten or kept as such

AWS Data pipeline and the features offered are explored in detail here.

2) Using AWS Glue to Connect Amazon S3 to Redshift

Glue is an Extract Transform and Load tool as a web service offered by Amazon. It works based on an elastic spark backend to execute the processing jobs. Glue supports S3 locations as a storage source in Glue scripts. Internally It uses the COPY and UNLOAD command to accomplish copying data to Redshift, but spares users from learning the COPY command configuration by abstracting away the details. It uses a script in its own proprietary domain-specific language to represent data flows.

Glue uses a concept called dynamic frames to represent the source and targets. S3 location is a supported dynamic frame. The complete script will look as below.

s3_connection_options = {"paths": [ "s3://src_data/src_table_1", "s3://src_data/src_table_2"]}

df = glueContext.create_dynamic_frame_from_options("s3_table", s3_connection_options)

The dynamic frame created using the above commands can then be used to execute a copy process as follows.

redshift_connection_options = {

    "dbtable": "target_table",

    "database": "target_database",

    "aws_iam_role": "arn:aws:iam::account-id:role/role-name"

}

glueContext.write_dynamic_frame.from_jdbc_conf(

    frame = s3_table, 

    catalog_connection = "redshift_con_name", 

    connection_options = redshift_con_options, 

    redshift_tmp_dir = args["TempDir"])

Writing a custom script for a simple process like this can seem a bit convoluted. Glue offers a simpler method using a web UI to automatically create these scripts if the above configurations are known. More details about Glue can be found here.

Limitations of the Above Approaches

The first method described here uses Redshift’s native abilities to load data from S3. This method has a number of limitations which are as follows:

  • This will work only in case of a first-time bulk load and if your use case needs incremental load, then a separate process involving a staging table will need to be implemented. 
  • The implicit data type conversions that happen by default can become a serious issue leading to data corruption.
  • In case you are looking to transform any data before loading to Redshift, these approaches do not accommodate that.

Method 3: Using Hevo Data to Connect Amazon S3 to Redshift

Hevo Data Logo
Image Source

Hevo Data, a No-code Data Pipeline, helps you directly transfer data from Amazon S3 and 100+ other data sources to Redshift, Databases, BI tools, or a destination of your choice in a completely hassle-free & automated manner for free. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

Hevo Data takes care of all your data preprocessing needs and lets you focus on key business activities and draw a much powerful insight on how to generate more leads, retain customers, and take your business to new heights of profitability. It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination.

Sign up here for a 14-Day Free Trial!

The following 3 steps are required to connect Amazon S3 to Redshift using Hevo Data:

  • Connect to S3 data source by providing credentials as shown in the below image.
Configuring S3 as source
Image Source
  • Select the mode of replication you want
  • Configure Redshift warehouse where the data needs to be moved as shown in the image below.
Configuring Redshift as Destnation
Image Source

More Reasons to Choose Hevo Data

  • Faster Implementation: A very quick 2-stage process to get your pipeline setup. After that, everything’s automated while you watch data sync to Snowflake or any other destination in real-time. 
  • Fully Managed Service & Live Support: Every aspect of the data transfer process is fully managed by Hevo. Hevo also provides a dedicated team of talented support engineers to guide you through any issues that arise.
  • Real-time Alerts & Notifications: With Hevo, you are always in the know about your data transfer pipelines. Receive real-time multiple-format notifications across your various devices.
  • 100% Complete & Accurate  Data Transfer: The entire ETL spectrum is covered by Hevo. Quality rules and expert engineering also ensure that data is always 100% transmitted so there’s no risk of data corruption or missing data.
  • Scalability Assured: Hevo simply scales to meet your needs today and tomorrow thanks to its world-class infrastructure and platform.
  • A wide array of Data Source Integrations: Hevo can bring data from not only Facebook Ads, but a wide array of data sources ranging from Sales and Marketing tools to databases and more.

Conclusion

This article discussed Amazon S3 and Redshift and explained the 3 methods using which you can connect S3 to Redshift with ease. AWS Services like Glue and Data pipeline abstracts away such details to an extent, but they can still become overwhelming for a first-time user. For someone to quickly create a load job from S3 to Redshift without going in deep into AWS configurations and other details, an ETL tool like Hevo which can accomplish this in a matter of clicks is a better alternative. 

Visit our Website to Explore Hevo

Hevo Data offers a faster way to move data from 100+ data sources such as SaaS applications or Databases such as S3 into your Redshift Data Warehouse to be visualized in a BI tool. Hevo is fully automated and hence does not require you to code.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand

No-code Data Pipeline for Redshift