AWS S3 is a completely managed, general-purpose object storage service offered by Amazon on a pay-as-you-use model. S3 can serve any storage requirement, ranging from a simple backup service to archiving a full data warehouse, and a single object can be up to 5 TB in size. In enterprise data pipelines, it is typical to use S3 as a staging location or a temporary data dumping location before loading data into a Data Warehouse for offline Data Analysis. For customers staying within the AWS ecosystem, Redshift is a great option as a completely managed Data Warehouse service.

This post will introduce you to Amazon S3 and Redshift. Moreover, it will explain 3 step-by-step methods which will help you to connect Amazon S3 to Redshift easily. Read along to understand more about the steps, benefits, and limitations of these methods.

Prerequisites

  • An AWS account with access to S3.
  • An Amazon Redshift cluster.
  • Working knowledge of Redshift commands.
Solve your data replication problems with Hevo’s reliable, no-code, automated pipelines with 150+ connectors.
Get your free trial right away!

What is Amazon S3?


As mentioned above, AWS S3 is a completely managed object storage service accessed entirely through web APIs and AWS-provided CLI utilities. Like any fully managed service offered by Amazon, operational activities such as provisioning and capacity scaling are abstracted away from users, and customers pay only for the storage they actually use. S3 also offers granular access controls that can meet typical organizational and business compliance requirements.
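For example, access for Redshift loads is typically granted through an IAM policy attached to the role that Redshift assumes. The snippet below is a minimal, hypothetical sketch; the bucket name and statement ID are illustrative and would need to match your own environment.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowRedshiftReadFromBucket",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::productdata",
        "arn:aws:s3:::productdata/*"
      ]
    }
  ]
}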

AWS Athena and Amazon Redshift Spectrum allow users to run analytical queries directly on data stored in S3 buckets. S3 offers high availability because it stores data redundantly across distributed servers. This distributed design historically introduced a propagation delay, so S3 could only guarantee eventual consistency for some operations, although it now provides strong read-after-write consistency. S3 writes are atomic.
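As an illustration, Redshift Spectrum can query files sitting in S3 through an external schema and external table. The snippet below is a rough sketch only; the schema, table, column, and bucket names are hypothetical, and it assumes the IAM role has access to both the Glue Data Catalog and the bucket.

create external schema spectrum_schema
from data catalog database 'spectrum_db'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
create external database if not exists;

create external table spectrum_schema.product_details (
    product_id   integer,
    product_name varchar(100),
    price        decimal(10,2)
)
row format delimited fields terminated by ','
stored as textfile
location 's3://productdata/product_tgt/';

select product_name, price
from spectrum_schema.product_details
where price > 100;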

Structurally, S3 is organized into buckets and objects. A bucket is a container for storing all kinds of objects and represents the highest level of the namespace. Buckets contain objects, which are the basic storage entities. An object consists of the stored data itself along with its metadata, and each object is uniquely identified by its key and, optionally, a version ID.

What is AWS Redshift?


AWS Redshift is capable of executing complex queries over millions of rows and returning results almost instantly through a Postgres-compatible querying layer. A massively parallel architecture built on a cluster of processing nodes is responsible for this capability. You can learn more about Amazon Redshift vs PostgreSQL here.

One of these nodes acts as the leader and handles activities related to client communication, query execution plans, and work assignments to other nodes. Redshift provides customers with the flexibility to choose from different types of instances that suit their budget and the nature of use cases.

Redshift can scale up to 2 PB of data, which is done by adding more nodes, upgrading nodes, or both. Redshift also offers a Concurrency Scaling feature that makes scaling as seamless as possible without going over the budget and resource limits set by customers.

Methods to Connect Amazon S3 to Redshift

Method 1: Using the COPY Command to Connect Amazon S3 to Redshift

Redshift’s native COPY command allows you to use AWS S3 as your data source and perform a direct bulk data load. Supported source formats include CSV, JSON, and Avro.

Method 2: Using AWS Services to Connect Amazon S3 to Redshift

AWS offers a number of services that can be used to load data into the Redshift Data Warehouse. AWS Glue and AWS Data Pipeline are 2 such services that enable you to transfer data from Amazon S3 to Redshift.

Method 3: Using Hevo’s No Code Data Pipeline to Connect Amazon S3 to Redshift

Hevo Data, an Automated No-code Data Pipeline, provides a hassle-free solution and helps you directly transfer data from S3 to Redshift within minutes. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code.

Hevo’s pre-built integration with S3 and 100+ other data sources will take full charge of your data transfer process and let you focus your engineering bandwidth on key business activities.

What are the Methods to Connect Amazon S3 to Redshift?

Loading data from S3 to Redshift can be accomplished in the following 3 ways:

Method 1: Using the COPY Command to Connect Amazon S3 to Redshift

Redshift’s COPY command can use AWS S3 as a source and perform a bulk data load. The data source format can be CSV, JSON, or Avro.

Assuming the target table has already been created in Redshift, the simplest COPY command to load a CSV file from S3 looks as below. The S3 location here points to the product_details.csv file.

copy product_tgt1
from 's3://productdata/product_tgt/product_details.csv'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
region 'us-east-2'
csv;

The above approach uses a single CSV file to load the data. A better approach for large files is to split the file into multiple smaller ones so that the COPY operation can exploit the parallel processing capability inherent to Redshift. This can be done using a manifest file that lists the locations from which the COPY operation should take its input files, as sketched below.
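For instance, a manifest is a small JSON file stored in S3 that lists the input files, and the COPY command then references it with the MANIFEST option. The file names below are hypothetical and only sketch the idea. A manifest file (e.g., s3://productdata/product_tgt/product_details.manifest) could look like this:

{
  "entries": [
    {"url": "s3://productdata/product_tgt/part-0001.csv", "mandatory": true},
    {"url": "s3://productdata/product_tgt/part-0002.csv", "mandatory": true}
  ]
}

The COPY command then points at the manifest instead of a single file:

copy product_tgt1
from 's3://productdata/product_tgt/product_details.manifest'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
region 'us-east-2'
csv
manifest;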

By default, the COPY operation tries to convert the source data types to Redshift data types. This implicit conversion can lead to unanticipated results if done without proper planning. To mitigate this, the COPY command provides options for explicit data type handling, and an explicit column list (or, for JSON sources, a JSONPaths file) can be used to map source fields to target columns.
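As a rough sketch, the COPY command accepts an explicit column list along with format and conversion options such as IGNOREHEADER, DATEFORMAT, and TIMEFORMAT. The column names below are hypothetical.

copy product_tgt1 (product_id, product_name, price, created_at)
from 's3://productdata/product_tgt/product_details.csv'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
region 'us-east-2'
csv
ignoreheader 1
dateformat 'auto'
timeformat 'auto';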

The COPY command is explored in detail here.

Method 2: Using AWS Services to Connect Amazon S3 to Redshift

AWS provides a number of alternatives to perform data load operations to Redshift. AWS Glue and AWS Data Pipeline are two such services that can fit this requirement:

1) Using AWS Data Pipeline to Connect Amazon S3 to Redshift

AWS Data Pipeline hides the complex details of setting up an ETL pipeline behind a simple web UI. It does this by offering template activities that users can customize based on their requirements. The template activity used here is the RedshiftCopyActivity, which supports S3 as a source type.

Different insert modes are possible in RedshiftCopyActivity: KEEP_EXISTING, OVERWRITE_EXISTING, TRUNCATE, and APPEND.

KEEP_EXISTING and OVERWRITE_EXISTING let users define whether rows with the same primary key should be kept as they are or overwritten.
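A pipeline definition wires the activity to an S3 input data node and a Redshift output data node. The fragment below is a hedged sketch of what the RedshiftCopyActivity object might look like; the node and resource IDs are hypothetical, and the surrounding data node and resource objects are omitted.

{
  "id": "RedshiftCopyActivityId",
  "type": "RedshiftCopyActivity",
  "insertMode": "TRUNCATE",
  "input": { "ref": "S3InputDataNodeId" },
  "output": { "ref": "RedshiftDataNodeId" },
  "runsOn": { "ref": "Ec2ResourceId" }
}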

AWS Data Pipeline and the features it offers are explored in detail here.

2) Using AWS Glue to Connect Amazon S3 to Redshift

AWS Glue is an Extract, Transform, and Load (ETL) service offered by Amazon as a web service. It runs its processing jobs on an elastic Spark backend. Glue supports S3 locations as a storage source in Glue scripts. Internally, it uses the COPY and UNLOAD commands to move data into and out of Redshift, but it spares users from learning the COPY command configuration by abstracting away the details. Glue data flows are expressed as scripts written in Python or Scala using the AWS Glue libraries.

Glue uses a concept called dynamic frames to represent sources and targets. An S3 location is a supported source for creating a dynamic frame. The relevant part of the script looks as below.

# Assuming the source files are CSV; adjust the format to match your data.
s3_connection_options = {"paths": ["s3://src_data/src_table_1", "s3://src_data/src_table_2"]}

s3_table = glueContext.create_dynamic_frame_from_options(
    connection_type="s3", connection_options=s3_connection_options, format="csv")

The dynamic frame created using the above commands can then be used to execute a copy process as follows.

redshift_connection_options = {
    "dbtable": "target_table",
    "database": "target_database",
    "aws_iam_role": "arn:aws:iam::account-id:role/role-name"
}

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = s3_table,
    catalog_connection = "redshift_con_name",
    connection_options = redshift_connection_options,
    redshift_tmp_dir = args["TempDir"])

Writing a custom script for a simple process like this can seem a bit convoluted. Glue offers a simpler method using a web UI to automatically create these scripts if the above configurations are known. More details about Glue can be found here.

Limitations of the Above Approaches

The first two methods described above rely on Redshift’s native COPY command and AWS services to load data from S3. These approaches have a number of limitations:

  • This works only for a first-time bulk load; if your use case needs incremental loads, a separate process involving a staging table has to be implemented (see the sketch after this list). 
  • The implicit data type conversions that happen by default can become a serious issue, potentially leading to data corruption.
  • If you need to transform the data before loading it into Redshift, these approaches do not accommodate that.
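For reference, a typical incremental-load workaround with the native COPY approach looks roughly like the following: load the new data into a staging table, delete the matching rows from the target, then insert from the staging table. The table and key column names below are hypothetical.

begin transaction;

create temp table product_stage (like product_tgt1);

copy product_stage
from 's3://productdata/product_tgt/product_details.csv'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
region 'us-east-2'
csv;

delete from product_tgt1
using product_stage
where product_tgt1.product_id = product_stage.product_id;

insert into product_tgt1
select * from product_stage;

end transaction;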

Method 3: Using Hevo’s No Code Data Pipeline to Connect Amazon S3 to Redshift


Hevo Data, an Automated No Code Data Pipeline can help you ETL your data swiftly from a multitude of sources to Redshift in real-time without any assistance from your side. You can set up the Redshift Destination on the fly, as part of the Pipeline creation process, or independently. Hevo allows you to load data from any of your Pipelines into an Amazon Redshift Data Warehouse. The ingested data is first staged in Hevo’s S3 bucket before it is batched and loaded to the Amazon Redshift Destination.

Hevo’s fault-tolerant ETL Pipeline offers you a secure option to unify data from 100+ other sources (including 40+ free sources) and store it in Redshift or any other Data Warehouse of your choice without writing a single line of code. You can entrust us with your data transfer process and enjoy a hassle-free experience. This way, you can focus more on Data Analysis, instead of data consolidation.

To learn more, check out Hevo’s documentation for Redshift.

The following 3 steps are required to connect Amazon S3 to Redshift using Hevo Data:

  • Connect to the S3 data source by providing credentials as shown in the below image.
Amazon S3 to Redshift: Configuring S3 as source | Hevo Data
  • Select the mode of replication you want
  • Configure the Redshift warehouse where the data needs to be moved as shown in the image below.
Amazon S3 to Redshift: Configuring Redshift as Destination | Hevo Data

More Reasons to Choose Hevo Data

  • Faster Implementation: A very quick 3-stage process to get your pipeline setup. After that, everything’s automated while you watch data sync to Redshift or any other destination in real time. 
  • Real-time Alerts & Notifications: With Hevo, you are always in the know about your data transfer pipelines. Receive real-time multiple-format notifications across your various devices.
  • 100% Complete & Accurate  Data Transfer: The entire ETL spectrum is covered by Hevo. Quality rules and expert engineering also ensure that data is always 100% transmitted so there’s no risk of data corruption or missing data.
  • Scalability Assured: Hevo simply scales to meet your needs today and tomorrow thanks to its world-class infrastructure and platform.
  • A wide array of Data Source Integrations: Hevo has native integrations with 100+ sources across databases, data warehouses, Streaming services, SDKs, etc. Click here for a list of 100+ integrations.

Use Hevo’s no-code data pipeline to seamlessly ETL your data from S3 and other multiple sources to Amazon Redshift in an automated way. Try our 14-day full feature access free trial!

Sign up here for a 14-Day Free Trial!

Conclusion

This article discussed Amazon S3 and Redshift and explained 3 methods that you can use to connect S3 to Redshift with ease. AWS services like Glue and Data Pipeline abstract away such details to an extent, but they can still become overwhelming for a first-time user. The first 2 methods involve manual effort, consume significant engineering bandwidth, and require knowledge of the AWS environment. For someone who wants to quickly create a load job from S3 to Redshift without going deep into AWS configurations and other details, an ETL tool like Hevo, which can accomplish this in a matter of clicks, is a better alternative.

Hevo Data provides an Automated No-code Data Pipeline that empowers you to overcome the above-mentioned limitations. You can leverage Hevo to seamlessly transfer data from S3 to Redshift in real time without writing a single line of code. Hevo’s Data Pipeline enriches your data and manages the transfer process in a fully automated and secure manner. Hevo caters to 100+ data sources (including 40+ free sources) and can directly transfer data to Data Warehouses, Business Intelligence Tools, or any other destination of your choice in a hassle-free manner. It will make your life easier and make data migration hassle-free.

Learn more about Hevo
Software Engineer, Hevo Data

With around a decade of experience, Sarad has designed and developed fundamental components of Hevo. His expertise lies in building lean solutions for various software problems, mentoring fellow engineers, and exploring new technologies.
