Amazon S3 to Redshift: Steps to Load Data in Minutes

Tutorial • February 22nd, 2020

AWS S3 is a fully managed, general-purpose object storage service offered by Amazon on a pay-as-you-go model. S3 can serve storage requirements ranging from a simple backup service to archiving an entire data warehouse, and individual objects can be as large as 5 TB. In enterprise data pipelines, it is typical to use S3 as a staging location or temporary dumping ground before loading data into a data warehouse for offline analysis. For customers staying within the AWS ecosystem, Redshift is a great option as a completely managed data warehouse service. In this post, we will learn how to load data from S3 to Redshift.

Understanding Amazon S3

As mentioned above, AWS S3 is a completely managed object storage service accessed entirely through web APIs and AWS-provided CLI utilities. Like any fully managed service offered by Amazon, all operational activities related to provisioning, capacity scaling, etc. are abstracted away from users, and customers pay only for the storage they actually use. S3 also offers granular access controls that are comprehensive enough to meet typical organizational and business compliance requirements.

AWS Athena and Amazon Redshift Spectrum allow users to run analytical queries directly on data stored in S3 buckets. S3 offers high availability because it stores data redundantly across a cluster of distributed servers. The trade-off is a propagation delay: S3 currently guarantees only eventual consistency for overwrites and deletes, although individual writes are atomic.
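As an illustration of querying S3 data in place, Redshift Spectrum only needs an external schema and an external table defined over the S3 location. The sketch below assumes a hypothetical catalog database, table, and column layout:

create external schema spectrum_schema
from data catalog
database 'product_catalog'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
create external database if not exists;

create external table spectrum_schema.product_details (
    product_id int,
    product_name varchar(100),
    price decimal(10,2))
row format delimited
fields terminated by ','
stored as textfile
location 's3://productdata/product_tgt/';

Once defined, the table can be queried with regular SQL from Redshift, for example select count(*) from spectrum_schema.product_details;.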

Structurally, S3 is organized into buckets and objects. A bucket is a container for storing all kinds of objects and represents the highest level of the namespace. Buckets contain objects, the basic storage entity; an object is the combination of the stored data and its metadata. An object is uniquely identified by its key and version.
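To make the addressing model concrete, here is a minimal boto3 sketch that fetches a single object by bucket, key, and (optionally) version; the bucket and key reuse the hypothetical names from the COPY example later in this post:

import boto3

s3 = boto3.client("s3")

# Fetch one object from the bucket; omit VersionId to read the latest version.
response = s3.get_object(
    Bucket="productdata",
    Key="product_tgt/product_details.csv",
    VersionId="<version-id>")
data = response["Body"].read()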

Understanding AWS Redshift

AWS Redshift is capable of executing complex queries over millions of rows and returning instant results through a PostgreSQL-compatible querying layer. A massively parallel processing architecture built on a cluster of compute nodes is responsible for this capability.

One of these nodes acts as the leader and handles activities related to client communication, query execution plans and work assignments to other nodes. Redshift architecture can be explored in detail here.

Redshift gives customers the flexibility to choose from different node types to suit their budget and the nature of their use case. Redshift pricing details are analyzed in a blog post here.

Redshift can scale up to 2 PB of data by adding more nodes, upgrading node types, or both. Redshift also offers a feature called Concurrency Scaling that handles spikes in query load seamlessly while staying within the budget and resource limits set by customers.

Loading Data from AWS S3 to Redshift

Loading data from S3 to Redshift can be accomplished in three ways.

  1. Using Redshift’s native COPY command
  2. Using AWS services like AWS Glue or AWS Data Pipeline
  3. Using a completely managed Data Integration platform like Hevo Data

S3 to Redshift: Using Redshift’s native COPY command

Redshift’s COPY command can use AWS S3 as a source and perform a bulk data load. The source data can be in formats such as CSV, JSON, or Avro.

Assuming the target table has already been created in Redshift, the simplest COPY command to load a CSV file from S3 looks as below. The S3 source here is the file product_details.csv.

copy product_tgt1
from 's3://productdata/product_tgt/product_details.csv'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
csv
region 'us-east-2';

The above approach uses a single CSV file to load the data. A better approach for large files is to split the file into multiple smaller ones so that the COPY operation can exploit the parallel processing capability inherent to Redshift. This can be done using a manifest file that lists the locations from which the COPY operation should take its input files.
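For illustration, a manifest is a small JSON file that lists the input files explicitly (the part file names below are hypothetical):

{
  "entries": [
    {"url": "s3://productdata/product_tgt/product_details_part1.csv", "mandatory": true},
    {"url": "s3://productdata/product_tgt/product_details_part2.csv", "mandatory": true}
  ]
}

The COPY command then points at the manifest file instead of a data file and adds the MANIFEST option:

copy product_tgt1
from 's3://productdata/product_tgt/product_details.manifest'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
manifest
csv
region 'us-east-2';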

By default, the COPY operation tries to convert the source data types to Redshift data types. This implicit conversion can lead to unanticipated results if done without proper planning. To mitigate this, COPY provides parameters for explicit data type handling, such as date and time formats, and the command can also specify how source fields map to target columns (for JSON sources, via a JSONPaths file).
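As a sketch of such explicit control, the column list and the DATEFORMAT/TIMEFORMAT parameters below are hypothetical additions to the earlier example; they tell COPY exactly which target columns to populate and how to parse date and time values instead of relying on implicit conversion:

copy product_tgt1 (product_id, product_name, created_date)
from 's3://productdata/product_tgt/product_details.csv'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
csv
dateformat 'YYYY-MM-DD'
timeformat 'auto'
region 'us-east-2';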

COPY command is explored in detail here.

S3 to Redshift: Using AWS Services

AWS provides a number of alternatives for loading data into Redshift. AWS Glue and AWS Data Pipeline are two such services that fit this requirement.

AWS Data Pipeline

AWS Data Pipeline hides the complex details of setting up an ETL pipeline behind a simple web UI. It does this by offering template activities that users can customize based on their requirements. The template activity we will use here is RedshiftCopyActivity, which supports S3 as a source type.

Different insert modes are possible in RedshiftCopyActivity – KEEP_EXISTING, OVERWRITE_EXISTING, TRUNCATE, and APPEND.

KEEP_EXISTING and OVERWRITE_EXISTING let users define whether rows with the same primary key should be kept as they are or overwritten by the incoming data.
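For reference, a RedshiftCopyActivity appears in the pipeline definition as a JSON object roughly like the sketch below; the data node and resource names are hypothetical and would be defined elsewhere in the same pipeline:

{
  "id": "S3ToRedshiftCopy",
  "type": "RedshiftCopyActivity",
  "insertMode": "OVERWRITE_EXISTING",
  "input": { "ref": "MyS3DataNode" },
  "output": { "ref": "MyRedshiftDataNode" },
  "runsOn": { "ref": "MyEc2Resource" }
}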

AWS Data Pipeline and the features it offers are explored in detail here.

AWS Glue

Glue is an Extract, Transform, and Load (ETL) web service offered by Amazon. It executes processing jobs on an elastic Spark backend and supports S3 locations as a storage source in Glue scripts. Internally, it uses the COPY and UNLOAD commands to move data to and from Redshift, but it spares users from learning the COPY command configuration by abstracting away the details. Data flows are represented as scripts written in Python or Scala that use Glue's own API on top of Spark.

Glue uses a concept called dynamic frames to represent sources and targets, and S3 locations are a supported source for dynamic frames. The relevant parts of the script look as below.

s3_connection_options = {"paths": ["s3://src_data/src_table_1", "s3://src_data/src_table_2"]}

# Read the source files into a dynamic frame; the source format is assumed to be CSV here.
s3_table = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = s3_connection_options,
    format = "csv")

The dynamic frame created above can then be used to execute a copy process to Redshift as follows.

redshift_connection_options = {
    "dbtable": "target_table",
    "database": "target_database",
    "aws_iam_role": "arn:aws:iam::account-id:role/role-name"
}

# Write the dynamic frame to Redshift through the catalog connection named redshift_con_name;
# Glue runs COPY under the hood and stages the data in the temporary S3 directory.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = s3_table,
    catalog_connection = "redshift_con_name",
    connection_options = redshift_connection_options,
    redshift_tmp_dir = args["TempDir"])

Writing a custom script for a simple process like this can seem a bit convoluted. Glue offers a simpler method using a web UI to automatically create these scripts if the above configurations are known. More details about Glue can be found here.

Limitations of the Above Approaches

The first method described here uses Redshift’s native abilities to load data from S3. This method has a number of limitations.

  1. This works well for a first-time bulk load. If your use case needs incremental loads, a separate process involving a staging table has to be implemented (a sketch of this merge pattern follows this list).
  2. The implicit data type conversions that happen by default can become a serious issue, leading to corrupted or incorrectly typed data.
  3. If you need to transform the data before loading it into Redshift, this approach does not accommodate that.
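Below is a sketch of the staging-table pattern commonly used to work around the first limitation; the staging table, file name, and key column are hypothetical and follow the earlier example:

begin;

create temp table product_staging (like product_tgt1);

copy product_staging
from 's3://productdata/product_tgt/product_details_increment.csv'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
csv
region 'us-east-2';

-- Remove rows that are about to be replaced, then insert the new versions.
delete from product_tgt1
using product_staging
where product_tgt1.product_id = product_staging.product_id;

insert into product_tgt1
select * from product_staging;

end;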

AWS services like Glue and Data Pipeline abstract away such details to an extent, but they can still be overwhelming for a first-time user. For someone who wants to quickly create a load job from S3 to Redshift without going deep into AWS configurations and other details, an ETL tool like Hevo, which can accomplish this in a matter of clicks, is a better alternative.

Load data from S3 to Redshift using Hevo

Hevo is a fully managed Data Integration platform that can help you load data from not just S3, but many other data sources into Redshift in real-time. Using a fully managed Data Pipeline platform like Hevo, you will be able to overcome all the limitations of the methods mentioned previously. Hevo can bring data from a variety of data sources, both within and outside of the AWS ecosystem, into Redshift in just a few minutes. Here are the steps to move data from S3 to Redshift using Hevo:

  • Connect to the S3 data source by providing credentials
  • Select the mode of replication you want
  • Configure the Redshift warehouse where the data needs to be moved

Advantages of using Hevo to load data to Redshift:

  1. Code-free, real-time data integration
  2. Automatic schema detection and mapping
  3. Ability to transform the data before and after loading it to the warehouse
  4. Fault-tolerant, reliable system with zero data loss guarantee

Explore the features here and sign up for a free trial to experience hassle-free data loading to Redshift first-hand.
