Loading Data to Redshift: 4 Best Methods

on Data Integration • October 19th, 2019

Amazon Redshift has emerged as a popular fully managed data warehousing service. Redshift offers a flexible pay-as-you-use pricing model that lets customers pay for the storage and the instance types they use. More and more businesses are choosing to adopt Redshift for their warehousing needs. In this blog, we discuss one of the key aspects of building your Redshift data warehouse: loading data to Redshift.

Key Features of Amazon Redshift

Redshift’s biggest advantage is its ability to run complex queries over millions of rows and return ultra-quick results. The querying layer is implemented based on the PostgreSQL standard. This is made possible by Redshift’s massively parallel processing (MPP) architecture, which uses a collection of compute instances for storage and processing. Redshift lets customers choose among different instance types according to their budget and whether their use case is storage-intensive or compute-intensive: the dense compute instances come with SSDs, while the dense storage instances come with HDDs. A detailed article on Redshift pricing can be found here.

In Redshift’s massively parallel processing architecture, one of the instances is designated as a leader node, while the other nodes, known as compute nodes, are responsible for actually executing the queries. The leader node handles client communication, prepares query execution plans, and assigns work to the compute nodes according to the slices of data they handle. A detailed explanation of Redshift architecture can be found here.

Redshift can scale seamlessly by adding more nodes, upgrading nodes, or both, up to a fixed limit of 2 PB of data. The latest generation of Redshift nodes can reduce scaling downtime to a few minutes. Redshift also offers a feature called concurrency scaling, which can scale the instances automatically during high-load periods while adhering to the budget and resource limits predefined by customers.

Concurrency scaling is priced separately, but users get one free hour of concurrency scaling for every 24 hours a Redshift cluster stays operational. Redshift Spectrum is another unique feature offered by AWS, which allows customers to use only the processing capability of Redshift: in this case the data is stored in AWS S3 and is not loaded into Redshift tables.
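To illustrate how Spectrum works, here is a hedged sketch of querying S3-resident data through an external schema backed by the AWS Glue Data Catalog; the schema, database, table, and IAM role names are hypothetical.

-- Register an external schema backed by the Glue Data Catalog (names are hypothetical).
create external schema spectrum_demo
from data catalog
database 'product_catalog'
iam_role 'arn:aws:iam::<aws-account-id>:role/<spectrum-role>'
create external database if not exists;

-- The table's data stays in S3; Redshift only provides the processing.
select category, count(*) as product_count
from spectrum_demo.product_events
group by category;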

Various Methods of Loading Data to Redshift

Method 1: Loading Data to Redshift using the Copy Command

The Redshift Copy Command is one of the most popular ways of importing data into Redshift and supports loading data of various formats such as CSV, JSON, AVRO, etc. This method makes use of DynamoDB, S3 or the EMR cluster to facilitate the data load process and works well with bulk data loads.

Method 2: Loading Data to Redshift using the Insert Into Command

This method makes use of the Redshift Insert Into Command. The Insert Into Command is based on PostgreSQL and allows users to insert single or multiple records of data into their Redshift tables. It requires users to write queries to insert data as per their requirement.

Method 3: Loading Data to Redshift using AWS Services

This method uses utilities provided by Amazon Web Services, such as AWS Data Pipeline or AWS Glue, an ETL tool, to bring in data from a variety of sources and load it into Redshift. Under the hood, these services rely on the COPY and UNLOAD commands.

Method 4: Using Hevo Data, a No-code Data Pipeline

A fully managed, No-code Data Pipeline platform like Hevo Data helps you load data from 100+ sources to Redshift in real time, in an effortless manner. With its minimal learning curve, Hevo can be set up in a matter of minutes, leaving users ready to load data without compromising performance. Its strong integration with various sources such as databases, files, analytics engines, etc. gives users the flexibility to bring in data of all kinds as smoothly as possible, without having to write a single line of code.

Get started with Hevo! Sign up for a 14-day free trial! 

Methods for Loading Data to Redshift

Redshift provides multiple ways of loading data from various sources. At a broad level, data loading mechanisms for Redshift can be categorized into the following methods:

Method 1: Loading Data to Redshift using the Copy Command

The Redshift COPY command is the standard way of loading bulk data into Redshift. The COPY command can load data from the following sources: Amazon S3, Amazon EMR, DynamoDB, and remote hosts over SSH.

Other than specifying the locations of the files from which data has to be fetched, the COPY command can also use a manifest file that contains a list of file locations. This approach is recommended because the COPY command operates in parallel, and loading a list of smaller files is faster than loading one large file: when data is loaded from multiple files, the workload is distributed among the nodes in the cluster.
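For illustration, a manifest is a small JSON file that lists the input files explicitly; the bucket, file, and table names below are hypothetical. The COPY command then references the manifest with the MANIFEST keyword.

{
  "entries": [
    {"url": "s3://productdata/product_tgt/part-0000.txt", "mandatory": true},
    {"url": "s3://productdata/product_tgt/part-0001.txt", "mandatory": true}
  ]
}

copy product_tgt1
from 's3://productdata/manifests/product_tgt.manifest'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
manifest;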

The COPY command accepts several input file formats, including CSV, JSON, Avro, etc.

It is also possible to provide a column list (or, for JSON and Avro, a JSONPaths file) to control which fields in the input files are written to which Redshift columns.

The COPY command also supports simple implicit data conversions: if nothing is specified, input values are converted automatically to the data types of the target table’s columns.

The simplest COPY command, loading data from an S3 location into a Redshift target table named product_tgt1, is as follows. The target table must be created in Redshift beforehand for this to work.

copy product_tgt1
from 's3://productdata/product_tgt/product_tgt1.txt'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
region 'us-east-2';
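For context, here is a hedged sketch of what the required setup and a slightly richer COPY might look like; the column definitions, CSV options, and column list are assumptions for illustration, not part of the original example.

-- Hypothetical definition of the target table referenced above.
create table product_tgt1 (
    product_id   integer,
    product_name varchar(100),
    category     varchar(50),
    price        decimal(10,2)
);

-- A COPY variant combining an explicit column list with format and header options.
copy product_tgt1 (product_id, product_name, category, price)
from 's3://productdata/product_tgt/product_tgt1.txt'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
region 'us-east-2'
format as csv
ignoreheader 1;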

A detailed explanation of how to use the COPY command is available here.

Method 2: Loading Data to Redshift using Insert Into Command

Redshift’s INSERT INTO command is implemented based on PostgreSQL. The simplest example of the INSERT INTO command, inserting a single record with four column values into a table named employee_records, is as follows.

INSERT INTO employee_records(emp_id,department,designation,category)
values(1,'admin','assistant','contract');

It can perform insertions based on the following kinds of input records:

  • The above code snippet is an example of a single-row insert with the column names specified in the command. This means the column values have to be in the same order as the provided column names.
  • An alternative is the single-row insert without column names. In this case, the values are inserted into the table’s columns in the order in which they are defined.
  • The INSERT INTO command also supports multi-row inserts, where the column values are provided as a list of records (see the sketch after this list).
  • The command can also insert rows based on a query. In that case, the query must return the values to be inserted into the exact columns, in the same order specified in the command.
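To make the multi-row and query-based forms concrete, here is a hedged sketch against the same employee_records table; the staging table and the specific values are hypothetical.

-- Multi-row insert: several records supplied in one statement.
INSERT INTO employee_records(emp_id,department,designation,category)
VALUES (2,'finance','analyst','permanent'),
       (3,'admin','manager','permanent');

-- Query-based insert: the SELECT must return values for the target columns in order.
INSERT INTO employee_records(emp_id,department,designation,category)
SELECT emp_id, dept_name, role, 'contract'
FROM staging_employees
WHERE hired_on >= '2019-01-01';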

A detailed note on the Redshift INSERT INTO command is available here.

Even though the INSERT INTO command is very flexible, it can lead to surprising errors because of implicit data type conversions. It is also not suitable for bulk inserts of data.

Method 3: Loading Data to Redshift using AWS Services

AWS provides a set of utilities for loading data from different sources into Redshift. AWS Glue and AWS Data Pipeline are two of the easiest-to-use services for loading data into Redshift tables.

AWS Data Pipeline

AWS Data Pipeline is a web service that offers extraction, transformation, and loading of data as a service. Its power comes from Amazon’s Elastic MapReduce platform. This relieves users of the headache of implementing a complex ETL framework and helps them focus on the actual business logic.

AWS Data Pipeline offers a template activity called RedshiftCopyActivity that can be used to copy data from different kinds of sources to Redshift. RedshiftCopyActivity can copy data from the following sources: Amazon S3 and DynamoDB.

RedshiftCopyActivity supports four insert modes: KEEP_EXISTING, OVERWRITE_EXISTING, TRUNCATE, and APPEND.

KEEP_EXISTING and OVERWRITE_EXISTING take the primary key and sort keys of the Redshift table into account and let users control whether to keep or overwrite current rows when rows with the same primary keys are detected.
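As a rough illustration only, the fragment below shows how a RedshiftCopyActivity might be declared inside a pipeline definition; the object ids and references are assumptions, and a complete pipeline would also need the S3 and Redshift data nodes, a schedule, and an EC2 resource, which are omitted here.

{
  "objects": [
    {
      "id": "RedshiftCopy",
      "name": "CopyS3ToRedshift",
      "type": "RedshiftCopyActivity",
      "insertMode": "TRUNCATE",
      "input":  { "ref": "S3InputDataNode" },
      "output": { "ref": "RedshiftTargetDataNode" },
      "runsOn": { "ref": "Ec2ResourceForCopy" }
    }
  ]
}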

A detailed explanation of how to use the AWS Data pipeline can be found here.

AWS Glue

AWS Glue is an ETL tool offered as a service by Amazon that uses an elastic Spark backend to execute jobs. Glue can discover new data whenever it arrives in the AWS ecosystem and store its metadata in catalogue tables.

Internally, Glue uses the COPY and UNLOAD commands to move data into and out of Redshift. To execute a copy operation, users need to write a Glue ETL script (in Python or Scala) that uses Glue’s own extensions on top of Spark.

Glue works with dynamic frames. Before executing the copy activity, users need to create a dynamic frame from the data source. Assuming the data is present in S3, this is done as follows.

connection_options = {"paths": [ "s3://product_data/products_1", "s3://product_data/products_2"]}

df = glueContext.create_dynamic_frame_from_options("s3_source", connection-options)

The above command creates a dynamic frame from two S3 locations. This dynamic frame can then be used to execute a copy operation as follows.

# Connection options describing the Redshift target table, database, and IAM role.
connection_options = {
    "dbtable": "redshift-target-table",
    "database": "redshift-target-database",
    "aws_iam_role": "arn:aws:iam::account-id:role/role-name"
}

# Write the dynamic frame to Redshift via the catalogued JDBC connection;
# Glue stages the data in the temporary S3 directory and issues a COPY behind the scenes.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = s3_source,
    catalog_connection = "redshift-connection-name",
    connection_options = connection_options,
    redshift_tmp_dir = args["TempDir"])

The above method of writing custom scripts may seem a bit overwhelming at first. Glue can also auto-generate these scripts through its web UI if the above configurations are known. You can read more about Glue here.

Method 4: Using Hevo Data, a No-code Data Pipeline

Hevo Data, a No-code Data Pipeline can help you move data from 100+ sources swiftly to Redshift. Hevo is fully-managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination. It allows you to focus on key business needs and perform insightful analysis using BI tools such as Tableau and many more. 

Steps to use Hevo Data:

Hevo Data focuses on two simple steps to get you started:

Configure Source: Connect Hevo Data with your desired data source in just a few clicks. You can choose from a variety of sources such as MongoDB, JIRA, Salesforce, Zendesk, Marketo, Google Analytics, Google Drive, and a lot more.

Integrate Data: Load data from your desired data source to Redshift by simply providing your Redshift database credentials. Enter a name for your database, the host and port number for your Redshift database and connect in a matter of minutes.

Connecting with Redshift using Hevo.

Advantages of using Hevo Data Platform:

  • Real-Time Data Export: Hevo with its strong integration with 100+ sources, allows you to transfer data quickly & efficiently. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects schema of incoming data and maps it to the destination schema.
  • Minimal Learning: With its simple and interactive UI, Hevo is extremely easy for new customers to work with and perform operations on.
  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.

Conclusion

The above sections detail different ways of loading data into Redshift. The first two methods, the COPY and INSERT INTO commands, use Redshift’s native abilities, while the last two build abstraction layers over the native methods. Beyond these, it is also possible to build custom ETL tooling on top of Redshift’s native functionality, since AWS’s own services have some limitations when it comes to data sources outside the AWS ecosystem; however, that comes at the cost of time and precious engineering resources.

Want to try Hevo? Sign up for a 14-day free trial and experience the feature-rich Hevo suite first hand. Have a look at our unbeatable pricing, that will help you choose the right plan for you.

What are your thoughts on moving data to Redshift? Let us know in the comments.
