Press "Enter" to skip to content

Loading Data to Redshift – 4 Best Methods

Amazon Redshift has emerged as a popular fully managed data warehousing service. Redshift offers a flexible pay-as-you-use pricing model, which allows customers to pay only for the storage and instance types they use, and more and more businesses are adopting it for their warehousing needs. In this blog, we discuss one of the key aspects of building your Redshift data warehouse: loading data to Redshift.


Amazon Redshift Features

Redshift’s biggest advantage is its ability to run complex queries over millions of rows and return results very quickly. Its querying layer is based on the PostgreSQL standard. This performance is made possible by Redshift’s massively parallel processing architecture, which uses a collection of compute instances for storage and processing. Customers can choose among different instance types according to their budget and whether their use case is storage-intensive or compute-intensive: dense compute instances come with SSDs, while dense storage instances come with HDDs. A detailed article on Redshift pricing can be found here.

In Redshift’s massively parallel processing architecture, one of the instances is designated as the leader node. The other nodes, known as compute nodes, are responsible for actually executing the queries. The leader node handles client communication, prepares query execution plans, and assigns work to the compute nodes according to the slices of data they handle. A detailed explanation of Redshift architecture can be found here.

Redshift can scale seamlessly by adding more nodes, upgrading nodes, or both, up to a limit of 2 PB of data. The latest generation of Redshift nodes reduces scaling downtime to a few minutes. Redshift also offers a feature called concurrency scaling, which adds capacity automatically during periods of high load while adhering to the budget and resource limits predefined by customers. Concurrency scaling is priced separately, but users receive one free hour of concurrency scaling for every 24 hours a Redshift cluster stays operational. Redshift Spectrum is another unique feature offered by AWS, which allows customers to use only Redshift’s processing capability: the data, in this case, is stored in AWS S3 and not loaded into Redshift tables.

Loading Data to Redshift

Redshift provides multiple ways of loading data from various sources. At a broad level, data loading mechanisms for Redshift can be categorized into the following methods:

  1. Redshift bulk loading using the COPY command
  2. Redshift data loading using INSERT INTO command
  3. Redshift data loading using AWS Services
  4. Using fully-managed data integration platforms

Redshift COPY Command

The Redshift COPY command is the standard way of loading bulk data into Redshift. The COPY command can load data from the following sources:

  • DynamoDB
  • Amazon S3 storage
  • Amazon EMR cluster

Other than specifying the locations of the files from which data is to be fetched, the COPY command can also use a manifest file, which contains a list of file locations. This approach is recommended because the COPY command supports parallel operation, and copying from a list of small files is faster than copying a single large file: when loading from multiple files, the workload is distributed among the nodes in the cluster.
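
To make the manifest approach concrete, here is a rough sketch, assuming the bucket, file names, cluster identifier, and database user below are hypothetical placeholders. It uploads a small manifest file with boto3 and then issues a COPY that references the manifest with the MANIFEST keyword, in this case through the Amazon Redshift Data API; the same SQL could be run from any SQL client connected to the cluster.

# Hedged sketch: load a set of files listed in a manifest. All names below
# are hypothetical placeholders.
import json
import boto3

# Manifest format expected by COPY: a list of file entries; "mandatory": true
# makes the load fail if a listed file is missing.
manifest = {
    "entries": [
        {"url": "s3://productdata/product_tgt/part_00.txt", "mandatory": True},
        {"url": "s3://productdata/product_tgt/part_01.txt", "mandatory": True},
    ]
}

s3 = boto3.client("s3")
s3.put_object(
    Bucket="productdata",
    Key="product_tgt/product_tgt.manifest",
    Body=json.dumps(manifest),
)

copy_sql = """
    copy product_tgt1
    from 's3://productdata/product_tgt/product_tgt.manifest'
    iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
    manifest
    region 'us-east-2';
"""

# Run the COPY through the Redshift Data API.
redshift_data = boto3.client("redshift-data")
redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=copy_sql,
)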

The COPY command accepts several input file formats, including CSV, JSON, and Avro.

It is possible to provide a column mapping (a column list in the COPY command, or a JSONPaths file for JSON data) to configure which fields in the input files get written to specific Redshift columns.

The COPY command also supports simple implicit data conversions. If nothing is specified, input values are converted automatically to the data types of the target table’s columns.
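
As a hedged illustration of both points, the sketch below runs a COPY that names the target columns explicitly and lets Redshift convert the CSV text values into each column’s data type. The cluster endpoint, credentials, column names, and file name are hypothetical placeholders, and psycopg2 is just one of several clients that can submit the statement.

# Hedged sketch: COPY with an explicit column list; Redshift implicitly
# converts the input text into each target column's data type.
# Connection details, columns, and file names are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-2.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="...",
)

copy_sql = """
    copy product_tgt1 (product_id, product_name, price)
    from 's3://productdata/product_tgt/product_tgt1.csv'
    iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
    csv
    ignoreheader 1
    region 'us-east-2';
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # columns not named in the column list are filled with their defaults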

The simplest COPY command for loading data from an S3 location to a Redshift target table named product_tgt1 is as follows. The target table should be created in Redshift beforehand for this to work.

copy product_tgt1
from 's3://productdata/product_tgt/product_tgt1.txt'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
region 'us-east-2';

A detailed explanation of how to use the COPY command is available here.

Loading Data using INSERT INTO Command

Redshift’s INSERT INTO command is based on PostgreSQL. The simplest example of the INSERT INTO command, inserting a single row with four column values into a table named employee_records, is as follows.

INSERT INTO employee_records (emp_id, department, designation, category) VALUES (1, 'admin', 'assistant', 'contract');

It can perform insertions based on the following kinds of input records:

  • The code snippet above is an example of a single-row insert with column names specified in the command. The column values must be supplied in the same order as the listed column names.
  • An alternative is a single-row insert without column names. In this case, the values are always inserted into the first n columns, in order.
  • The INSERT INTO command also supports multi-row inserts, where the column values are provided as a list of records.
  • The command can also insert rows based on a query. In that case, the query should return values for exactly the columns specified in the command, in the same order. (A multi-row insert and a query-based insert are both sketched below.)
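
As a rough sketch of the last two cases, a multi-row insert and a query-based insert issued through psycopg2 would look like this. The cluster endpoint, credentials, and the employee_records_staging table are hypothetical placeholders.

# Hedged sketch of multi-row and query-based inserts; connection details
# and the staging table are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-2.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="...",
)

with conn, conn.cursor() as cur:
    # Multi-row insert: several records supplied in a single VALUES list.
    cur.execute("""
        INSERT INTO employee_records (emp_id, department, designation, category)
        VALUES (2, 'finance', 'analyst', 'permanent'),
               (3, 'hr', 'manager', 'permanent');
    """)
    # Query-based insert: the SELECT must return values for the target
    # columns in the same order as they are listed in the command.
    cur.execute("""
        INSERT INTO employee_records (emp_id, department, designation, category)
        SELECT emp_id, department, designation, category
        FROM employee_records_staging;
    """)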

A detailed note on the Redshift INSERT INTO command is available here.

Even though the INSERT INTO command is very flexible, it can lead to surprising errors because of implicit data type conversions. It is also not suitable for bulk inserts of data.

Data Loading using AWS Services

AWS provides a set of utilities for loading data from different sources into Redshift. AWS Glue and AWS Data Pipeline are two of the easiest-to-use services for loading data into Redshift tables.

AWS Data Pipeline

AWS Data Pipeline is a web service that offers extraction, transformation, and loading of data as a service. Its power comes from Amazon’s Elastic MapReduce platform. This relieves users of the headache of implementing a complex ETL framework and helps them focus on the actual business logic.

AWS Data Pipeline offers a template activity called RedshiftCopyActivity that can be used to copy data from different kinds of sources to Redshift. RedshiftCopyActivity can copy data from the following sources:

  • Amazon RDS
  • Amazon EMR
  • Amazon S3 storage

RedshiftCopyActivity supports different insert modes: KEEP_EXISTING, OVERWRITE_EXISTING, TRUNCATE, and APPEND.

KEEP_EXISTING and OVERWRITE_EXISTING consider the primary key and sort keys of the Redshift table and allow users to control whether to keep or overwrite the current rows when incoming rows with the same primary keys are detected.
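
For illustration, the fragment below shows roughly what a RedshiftCopyActivity object looks like inside a pipeline definition, written here as a Python dict that mirrors the Data Pipeline JSON definition format. The IDs are hypothetical, and the referenced S3 input node, Redshift data node, and EC2 resource (not shown) would need to be defined as separate objects in the same definition.

# Hedged sketch of a RedshiftCopyActivity object; all ids and references
# are hypothetical, and the data nodes, schedule, and resource objects it
# points to are omitted.
pipeline_definition = {
    "objects": [
        {
            "id": "RedshiftLoadActivity",
            "type": "RedshiftCopyActivity",
            "insertMode": "OVERWRITE_EXISTING",  # or KEEP_EXISTING, TRUNCATE, APPEND
            "input": {"ref": "S3InputDataNode"},
            "output": {"ref": "RedshiftTargetDataNode"},
            "runsOn": {"ref": "Ec2WorkerInstance"},
        }
    ]
}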

A detailed explanation of how to use the AWS Data pipeline can be found here.

AWS Glue

Glue is an ETL tool offered as a service by Amazon that uses an elastic Spark backend to execute jobs. Glue can discover new data whenever it arrives in the AWS ecosystem and store its metadata in catalog tables.

Internally, Glue uses the COPY and UNLOAD commands to move data into and out of Redshift. To execute a copy operation, users need to write a Glue script using Glue’s own library of constructs, in Python or Scala.

Glue works with dynamic frames. Before executing the copy activity, users need to create a dynamic frame from the data source. Assuming the data is present in S3, this is done as follows.

connection_options = {"paths": [ "s3://product_data/products_1", "s3://product_data/products_2"]}

df = glueContext.create_dynamic_frame_from_options("s3_source", connection-options)

The above command creates a dynamic frame from two S3 locations. This dynamic frame can then be used to execute a copy operation as follows.

connection_options = {
    "dbtable": "redshift-target-table",
    "database": "redshift-target-database",
    "aws_iam_role": "arn:aws:iam::account-id:role/role-name"
}

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = s3_source,
    catalog_connection = "redshift-connection-name",
    connection_options = connection_options,
    redshift_tmp_dir = args["TempDir"])

The above method of writing custom scripts may seem a bit overwhelming at first. Glue can also auto-generate these scripts through its web UI if the above configurations are known. You can read more about Glue here.

Loading Data to Redshift Using Hevo

Using a fully managed, automated data pipeline platform like Hevo, you can overcome all the limitations of the methods mentioned previously. Hevo can bring data into Redshift from a variety of data sources, both within and outside the AWS ecosystem, in just a few minutes. Here are some of the things Hevo can take care of for you:

Advantages of using Hevo to load data to Redshift:

  1. Code-free, real-time data integration
  2. Automatic schema detection and mapping
  3. Ability to transform the data before and after loading it to the warehouse
  4. Fault-tolerant, reliable system with zero data loss guarantee

Explore the features here and sign up for a free trial to experience hassle-free data loading to Redshift first-hand.

Concluding Note

The sections above detail different ways of copying data to Redshift. The first two methods, the COPY and INSERT INTO commands, use Redshift’s native abilities, while the last two build abstraction layers over the native methods. It is also possible to build custom ETL tools on top of Redshift’s native functionality. AWS’s own services have some limitations when it comes to data sources outside the AWS ecosystem, and all of this comes at the cost of time and precious engineering resources. If you are keen to make swift progress on your data projects, you can try Hevo, your fully managed data pipeline as a service.
