Are you looking to perform an Amazon Redshift bulk load? If yes, you are in the right place! Redshift is Amazon's fully managed, low-cost, petabyte-scale analytics data warehouse in the cloud.

With Redshift, you can query petabytes of data without having any infrastructure to manage or needing a database administrator.

Much of the data that organizations load into Redshift arrives as a large number of small files, each holding up to millions of records, and this poses a significant challenge for any organization that relies on fresh data to improve Machine Learning performance. Inefficient data loading into Redshift results in slow data migration, which means the business does not have the data it needs at the right time. Cumbersome data format conversions can also degrade data quality, and spreading the work across many small jobs makes the data flow harder to maintain. To address these challenges, this post walks you through how to achieve a Redshift bulk load in an automated fashion, avoiding repetitive SQL scripting.

Prerequisites

Before you begin, complete the following prerequisites:

  • You need an AWS account. If you don’t already have an AWS account, you must create one. 
  • You need a Redshift cluster. Sign in to the AWS console and create a new Redshift cluster.
  • You need to create a table in your Redshift cluster (a sample table definition is sketched after this list).
  • You need an S3 bucket to stage your files. For more information, see Create a Bucket.
  • You need a browser, such as Chrome or Firefox.
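
If you do not yet have a target table, here is a minimal sketch of how you might create one from a short Python script. The customers table layout, the cluster endpoint, the credentials, and the choice of the psycopg2 driver are all assumptions for illustration; adjust the columns to match your own data.

# Minimal sketch: create a hypothetical "customers" target table in Redshift.
# The column layout, endpoint, and credentials below are placeholders.
import psycopg2  # any PostgreSQL-compatible driver can connect to Redshift

conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",  # your cluster endpoint
    port=5439,      # default Redshift port
    dbname="dev",
    user="awsuser",
    password="my-password",
)

create_sql = """
CREATE TABLE IF NOT EXISTS customers (
    customer_id INTEGER,
    first_name  VARCHAR(50),
    last_name   VARCHAR(50),
    email       VARCHAR(100),
    signup_date DATE
);
"""

with conn, conn.cursor() as cur:
    cur.execute(create_sql)
conn.close()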

Method 1: Use the COPY Command for Amazon Redshift Bulk Load

In this method, you will use the SQL COPY command with key-based access control to load data into Redshift. The COPY command can load data in parallel from Amazon DynamoDB, Amazon EMR, Amazon S3, or a remote host; the remote host needs to be a Unix or Linux computer configured to accept Secure Shell (SSH) connections.

You can take maximum advantage of Amazon Redshift's massively parallel processing (MPP) architecture by splitting your data into multiple files, which lets you load bulk data much more efficiently. When using the COPY command, the files must have the same structure as the target table in your Amazon Redshift cluster.

Step 1: Split your data into multiple files

  • By default, the COPY command expects the source data to be in character-delimited UTF-8 text files; it can also load data in CSV, JSON, Avro, Parquet, and ORC formats. Split your data into multiple files so that the files are evenly distributed across the slices in your cluster.
  • Make sure that the files are roughly the same size, between 1 MB and 1 GB after compression.
  • For example, the customers.csv file might be split into four files, as follows:
customers.csv.1
customers.csv.2
customers.csv.3
customers.csv.4
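
If you would rather script the split than do it by hand, the following is a minimal sketch that breaks a large CSV into a fixed number of gzip-compressed parts. The customers.csv file name and the four-way split mirror the example above and are assumptions; tune the part count so each compressed part lands between 1 MB and 1 GB, and note that the sketch assumes the file has no header row.

# Minimal sketch: split a large CSV into N gzip-compressed parts of roughly equal size.
# File names and the part count are illustrative; match them to your data volume.
import gzip
import itertools

SOURCE = "customers.csv"   # assumed to have no header row
PARTS = 4                  # aim for parts between 1 MB and 1 GB after compression

with open(SOURCE, "rt") as src:
    writers = [gzip.open(f"{SOURCE}.{i + 1}.gz", "wt") for i in range(PARTS)]
    try:
        # Round-robin the rows across the parts so they end up roughly equal in size.
        for part, line in zip(itertools.cycle(writers), src):
            part.write(line)
    finally:
        for w in writers:
            w.close()

If you compress the parts like this, remember to add the GZIP option to the COPY command in Step 4.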

Step 2: Upload your files to Amazon S3

  • Create an Amazon S3 bucket to hold your files, and then upload the files to the S3 bucket. Ensure that the Amazon S3 bucket that holds your data is created in the same AWS Region as your cluster.
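
As an alternative to the console, a minimal boto3 sketch for the upload is shown below; the bucket name testbucket, the key prefix, and the file names are assumptions carried over from the other examples in this post.

# Minimal sketch: upload the split files to an S3 bucket using boto3.
# Bucket name, key prefix, and file names are placeholders.
import boto3

BUCKET = "testbucket"   # must be in the same AWS Region as your cluster
FILES = [f"customers.csv.{i}.gz" for i in range(1, 5)]

s3 = boto3.client("s3")
for name in FILES:
    # All keys share the "customers" prefix so a single COPY picks up every part.
    s3.upload_file(name, BUCKET, f"customers/{name}")
    print(f"uploaded s3://{BUCKET}/customers/{name}")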

Step 3: Enable Key-based access control to your S3 bucket

There are two ways to use the COPY command, either by role-based access control or key-based access control. For this demo, you will use the latter. Create an access key to allow data load using the COPY command.

  1. Go to the IAM Management Console.
  2. Click on Access Management > Users > Add User
    • Set the user name
    • Allow programmatic access for this user
  3. Next, define the permissions: attach the existing AmazonS3ReadOnlyAccess policy.
  4. Create the user.
  5. Copy and store the secret access key in a secure location, since it is only visible at creation time.
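
The same setup can also be scripted. The sketch below uses boto3 to create a hypothetical redshift-copy-user, attach the AmazonS3ReadOnlyAccess managed policy, and generate an access key; the user name is an assumption, and in real use you would persist the secret somewhere safer than standard output.

# Minimal sketch: create an IAM user with S3 read access and an access key via boto3.
# The user name is a placeholder; handle the secret more carefully in real use.
import boto3

USER = "redshift-copy-user"
iam = boto3.client("iam")

iam.create_user(UserName=USER)
iam.attach_user_policy(
    UserName=USER,
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)

key = iam.create_access_key(UserName=USER)["AccessKey"]
# The secret is only returned once, at creation time, so store it securely now.
print("access key id:", key["AccessKeyId"])
print("secret access key:", key["SecretAccessKey"])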

Step 4: Run a COPY command to load the data from Amazon S3

The syntax to specify the files to be loaded is as follows:

copy <table_name> from '<data_source>'
authorization;

Use the query editor on the Amazon Redshift console to run the COPY command. The command loads the customers table using the data files with the prefix ‘customers’ in the Amazon S3 bucket testbucket.

copy customers
from 's3://testbucket/customers'
access_key_id '<access-key-id>'
secret_access_key '<secret-access-key>';
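
If you want to avoid pasting this statement into the query editor for every load, the COPY command can also be issued from a short script. The sketch below is one way to do it, assuming the gzip-compressed CSV parts produced earlier, the placeholder cluster endpoint and credentials used in this post, and the psycopg2 driver.

# Minimal sketch: issue the COPY command from a script instead of the query editor.
# Endpoint, credentials, bucket, and the CSV/GZIP options are placeholders that
# match the earlier examples; adjust them to your own files.
import psycopg2

ACCESS_KEY_ID = "<access-key-id>"          # from the IAM step above
SECRET_ACCESS_KEY = "<secret-access-key>"  # keep this out of source control

copy_sql = f"""
COPY customers
FROM 's3://testbucket/customers'
ACCESS_KEY_ID '{ACCESS_KEY_ID}'
SECRET_ACCESS_KEY '{SECRET_ACCESS_KEY}'
CSV
GZIP;
"""

conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="my-password",
)
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)   # loads every file under the prefix in parallel
conn.close()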

Step 5: Verify that the data was loaded correctly

After you populate the table with data, use a SELECT statement to display the data contained in the table. The SELECT * statement returns all the column names and row values for all of the data in a table and is a good way to verify that the recently added data was inserted correctly. The following SELECT command uses a LIMIT clause to restrict the number of returned rows: it returns only the first 10 rows of the Redshift table so that you don't consume too many resources.

select * from customers
limit 10;
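
This check can be scripted as well, and when a COPY does not load what you expect, it is worth querying Redshift's stl_load_errors system table for the reason. The connection details below are the same placeholders used in the earlier sketches.

# Minimal sketch: verify the load by counting rows, sampling them, and checking
# for load errors. Connection details are placeholders reused from earlier.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="my-password",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM customers;")
    print("rows loaded:", cur.fetchone()[0])

    cur.execute("SELECT * FROM customers LIMIT 10;")
    for row in cur.fetchall():
        print(row)

    # If the count looks wrong, the most recent COPY errors are recorded here.
    cur.execute(
        "SELECT filename, line_number, err_reason "
        "FROM stl_load_errors ORDER BY starttime DESC LIMIT 5;"
    )
    for err in cur.fetchall():
        print(err)
conn.close()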

Method 2: Use Hevo Data for Amazon Redshift Bulk Load

Hevo is fully managed and completely automates the process of not only exporting data from your desired source but also enriching the data and transforming it into an analysis-ready form, without your having to write a single line of code.

This utility builds on the COPY command by using a native AWS connection, which enables engineers to upload data from multiple data repositories, process loads in parallel, and eliminate repetitive SQL scripting to speed up the ETL process.

Hevo Data’s bulk loading feature allows users to reduce the data upload time by loading the data through a single transformation step. It eliminates custom code by providing a visual representation of the data flow. You can also configure reusable data pipelines for repetitive load jobs.

How it Works

  1. Point the service to your database or storage buckets using the dashboard, and it will automatically discover your datasets, infer the data structure inside your files, and extract the schema, storing this information in tables.
  2. The Hevo Data wizard will generate the ETL scripts that are needed to Extract, Transform, and Load that data into tables in Redshift. These scripts will flatten all semi-structured data no matter how complex the data is, transform the input into target data types, and drop any unneeded columns.

Why Choose Hevo?

  1. Reduces data upload time.
  2. Eliminates custom code.
  3. Creates reusable data pipelines.

Conclusion

This blog post explored two methods for achieving efficient Amazon Redshift Bulk Load. For users comfortable with SQL scripting, the COPY command offers a straightforward approach using key-based access control. However, for those seeking a completely automated solution, Hevo Data provides a user-friendly interface to streamline data extraction, transformation, and loading into Redshift, eliminating the need for manual scripting.

For more information about how to use the COPY command with other data stores, see the Amazon Redshift COPY command documentation.

Want to take Hevo for a spin? Sign up here for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Have any further queries regarding Amazon Redshift Bulk Load? Get in touch with us in the comments section below.

Talha
Software Developer, Hevo Data

Talha is a Software Developer with over eight years of experience in the field. He is currently driving advancements in data integration at Hevo Data, where he has been instrumental in shaping a cutting-edge data integration platform for the past four years. Prior to this, he spent 4 years at Flipkart, where he played a key role in projects related to their data integration capabilities. Talha loves to explain complex information related to data engineering to his peers through writing. He has written many blogs related to data integration, data management aspects, and key challenges data practitioners face.
