Amazon Redshift COPY Command: 3 Comprehensive Aspects

on Data Warehouse, Tutorial • October 1st, 2019 • Write for Hevo

Redshift Copy Command | Hevo Data

Increasingly, businesses are moving towards completely managed data warehouse services. This is because of the flexibility and ease of maintenance offered by them. Amazon Redshift makes a strong proposition among the data warehouse services because of its ability to execute complex queries over millions of rows with quick results. 

Redshift’s underlying architecture contains a cluster of instances that the user can customize based on his use case and budget. The customer is only required to pay for the storage and computing power he is using and is completely relieved of all the pain associated with maintaining the instances.

In this post, we will cover the Amazon Redshift COPY Command which is the standard way of bulk inserting data from another source. The post covers the following:

Table of Contents

What is Amazon Redshift?

Amazon Redshift is a fully managed, cloud-based, petabyte-scale data warehouse service by Amazon Web Services (AWS). It is an efficient solution to collect and store all your data and enables you to analyze it using various business intelligence tools to acquire new insights for your business and customers.

Amazon Redshift offers a Postgres-based query layer that can provide very fast results even when the query spans millions of rows. From an architectural point of view, Amazon Redshift is based on a cluster of nodes, one of which acts as a leader node and the others as compute nodes. The leader node manages communication with clients, creates execution plans for queries, and assigns tasks to compute nodes. Moreover, Amazon ​Redshift can be scaled by adding more nodes, updating existing nodes, or both. Next-gen node-based Redshift clusters with elastic scaling can scale in minutes with very short downtime.

Key Features of Amazon Redshift

Amazon Redshift is popular in the market due to the following features:

  •  Performance: With Amazon Redshift, when it comes to queries that are executed frequently, the subsequent queries are usually executed faster. This is because Redshift spends a good portion of the execution plan optimizing the query. Amazon Redshift has an architecture that allows massively parallel processing using multiple nodes, reducing the load times.
  • Scaling: Amazon Redshift has the ability to scale quickly, letting customers adjust the extent depending on their peak workload times. Redshift supports restoring data from a snapshot and spinning up a cluster.
  • Pricing: Amazon Redshift prices are calculated based on hours of usage. So you can control your expenses by spinning up clusters only when required. You can start at $0.25 per hour and scale up to your needs. A more detailed look into pricing can be found here.
  • ETL: Redshift has a COPY command which is used to load data. But for this, the data needs to be EC2. In case this data is already in Redshift, the COPY command creates duplicate rows. To overcome the complexity that results from these problems, you can use Hevo, which ensures unique records. Amazon provides AWS Glue and AWS Data Pipeline which make it easier to perform ETL. These work well for AWS services but are not so great when it comes to non-AWS services. In that case, you should certainly checkout Hevo as it has integrations with many databases and cloud applications.

What is Amazon Redshift COPY Command?

Amazon Redshift is a Cloud-based Data Warehouse service that provides a robust way to gather and store large amounts of data for analysis and manipulation. Its design consists of a collection of computing nodes which are further organized into a few large groups called clusters. This structure allows it to process data at a fast speed and offer high scalability to users.

The Redshift COPY Command is a very powerful and flexible interface to load data to Redshift from other sources. That said, it does have its share of limitations, specifically when it comes to enforcing data types and handling duplicate rows. Since it is a very comprehensive interface with a large number of configuration possibilities, it can also feel a bit overwhelming for a beginner user. 

You can read about Amazon Redshift’s features, here.

Need for Amazon Redshift COPY Command

Redshift is a data warehouse and hence there is an obvious need to transfer data generated at various sources to be pushed into it. One option here is to use Redshift’s INSERT INTO command, but this command is best suited for inserting a single row or inserting multiple rows in case of intermittent streams of data.

This is not optimized for throughput and can not exploit any sort of parallel processing. Hence, the need for a different command which can be used in inserting bulk data at the maximum possible throughput and speed. Let’s now look into specific parameters of the Amazon Redshift COPY Command and the exact details of executing the command.

Simplify your Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline helps to Load Data from any data source such as Databases, SaaS applications, Cloud Storage, SDK,s, its and Streaming Services and simplifies the ETL process. It supports 100+ data sources and loads the data onto the desired Data Warehouse like Amazon Redshift, enriches the data, and transforms it into an analysis-ready form without writing a single line of code.

Get Started with Hevo for Free

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

How to Use the Amazon Redshift COPY Command?

The Amazon Redshift COPY Command follows the below template:

COPY table-name 

[ column-list ]

FROM data_source

authorization

[ [ FORMAT ] [ AS ] data_format ] 

[ parameter [ argument ] [, ... ] ]

The field table_name represents the target table here. This is a required parameter. 

The parameter column-list is an optional parameter. This parameter takes a list of columns to which Redshift will try to map the incoming data. This can also take a JSON file where the customer can specify the mapping expressions to assign incoming data to specific Amazon Redshift columns.

The data_source field takes the source location as the value. This is a mandatory field and can take either of the below sources

  • An S3 location.
  • A dynamoDB URL
  • An EMR location
  • An EC2 instance

The authorization field represents AWS security credentials.

FORMAT keyword is an optional field that when used with an argument helps the user specify the format of the source data. The possible arguments are:

  • CSV
  • DELIMITER – Any file with a specific field delimiter
  • FIXEDWIDTH – Data files with fixed column width
  • AVRO
  • JSON
  • ENCRYPTED – An encrypted data file. A ‘symmetric key’ to decrypt the files must be specified
  • BZIP2
  • GZIP
  • LZOP
  • PARQUET
  • ORC
  • ZSTD

Amazon Redshift COPY Command can also take in conversion commands that can alter the behavior of implicit conversions that the command will do while inserting data. The following are the possible conversion arguments.

  • ACCEPTANYDATE
  • ACCEPTINVCHARS
  • BLANKSASNULL
  • DATEFORMAT
  • EMPTYASNULL
  • ENCODING
  • ESCAPE
  • EXPLICIT_IDS
  • FILLRECORD
  • IGNOREBLANKLINES
  • IGNOREHEADER
  • NULL AS
  • REMOVEQUOTES
  • ROUNDEC
  • TIMEFORMAT
  • TRIMBLANKS
  • TRUNCATECOLUMNS

Other than these, the Amazon Redshift COPY Command can also take arguments to control the load process like error behavior, compression, etc.

What are the required parameters for Redshift COPY Command?

  • table-name: The name of the COPY command’s target table. In the database, the table must already exist. It is possible for the table to be temporary or permanent. The COPY command appends new input data to any table rows that already exist.
  • FROM data-source: The source data that will be loaded into the target table is stored in this field. With some data sources, a manifest file can be specified. An Amazon S3 bucket is the most widely used data repository. You can also load directly from a DynamoDB table or from data files stored in an Amazon EMR cluster, an Amazon EC2 instance, or a remote host that your cluster can connect to via SSH.

Examples of Amazon Redshift COPY Command

The following are some examples to give you an idea regarding the usage of Amazon Redshift COPY Command:

  • Consider the case of copying data from an S3 location.  The command will be as below.
copy catdemo from 's3://strecords/temp/st_records.txt' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' region 'us-west-2';
  • Instead of using an S3 file URL directly, it is possible to provide a manifest with a list of files. This will be as follows. Please note the ‘manifest’ keyword at the end of the command.
copy catdemo from 's3://strecords/temp/st_records_manifest' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' region 'us-west-2' manifest;

It is advisable to use this wherever possible since the Amazon Redshift COPY command’s parallel processing can take advantage of the split files.

  • Consider the case of copying data from a DyanmoDB instance. The command will be as follows (Where product_details_v1_tgt is the target table in Redshift with the same column structure. Further details on how to copy data from DynamoDB to Redshift can be found in our detailed post here.) :
COPY product_details_v1_tgt from dynamodb://product_details_v1 credentials ‘aws_access_key_id = <access_key_id>;aws_secret_access_key=<secret_access_key>   readratio 40; 
  • Amazon Redshift COPY Command can also move data from an EC2 instance or an EMR cluster (Where product_details_v1_tgt is the target table in Amazon Redshift and ’emr://strrecords/myoutput/part-*’  is the place where data is located).
copy product_details_tgt_v1 from 'emr://strrecords/myoutput/part-*'  iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
  • Amazon Redshift can also copy data from a remote host. The ssh parameters and the command that needs to be executed at the remote host need to be supplied. This is done as below.
copy product_details_tgt_v1 from 's3://mybucket/ssh_manifest'  iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' ssh;

The important thing to note here is the ‘ssh’ keyword at the end of the command and the use of the manifest file. The manifest file is a JSON file with the following parameters as a list of entries.

  • endpoint’: The IP or host address of the remote system.
  • username’: The username to be used with ssh host.
  • mandatory’: Whether the ssh location is a mandatory data location. If set to true, the command will fail if the connection to a particular location cannot be established.
  • publickey’: the public key of the user.

The manifest file used for the above example is as below.

{      "entries": [   {"endpoint":"<ssh_endpoint>",    "command": "mysql -u USER -e "SELECT * FROM   product_details"",   "mandatory":true,    "publickey": “<public_key>”,    "username": “hevo_user”} ] }

How to Improve Redshift COPY Performance?

As part of the COPY transaction, the Redshift COPY command by default runs two commands:

  • “COPY ANALYZE PHASE 1|2”
  • “COPY ANALYZE $temp_table_name”

When a table is empty, Redshift runs these commands to determine the correct encoding for the data being copied. Extra queries, on the other hand, are useless in the following cases and should be removed:

  • COPYING to a temporary table (i.e. as part of an upsert operation).
  • When the table already contains data, a COPY is performed. An existing table’s data encoding cannot be modified in Redshift. Even if the COPY command finds a better encoding style, without a deep copy operation, changing the table’s encoding is impossible.

A single COPY command produces 18 “analyse compression” commands and a single “copy analyse” command in the following example:

Image Source

Additional queries may slow down other Amazon Redshift queries. They might, for example, saturate the number of slots in a WLM queue, resulting in long wait times for all other queries.

To disable these features during upsert operations, add “COMPUPDATE OFF” and “STATUPDATE OFF” to the COPY command parameters. A COPY command with these options set is shown below:


-- Load data into the staging table
 
COPY users_staging (id, name, city)
 
FROM 's3://.......'
 
CREDENTIALS 'aws_access_key_id=xxxxxxx;aws_secret_access_key=xxxxxxx'
 
COMPUPDATE OFF STATUPDATE OFF;

How to load data from Amazon S3 using COPY Command?

To load a table in parallel from Amazon S3 data files, use the COPY command. You can use an Amazon S3 object prefix or a manifest file to specify the files to be loaded.

The following is the syntax for specifying the files to be loaded by using a prefix:

copy <table_name> from 's3://<bucket_name>/<object_prefix>'
authorization;

The manifest file is a JSON-formatted file that contains a list of the data files that need to be loaded. The following is the syntax for specifying the files to be loaded via a manifest file:

copy <table_name> from 's3://<bucket_name>/<manifest_file>'
authorization
manifest;

The database table that will be loaded must already exist. In the SQL Reference, look up CREATE TABLE for information on how to make a table.

The authorization values give your cluster the AWS authorization it needs to access Amazon S3 objects. See IAM permissions for COPY, UNLOAD, and CREATE LIBRARY for more information on required permissions. Specifying the IAM ROLE parameter and providing the Amazon Resource Name (ARN) for an IAM role with the required permissions is the preferred method of authentication.

IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'

An IAM role is used to authenticate in the following example.

copy customer 
from 's3://mybucket/mydata' 
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

Use the NOLOAD option with the COPY command to validate your data without loading the table.

The following example shows the first few rows of a file named venue.txt that contains pipe-delimited data.

1|Toyota Park|Bridgeview|IL|0
2|Columbus Crew Stadium|Columbus|OH|0
3|RFK Stadium|Washington|DC|0

Split the file into multiple files before uploading it to Amazon S3 so that the COPY command can load it in parallel processing. The number of files in your cluster should be a multiple of the number of slices. Split your load data files into roughly equal-sized chunks, ranging from 1 MB to 1 GB after compression.

The venue.txt file, for example, could be divided into four files:

venue.txt.1
venue.txt.2
venue.txt.3
venue.txt.4

The VENUE table is loaded using pipe-delimited data from data files in the Amazon S3 bucket mybucket with the prefix’venue.’

copy venue from 's3://mybucket/venue'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|';

The load fails if no Amazon S3 objects with the key prefix’venue’ exist.

What are the Limitations of Redshift COPY Command?

The Amazon Redshift COPY Command although efficient, do carry the following limitations:

  • The Redshift COPY Command is tailor-made for bulk insert and if your use case is about inserting rows one by one, this may not be the best alternative.
  • The users need to be very careful about the implicit data type conversions that the Command does in the background. Otherwise, it could lead to a corrupted database.
  • Redshift COPY Command always appends data to the Redshift cluster. Since Redshift cannot enforce primary key constraints, this could lead to duplicate rows. In such cases, a staging table will need to be used. 

Conclusion

This blog introduced Redshift and its COPY Command. It further provided the details that you can follow to implement the Redshift COPY Command easily. The article also provided some examples representing the use of this command. Furthermore, it also listed down the limitations of using this command for your Data integration process. An easy way to mitigate these limitations is to use a ready-to-use data integration platform like Hevo Data. 

Visit our Website to Explore Hevo

You may want to go perform an analysis of your data generated from multiple sources. This will require you to transfer data from those sources to the Amazon Redshift Data Warehouse using various complex ETL processes. Hevo Data will automate your data transfer process, hence allowing you to focus on other aspects of your business like Analytics, Customer Management, etc. This platform allows you to transfer data from 100+ multiple sources to Cloud-based Data Warehouses like Amazon Redshift, Snowflake, Google BigQuery, etc. It will provide you with a hassle-free experience and make your work life much easier.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Any more questions on the Amazon Redshift COPY Command? Share your thoughts in the comments.

No-code Data Pipeline for Redshift