Increasingly, businesses are moving towards fully managed data warehouse services because of the flexibility and ease of maintenance they offer. Amazon Redshift stands out among these services for its ability to execute complex queries over millions of rows and return results quickly.

Redshift’s underlying architecture contains a cluster of instances that users can customize based on their use case and budget. Customers pay only for the storage and computing power they use and are relieved of all the pain associated with maintaining the instances.

In this post, we will cover the Amazon Redshift COPY Command, which is the standard way of bulk-loading data from another source.

What is Amazon Redshift?


Amazon Redshift is a fully managed, cloud-based, petabyte-scale data warehouse service by Amazon Web Services (AWS). It is an efficient solution to collect and store all your data and enables you to analyze it using various business intelligence tools to acquire new insights for your business and customers.

Amazon Redshift offers a Postgres-based query layer that can provide very fast results even when the query spans millions of rows. From an architectural point of view, Amazon Redshift is based on a cluster of nodes, one of which acts as a leader node and the others as compute nodes.

Simplify Amazon Redshift ETL & Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps load data from any source, such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services, and simplifies the ETL process. Hevo loads the data onto the desired Data Warehouse, such as Amazon Redshift, in real-time, enriches the data, and transforms it into an analysis-ready form without having to write a single line of code.

Check out why Hevo is the best for you:

  • Effortlessly extract data from 150+ connectors
  • Tailor your data to Amazon Redshift’s needs with features like drag-and-drop and custom Python scripts.
  • Achieve lightning-fast data loading into Amazon Redshift, making your data analysis-ready.

See why customers like Whatfix and Cure.Fit have upgraded to a powerful data and analytics stack by incorporating Hevo!

Get Started with Hevo for Free

What is Amazon Redshift COPY Command?

Amazon Redshift is a Cloud-based Data Warehouse service that provides a robust way to gather and store large amounts of data for analysis and manipulation. Its design consists of a collection of computing nodes organized into groups called clusters. This structure allows it to process data quickly and offer high scalability to users.

The Redshift COPY Command is a very powerful and flexible interface for loading data into Redshift from other sources. That said, it does have its share of limitations, specifically when it comes to enforcing data types and handling duplicate rows. Since it is a comprehensive interface with a large number of configuration options, it can also feel a bit overwhelming for beginners. Get a better understanding of Amazon Redshift’s features.

Need for Amazon Redshift COPY Command


Redshift is a data warehouse, so data generated at various sources needs to be transferred into it. One option is Redshift’s INSERT INTO command, but this command is best suited for inserting a single row, or multiple rows in the case of intermittent streams of data.

INSERT INTO is not optimized for throughput and cannot exploit any form of parallel processing. Hence the need for a different command that can insert bulk data at the maximum possible throughput and speed. Let’s now look at the specific parameters of the Amazon Redshift COPY Command and the exact details of executing it.

How to Use the Amazon Redshift COPY Command?

The Amazon Redshift COPY Command follows the template below:

COPY table-name
[ column-list ]
FROM data_source
authorization
[ [ FORMAT ] [ AS ] data_format ]
[ parameter [ argument ] [, ... ] ]

The field table-name represents the target table here. This is a required parameter.

The parameter column-list is optional. It takes a list of columns to which Redshift will map the incoming data. Column mapping can also be specified through a JSONPaths file in which the user defines mapping expressions that assign incoming data to specific Amazon Redshift columns.
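As a brief illustration, the sketch below loads only three columns of the catdemo table used later in this post; the column names are illustrative assumptions, and incoming fields are mapped to these columns in order:

copy catdemo (catid, catgroup, catname)
from 's3://strecords/temp/st_records.txt'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
region 'us-west-2';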

The data_source field takes the source location as its value. This is a mandatory field and can take any of the following sources:

  • An S3 location
  • A DynamoDB URL
  • An EMR location
  • An EC2 instance

The authorization field represents AWS security credentials.

The FORMAT keyword is an optional field that, when used with an argument, specifies the format of the source data (a short example follows the list). The possible arguments are:

  • CSV
  • DELIMITER – Any file with a specific field delimiter
  • FIXEDWIDTH – Data files with fixed column width
  • AVRO
  • JSON
  • ENCRYPTED – An encrypted data file. A ‘symmetric key’ to decrypt the files must be specified
  • BZIP2
  • GZIP
  • LZOP
  • PARQUET
  • ORC
  • ZSTD
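As a quick, hedged illustration (the bucket and file names are hypothetical), the command below loads a gzip-compressed CSV file:

copy catdemo
from 's3://<bucket_name>/data/st_records.csv.gz'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
format as csv
gzip;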

The Amazon Redshift COPY Command can also take conversion parameters that alter the implicit conversions it performs while inserting data. The following are the possible conversion arguments:

  • ACCEPTANYDATE
  • ACCEPTINVCHARS
  • BLANKSASNULL
  • DATEFORMAT
  • EMPTYASNULL
  • ENCODING
  • ESCAPE
  • EXPLICIT_IDS
  • FILLRECORD
  • IGNOREBLANKLINES
  • IGNOREHEADER
  • NULL AS
  • REMOVEQUOTES
  • ROUNDEC
  • TIMEFORMAT
  • TRIMBLANKS
  • TRUNCATECOLUMNS

Other than these, the Amazon Redshift COPY Command can also take arguments that control the load process, such as error behavior and compression. The sketch below combines a few of the conversion parameters from the list above.
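This is a hedged sketch only; the bucket path, file name, and table name are hypothetical. It skips one header row, parses dates in a fixed format, loads blank fields as NULL, and truncates oversized values to fit their columns:

copy product_details_v1_tgt
from 's3://<bucket_name>/exports/products.csv'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
csv
ignoreheader 1
dateformat 'YYYY-MM-DD'
blanksasnull
truncatecolumns;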

What are the Required Parameters for Redshift COPY Command?

  • table-name: The name of the COPY command’s target table, which must already exist in the database. The table can be temporary or permanent. The COPY command appends the new input data to any rows already in the table.
  • FROM data-source: The location of the source data to be loaded into the target table. A manifest file can be specified with some data sources. An Amazon S3 bucket is the most widely used data source. You can also load directly from a DynamoDB table, or from data files stored in an Amazon EMR cluster, an Amazon EC2 instance, or a remote host that your cluster can connect to via SSH.

Examples of Amazon Redshift COPY Command

The following are some examples to give you an idea regarding the usage of Amazon Redshift COPY Command:

  • Consider the case of copying data from an S3 location. The command will be as below:
copy catdemo from 's3://strecords/temp/st_records.txt' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' region 'us-west-2';
  • Instead of using an S3 file URL directly, it is possible to provide a manifest with a list of files. This will be as follows; note the ‘manifest’ keyword at the end of the command.
copy catdemo from 's3://strecords/temp/st_records_manifest' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' region 'us-west-2' manifest;

It is advisable to use a manifest wherever possible, since the Amazon Redshift COPY command’s parallel processing can take advantage of the split files.
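For illustration (the file names are hypothetical), such a manifest is a small JSON document listing each file to load:

{
  "entries": [
    {"url": "s3://strecords/temp/st_records_part_01", "mandatory": true},
    {"url": "s3://strecords/temp/st_records_part_02", "mandatory": true}
  ]
}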

  • Consider the case of copying data from a DynamoDB instance. The command will be as follows (where product_details_v1_tgt is the target table in Redshift with the same column structure; get further details on how to copy data from DynamoDB to Redshift):
copy product_details_v1_tgt from 'dynamodb://product_details_v1' credentials 'aws_access_key_id=<access_key_id>;aws_secret_access_key=<secret_access_key>' readratio 40;
  • Amazon Redshift COPY Command can also move data from an EC2 instance or an EMR cluster (where product_details_tgt_v1 is the target table in Amazon Redshift and 'emr://strrecords/myoutput/part-*' is the location of the data):
copy product_details_tgt_v1 from 'emr://strrecords/myoutput/part-*' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>';
  • Amazon Redshift can also copy data from a remote host. The SSH parameters and the command to be executed at the remote host need to be supplied. This is done as below:
copy product_details_tgt_v1 from 's3://mybucket/ssh_manifest'  iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' ssh;

The important things to note here are the ‘ssh’ keyword at the end of the command and the use of a manifest file. The manifest file is a JSON file with the following parameters as a list of entries:

  • "endpoint": The IP or host address of the remote system.
  • "username": The username to be used with the SSH host.
  • "mandatory": Whether the SSH location is a mandatory data location. If set to true, the command will fail if the connection to a particular location cannot be established.
  • "publickey": The public key of the user.

The manifest file used for the above example is shown below:

{      "entries": [   {"endpoint":"<ssh_endpoint>",    "command": "mysql -u USER -e "SELECT * FROM   product_details"",   "mandatory":true,    "publickey": “<public_key>”,    "username": “hevo_user”} ] }

How to Improve Redshift COPY Performance?

As part of the COPY transaction, the Redshift COPY command by default runs two commands:

  • "COPY ANALYZE PHASE 1|2"
  • "COPY ANALYZE $temp_table_name"

When a table is empty, Redshift runs these commands to determine the correct encoding for the data being copied. In the following cases, however, these extra queries add no value and should be disabled:

  • When COPYing to a temporary table (i.e. as part of an upsert operation).
  • When a COPY is performed on a table that already contains data. An existing table’s data encoding cannot be modified in Redshift: even if the COPY command finds a better encoding style, changing the table’s encoding is impossible without a deep copy operation.

In one example, a single COPY command produced 18 "analyze compression" commands and a single "copy analyze" command.


These additional queries may slow down other Amazon Redshift queries. They might, for example, saturate the slots in a WLM queue, resulting in long wait times for all other queries.

To disable these features during upsert operations, add “COMPUPDATE OFF” and “STATUPDATE OFF” to the COPY command parameters. A COPY command with these options set is shown below:


-- Load data into the staging table
COPY users_staging (id, name, city)
FROM 's3://.......'
CREDENTIALS 'aws_access_key_id=xxxxxxx;aws_secret_access_key=xxxxxxx'
COMPUPDATE OFF STATUPDATE OFF;

How to Load Data from Amazon S3 using COPY Command?

To load a table in parallel from Amazon S3 data files, use the COPY command. You can use an Amazon S3 object prefix or a manifest file to specify the files to be loaded.

The following is the syntax for specifying the files to be loaded by using a prefix:

copy <table_name> from 's3://<bucket_name>/<object_prefix>'
authorization;

The manifest file is a JSON-formatted file that contains a list of the data files that need to be loaded. The following is the syntax for specifying the files to be loaded via a manifest file:

copy <table_name> from 's3://<bucket_name>/<manifest_file>'
authorization
manifest;

The database table that will be loaded must already exist. See CREATE TABLE in the SQL Reference for information on how to create a table.

The authorization values give your cluster the AWS authorization it needs to access Amazon S3 objects. See IAM permissions for COPY, UNLOAD, and CREATE LIBRARY for more information on required permissions. Specifying the IAM ROLE parameter and providing the Amazon Resource Name (ARN) for an IAM role with the required permissions is the preferred method of authentication.

IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'

An IAM role is used to authenticate in the following example.

copy customer 
from 's3://mybucket/mydata' 
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

Use the NOLOAD option with the COPY command to validate your data without loading the table.
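As a minimal sketch reusing the IAM role example above, a dry-run validation simply appends the NOLOAD keyword:

copy customer 
from 's3://mybucket/mydata' 
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
noload;

The command checks that the files can be parsed into the table’s columns and reports errors without inserting any rows.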

The following example shows the first few rows of a file named venue.txt that contains pipe-delimited data.

1|Toyota Park|Bridgeview|IL|0
2|Columbus Crew Stadium|Columbus|OH|0
3|RFK Stadium|Washington|DC|0

Split the file into multiple files before uploading it to Amazon S3 so that the COPY command can load it in parallel. The number of files should be a multiple of the number of slices in your cluster. Split your load data files into roughly equal-sized chunks, ranging from 1 MB to 1 GB after compression.

The venue.txt file, for example, could be divided into four files:

venue.txt.1
venue.txt.2
venue.txt.3
venue.txt.4

The VENUE table is loaded using pipe-delimited data from data files in the Amazon S3 bucket mybucket with the prefix ‘venue’:

copy venue from 's3://mybucket/venue'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|';

The load fails if no Amazon S3 objects with the key prefix ‘venue’ exist.


What are the Limitations of Redshift COPY Command?

Although efficient, the Amazon Redshift COPY Command does carry the following limitations:

  • The Redshift COPY Command is tailor-made for bulk inserts; if your use case involves inserting rows one by one, it may not be the best alternative.
  • Users need to be very careful about the implicit data type conversions the command performs in the background. Otherwise, they could end up with a corrupted database.
  • The Redshift COPY Command always appends data to the target table. Since Redshift does not enforce primary key constraints, this can lead to duplicate rows. In such cases, a staging table will need to be used; a sketch of this pattern follows the list.
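As a hedged sketch of that staging-table pattern (the users table, its id key, and the S3 path are all hypothetical), the upsert deletes rows that are about to be replaced and then appends the fresh copies:

begin;

-- stage the incoming batch in a temporary table with the same structure
create temp table users_staging (like users);

copy users_staging
from 's3://<bucket_name>/users/'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
csv;

-- delete existing rows that match the incoming keys, then append
delete from users
using users_staging
where users.id = users_staging.id;

insert into users select * from users_staging;

drop table users_staging;
end;

This preserves COPY’s append-only semantics while avoiding duplicates on the key.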

Conclusion

This blog introduced Redshift and its COPY Command, detailed how to implement the command, and walked through examples of its use. It also listed the limitations of using this command in your Data integration process. An easy way to mitigate these limitations is to use a ready-to-use data integration platform like Hevo Data.

Sign up for a 14-day free trial and simplify your data integration process. Check out the pricing details to understand which plan fulfills all your business needs.

Frequently Asked Questions (FAQs)

1. What is Amazon Redshift COPY command?

The COPY command in Amazon Redshift is used to load large amounts of data into your Redshift tables quickly. It retrieves data from sources like Amazon S3, DynamoDB, or other databases.

2. Does the COPY command overwrite Redshift?

No, the COPY command does not overwrite existing data in a table. It appends new data to the table unless you specify otherwise with the appropriate options.

3. How do I copy from one table to another in Redshift?

To copy data from one table to another in Redshift, you can use the INSERT INTO statement combined with a SELECT statement. 
For example: INSERT INTO target_table SELECT * FROM source_table;

Vivek Sinha
Director of Product Management, Hevo Data

Vivek Sinha is a seasoned product leader with over 10 years of expertise in revolutionizing real-time analytics and cloud-native technologies. He specializes in enhancing Apache Pinot, focusing on query processing and data mutability. Vivek is renowned for his strategic vision and ability to deliver cutting-edge solutions that empower businesses to harness the full potential of their data.