Amazon Redshift COPY Command: 3 Comprehensive Aspects

Data Warehouse, Tutorial • October 1st, 2019

Increasingly, businesses are moving towards fully managed data warehouse services because of the flexibility and ease of maintenance they offer. Amazon Redshift makes a strong proposition among data warehouse services because of its ability to execute complex queries over millions of rows and return results quickly.

Redshift’s underlying architecture contains a cluster of instances that users can customize based on their use case and budget. Customers pay only for the storage and computing power they use and are relieved of all the effort involved in maintaining the instances.

In this post, we will cover the Amazon Redshift COPY Command, which is the standard way of bulk-loading data into Redshift from another source.


Introduction to Amazon Redshift COPY Command


Amazon Redshift is a Cloud-based Data Warehouse service that provides a robust way to gather and store large amounts of data for analysis and manipulation. Its design consists of a collection of computing nodes organized into groups called clusters. This structure allows it to process data at high speed and offer high scalability to users.

The Redshift COPY Command is a very powerful and flexible interface for loading data into Redshift from other sources. That said, it does have its share of limitations, specifically when it comes to enforcing data types and handling duplicate rows. Since it is a very comprehensive interface with a large number of configuration possibilities, it can also feel a bit overwhelming for beginners.

You can read about Amazon Redshift’s features here.

Need for Amazon Redshift COPY Command


Redshift is a data warehouse, and hence there is an obvious need to transfer data generated at various sources into it. One option is to use Redshift’s INSERT INTO command, but this command is best suited for inserting a single row, or multiple rows in the case of intermittent streams of data.

INSERT INTO is not optimized for throughput and cannot exploit any sort of parallel processing. Hence the need for a different command that can insert bulk data at the maximum possible throughput and speed.
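To make the contrast concrete, here is a minimal sketch; the table, bucket, and role names are illustrative and not from any specific schema.

-- One row (or a small batch) per statement:
insert into sales (id, amount) values (1, 99.50);
-- One parallel operation over a whole set of files:
copy sales from 's3://my-bucket/sales/' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' csv;

Let’s now look into the specific parameters of the Amazon Redshift COPY Command and the exact details of executing the command.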

Simplify your Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps you load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 100+ data sources and loads the data onto the desired Data Warehouse like Amazon Redshift, enriches the data, and transforms it into an analysis-ready form without writing a single line of code.

Get Started with Hevo for Free

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: With its simple and interactive UI, Hevo is easy for new customers to work with and perform operations on.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

Using the Amazon Redshift COPY Command

The Amazon Redshift COPY Command follows the below template:

COPY table-name 
[ column-list ]
FROM data_source
authorization
[ [ FORMAT ] [ AS ] data_format ]
[ parameter [ argument ] [, ... ] ]

The field table-name represents the target table here. This is a required parameter.

The parameter column-list is optional. It takes a list of target columns to which Redshift maps the incoming data. Column mapping can also be driven by a JSONPaths file, in which mapping expressions assign fields of the incoming data to specific Amazon Redshift columns.
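For illustration, a JSONPaths file lists one expression per target column, in column order; the field names and S3 paths below are hypothetical.

{
  "jsonpaths": [
    "$.product_id",
    "$.product_name",
    "$.price"
  ]
}

The file is then referenced from the command itself, e.g. copy product_details from 's3://my-bucket/data.json' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' json 's3://my-bucket/jsonpaths.json';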

The data_source field takes the source location as the value. This is a mandatory field and can take any of the following sources:

  • An Amazon S3 location.
  • An Amazon DynamoDB table.
  • An Amazon EMR cluster location.
  • A remote host, such as an EC2 instance, reached over SSH.

The authorization field represents the AWS security credentials, such as an IAM role or an access key pair, that allow Redshift to read from the source.

The FORMAT keyword is an optional field which, when used with an argument, helps the user specify the format of the source data (a short example follows the list). Note that ENCRYPTED and the compression options (BZIP2, GZIP, LZOP, ZSTD) are specified alongside a format rather than being formats themselves. The possible arguments are:

  • CSV
  • DELIMITER – Any file with a specific field delimiter
  • FIXEDWIDTH – Data files with fixed column width
  • AVRO
  • JSON
  • ENCRYPTED – An encrypted data file. A ‘symmetric key’ to decrypt the files must be specified
  • BZIP2
  • GZIP
  • LZOP
  • PARQUET
  • ORC
  • ZSTD
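As a minimal sketch of a format specification (the bucket, table, and role names are illustrative), the following loads gzip-compressed CSV files, with Redshift decompressing them on the fly during the load:

copy sales_staging from 's3://my-bucket/sales/2019/10/' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' format as csv gzip;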

The Amazon Redshift COPY Command can also take conversion parameters that alter the implicit conversions the command performs while inserting data. The possible conversion arguments are listed below, followed by a short example.

  • ACCEPTANYDATE
  • ACCEPTINVCHARS
  • BLANKSASNULL
  • DATEFORMAT
  • EMPTYASNULL
  • ENCODING
  • ESCAPE
  • EXPLICIT_IDS
  • FILLRECORD
  • IGNOREBLANKLINES
  • IGNOREHEADER
  • NULL AS
  • REMOVEQUOTES
  • ROUNDEC
  • TIMEFORMAT
  • TRIMBLANKS
  • TRUNCATECOLUMNS
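A minimal sketch combining a few of these parameters (table and bucket names are illustrative):

copy orders_staging from 's3://my-bucket/orders/' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' csv dateformat 'YYYY-MM-DD' blanksasnull emptyasnull truncatecolumns;

This load parses dates in the given format, treats blank and empty character fields as NULLs, and truncates oversized values to fit the target column instead of failing the row.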

Other than these, the Amazon Redshift COPY Command also takes arguments that control the load process itself, such as error behavior (MAXERROR) and compression analysis (COMPUPDATE); a brief example follows.
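As a minimal sketch (names are illustrative), the following tolerates a small number of bad rows and applies automatic compression analysis:

copy events from 's3://my-bucket/events/' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' csv maxerror 10 compupdate on;

MAXERROR sets the error count at which the load fails, and COMPUPDATE ON asks Redshift to apply automatic compression encoding during the load.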

Examples of Amazon Redshift COPY Command

The following examples give you an idea of how the Amazon Redshift COPY Command is used:

  • Consider the case of copying data from an S3 location. The command will be as below.
copy catdemo from 's3://strecords/temp/st_records.txt' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' region 'us-west-2';
  • Instead of using an S3 file URL directly, it is possible to provide a manifest with a list of files. This is done as follows; note the ‘manifest’ keyword at the end of the command.
copy catdemo from 's3://strecords/temp/st_records_manifest' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' region 'us-west-2' manifest;

It is advisable to use a manifest with multiple split files wherever possible, since the Amazon Redshift COPY Command’s parallel processing can load the split files concurrently across the cluster.
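For reference, an S3 manifest is a JSON file of the following shape (the file names here are illustrative):

{
  "entries": [
    {"url": "s3://strecords/temp/st_records_part_00", "mandatory": true},
    {"url": "s3://strecords/temp/st_records_part_01", "mandatory": true}
  ]
}

Setting mandatory to true makes the load fail if that file cannot be found.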

  • Consider the case of copying data from a DynamoDB table. The command will be as follows (where product_details_v1_tgt is the target table in Redshift with the same column structure; further details on how to copy data from DynamoDB to Redshift can be found in our detailed post here):
copy product_details_v1_tgt from 'dynamodb://product_details_v1' credentials 'aws_access_key_id=<access_key_id>;aws_secret_access_key=<secret_access_key>' readratio 40;
  • Amazon Redshift COPY Command can also move data from an EC2 instance or an EMR cluster (where product_details_tgt_v1 is the target table in Amazon Redshift and 'emr://strrecords/myoutput/part-*' is the location of the data).
copy product_details_tgt_v1 from 'emr://strrecords/myoutput/part-*' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>';
  • Amazon Redshift can also copy data from a remote host. The SSH parameters and the command that needs to be executed at the remote host are supplied through a manifest file. This is done as below.
copy product_details_tgt_v1 from 's3://mybucket/ssh_manifest' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' ssh;

The important things to note here are the ‘ssh’ keyword at the end of the command and the use of the manifest file. The manifest file is a JSON file containing a list of entries with the following parameters:

  • ‘endpoint’: The IP or host address of the remote system.
  • ‘username’: The username to be used with the SSH host.
  • ‘command’: The command to be executed at the remote host, whose output becomes the data to load.
  • ‘mandatory’: Whether the SSH location is a mandatory data location. If set to true, the command fails if the connection to a particular location cannot be established.
  • ‘publickey’: The public key of the user.

The manifest file used for the above example is as below.

{
  "entries": [
    {
      "endpoint": "<ssh_endpoint>",
      "command": "mysql -u USER -e \"SELECT * FROM product_details\"",
      "mandatory": true,
      "publickey": "<public_key>",
      "username": "hevo_user"
    }
  ]
}

Limitations of Redshift COPY Command

The Amazon Redshift COPY Command, although efficient, does carry the following limitations:

  • The Redshift COPY Command is tailor-made for bulk insert and if your use case is about inserting rows one by one, this may not be the best alternative.
  • The users need to be very careful about the implicit data type conversions that the Command does in the background. Otherwise, it could lead to a corrupted database.
  • The Redshift COPY Command always appends data to the target table. Since Redshift cannot enforce primary key constraints, this could lead to duplicate rows. In such cases, a staging table needs to be used; a sketch of this pattern follows the list.
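A minimal sketch of the staging-table pattern, assuming a target table target_table with a primary key column id (all names here are illustrative):

begin;
create temp table staging_table (like target_table);
copy staging_table from 's3://my-bucket/new-data/' iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' csv;
-- Drop existing rows that the new data would duplicate, then append the fresh copies.
delete from target_table using staging_table where target_table.id = staging_table.id;
insert into target_table select * from staging_table;
commit;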

Conclusion

This blog introduced Redshift and its COPY Command, explained how to execute the command, and walked through examples of its use. It also listed the limitations of using this command in your data integration process. An easy way to mitigate these limitations is to use a ready-to-use data integration platform like Hevo Data.

Visit our Website to Explore Hevo

You may want to perform an analysis of data generated from multiple sources. This requires transferring data from those sources to the Amazon Redshift Data Warehouse using various complex ETL processes. Hevo Data automates your data transfer process, allowing you to focus on other aspects of your business like Analytics, Customer Management, etc. The platform allows you to transfer data from 100+ sources to Cloud-based Data Warehouses like Amazon Redshift, Snowflake, Google BigQuery, etc. It will provide you with a hassle-free experience and make your work life much easier.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Any more questions on the Amazon Redshift COPY Command? Share your thoughts in the comments.
