
Redshift COPY Command – Usage and Examples

Increasingly, businesses are moving towards fully managed data warehouse services because of the flexibility and ease of maintenance they offer. Amazon Redshift stands out among these services for its ability to execute complex queries over millions of rows and return results quickly.

Redshift’s underlying architecture is a cluster of instances that users can size to match their use case and budget. Customers pay only for the storage and computing power they use and are relieved of the work of maintaining the instances. Explore the features of Redshift in detail here.

In this post, we will cover the Redshift COPY command which is the standard way of bulk inserting data from another source. The post covers the following:

  1. The need for the COPY command
  2. COPY command – Usage
  3. COPY command – Examples
  4. Limitations of COPY
  5. An easier way to load data to Redshift

Let us deep dive into this.


Redshift COPY Command – Need

Redshift is a data warehouse, so data generated at various sources needs to be transferred into it. One option is Redshift’s INSERT INTO command, but it is best suited for inserting a single row, or a few rows at a time from an intermittent stream of data.

INSERT INTO is not optimized for throughput and cannot exploit parallel processing, hence the need for a command that can bulk-load data at the highest possible throughput and speed: COPY. Let’s now look at the specific parameters of the COPY command and the exact details of executing it.
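
To make the contrast concrete, here is a rough sketch of the two approaches. The table name, columns, and S3 path are illustrative placeholders, not taken from any particular example.

-- Row-by-row loading with INSERT INTO: fine for trickle inserts, slow for bulk loads
insert into sales_demo (sale_id, amount) values (1, 19.99), (2, 4.50);

-- Bulk loading with COPY: files are read in parallel across the cluster's slices
copy sales_demo
from 's3://my-bucket/sales/'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
csv;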

Redshift COPY Command – Steps to follow

Redshift COPY command follows the below template:

COPY table-name
[ column-list ]
FROM data_source
authorization
[ [ FORMAT ] [ AS ] data_format ]
[ parameter [ argument ] [, ... ] ]

The field table-name represents the target table here and is a required parameter.

The parameter column-list is optional. It takes a list of columns to which Redshift will map the incoming data; for JSON or Avro sources, a JSONPaths file can instead be supplied to specify expressions that map incoming data to specific Redshift columns.
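
As a minimal sketch of an explicit column list (the orders table, its columns, and the bucket below are hypothetical):

copy orders (order_id, order_date, amount)
from 's3://my-bucket/orders/'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
csv;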

The data_source field takes the source location as its value. This is a mandatory field and can point to any of the following sources:

  • An S3 location
  • A DynamoDB table
  • An EMR cluster location
  • A remote host, such as an EC2 instance, reached over SSH

The authorization field supplies the AWS security credentials, either as an IAM role or as access keys.
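
Both authorization styles are sketched below with placeholder values (catdemo and the bucket path are illustrative); the role-based form is generally preferred:

-- Role-based authorization
copy catdemo
from 's3://my-bucket/data/'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>';

-- Key-based authorization
copy catdemo
from 's3://my-bucket/data/'
access_key_id '<access-key-id>'
secret_access_key '<secret-access-key>';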

The FORMAT keyword is an optional field which, when used with an argument, specifies the format of the source data. The possible arguments are listed below, with a short example after the list; note that the last two entries in the list are encryption and compression options that accompany a format rather than formats in their own right:

  • CSV
  • DELIMITER – Any file with a specific field delimiter
  • FIXEDWIDTH – Data files with fixed column widths
  • AVRO
  • JSON
  • PARQUET
  • ORC
  • ENCRYPTED – An encrypted data file. A ‘symmetric key’ to decrypt the files must be specified
  • BZIP2, GZIP, LZOP, ZSTD – Compressed data files
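
For instance, a pipe-delimited, gzip-compressed file could be loaded with a command along these lines (table and bucket are placeholders):

copy catdemo
from 's3://my-bucket/data/catdemo.txt.gz'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
delimiter '|'
gzip;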

The COPY command can also take data conversion parameters that alter the behaviour of the implicit conversions it performs while inserting data. The following are the possible conversion arguments.

  • ACCEPTANYDATE
  • ACCEPTINVCHARS
  • BLANKSASNULL
  • DATEFORMAT
  • EMPTYASNULL
  • ENCODING
  • ESCAPE
  • EXPLICIT_IDS
  • FILLRECORD
  • IGNOREBLANKLINES
  • IGNOREHEADER
  • NULL AS
  • REMOVEQUOTES
  • ROUNDEC
  • TIMEFORMAT
  • TRIMBLANKS
  • TRUNCATECOLUMNS

Other than these, the COPY command can also take arguments that control the load process itself, such as error handling and compression, as illustrated below.
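
As a sketch of how a few of these parameters combine, assuming a CSV file with a header row and dates in DD-MM-YYYY format (table, bucket, and the error threshold are illustrative):

copy catdemo
from 's3://my-bucket/data/catdemo.csv'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
csv
ignoreheader 1
dateformat 'DD-MM-YYYY'
blanksasnull
emptyasnull
maxerror 10;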

Redshift COPY Command – Examples

  1. Consider the case of copying data from an S3 location.  The command will be as below.

    copy catdemo
    from 's3://strecords/temp/st_records.txt'
    iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
    region 'us-west-2';

    Instead of using an S3 file URL directly, it is possible to provide a manifest with a list of files. This will be as follows. Please note the ‘manifest’ keyword at the end of the command.

    copy catdemo
    from 's3://strecords/temp/st_records_manifest'
    iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
    region 'us-west-2'
    manifest;

    It is advisable to use this wherever possible since the COPY command’s parallel processing can take advantage of the split files.
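
    The manifest itself is a small JSON file that lists the files to load. A minimal sketch, with illustrative file names, looks like this:

    {
      "entries": [
        {"url": "s3://strecords/temp/st_records.001", "mandatory": true},
        {"url": "s3://strecords/temp/st_records.002", "mandatory": true}
      ]
    }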

  2. Consider the case of copying data from a DynamoDB table. The command will be as follows.

    copy product_details_v1_tgt
    from 'dynamodb://product_details_v1'
    credentials 'aws_access_key_id=<access_key_id>;aws_secret_access_key=<secret_access_key>'
    readratio 40;

    Where product_details_v1_tgt is the target table in Redshift with the same column structure.
    Further details on how to copy data from DynamoDB to Redshift can be found in our detailed post here.

  3. The Redshift COPY command can also move data from an EMR cluster; loading from an EC2 instance is covered in the next example.

    copy product_details_tgt_v1
    from 'emr://strrecords/myoutput/part-*'
    iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>';

    Where product_details_tgt_v1 is the target table in Redshift and 'emr://strrecords/myoutput/part-*' is the location of the data on the EMR cluster.

  4. Redshift can also copy data from a remote host, such as an EC2 instance. The SSH parameters and the command to be executed at the remote host need to be supplied. This is done as below.
    copy product_details_tgt_v1
    from 's3://mybucket/ssh_manifest' 
    iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
    ssh;

    The important thing to note here is the ‘ssh’ keyword at the end of the command and the use of the manifest file. The manifest file is a JSON file with the following parameters for each entry:
    ‘endpoint’ – The IP address or hostname of the remote system.
    ‘username’ – The username to be used with the SSH host.
    ‘command’ – The command to run on the remote host; its output is what COPY loads.
    ‘mandatory’ – Whether the SSH location is a mandatory data location. If set to true, the command will fail if the connection to that location cannot be established.
    ‘publickey’ – The public key of the user.
    The manifest file used for the above example is as below.

    {
      "entries": [
        {"endpoint": "<ssh_endpoint>",
         "command": "mysql -u USER -e \"SELECT * FROM product_details\"",
         "mandatory": true,
         "publickey": "<public_key>",
         "username": "hevo_user"}
      ]
    }

Redshift COPY Command – Limitations

  1. The COPY command is tailor-made for bulk inserts; if your use case involves inserting rows one by one, it may not be the best alternative.
  2. Users need to be careful about the implicit data type conversions the command performs in the background; otherwise, they can end up with corrupted data.
  3. The COPY command always appends data to the target table. Since Redshift does not enforce primary key constraints, this can lead to duplicate rows. In such cases, a staging table needs to be used, as sketched below.
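
A minimal sketch of the staging-table pattern mentioned above, assuming product_id is the logical key (table and column names are illustrative):

create temp table product_details_staging (like product_details_v1_tgt);

copy product_details_staging
from 's3://my-bucket/product_details/'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
csv;

begin transaction;
delete from product_details_v1_tgt
using product_details_staging
where product_details_v1_tgt.product_id = product_details_staging.product_id;
insert into product_details_v1_tgt select * from product_details_staging;
end transaction;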

An easy way to mitigate these limitations is to use a ready-to-use data integration platform like Hevo Data. 

Hevo – The Easy Way to Load Data to Redshift

Hevo Data provides a code-free interface to load data in any format into Redshift in real-time. Here are some of the key aspects Hevo takes care of for you:

  • Real-time, secure data load to Redshift
  • Automatic schema handling and data type conversion
  • Ability to perform both ETL and ELT
  • Zero data loss guarantee

Explore all of Hevo’s features here. Alternatively, you can sign up for a free trial to experience a hassle-free data load to Redshift.

The Redshift COPY command is a powerful and flexible interface for loading data into Redshift from other sources. That said, it has its share of limitations, specifically around enforcing data types and handling duplicate rows. Since it is a comprehensive interface with a large number of configuration options, it can also feel a bit overwhelming for a beginner.

Any more questions on the COPY command? Share your thoughts in the comments.
