Modern ETL (Extract, Transform, and Load) architectures are all about exploiting the simplicity of completely managed Cloud Database services and maximizing the value for your ETL spend by paying only for what you use. There are various kinds of completely managed Database services optimized for different use cases like Online Transactional Processes, Online Analytical Processing, etc.
Naturally, when you use different kinds of Databases, there would be a need to migrate data between them frequently. A specific use case that comes up frequently is the transfer of data from your Transactional Database to your Data Warehouse such as transfer/copy data from DynamoDB to Redshift. This article introduces you to AWS DynamoDB and Redshift. It also provides 4 methods (with detailed instructions) that you can use to migrate data from AWS DynamoDB to Redshift.
Table of Contents
- Introduction to AWS DynamoDB
- Introduction to Amazon Redshift
- Methods to Copy Data from DynamoDB to Redshift
Introduction to AWS DynamoDB
AWS DynamoDB is a completely managed NoSQL Database offered as a service by Amazon. It is a Document-Based Database that stores data as records of Keys and Values. Each record is mandated to have a Primary Key that will be used to retrieve the data. It provides an exclusive querying layer that can execute queries based on the Primary Key and the Key-Value pairs of individual records. It is a highly available Database service with built-in mechanisms for point-in-time backups and in-memory caching.
Because of its nature of being a NoSQL Database, DynamoDB does not impose any structure on the data other than for the key attributes. Any non-key attribute can be added on the fly to the new records without the Database complaining. Likewise, any non-key attributes that are part of the earlier records are not mandatory for the next set of records.
Pricing for DynamoDB is in terms of the number of requests serviced and occupied data storage. Scaling can be done on an on-demand basis or based on a provisioned upper limit. DynamoDB also offers a mechanism called Streams. Streams provide triggers to typical Database changes. These triggers can be utilized for implementing specific actions connected to Database changes.
DynamoDB works on the basis of the collection of nodes with each node managing a particular set of Primary Keys. When a query gets executed, only the nodes that contain Primary Keys related to that query get activated. This also means the provisioned capacity is always divided equally between the nodes. If there are frequently accessed Primary Keys, then the capacity of the whole Database needs to be increased to cater to the higher demand of the particular keys.
For more information on AWS DynamoDB, click here.
Introduction to Amazon Redshift
Amazon Redshift is a managed Data Warehouse offered as a service. Amazon Redshift stores data in a structured format and provides a Postgres compatible querying layer. Amazon Redshift can scale up to 2 PB of data and allows the customers to pay only for what they use. Pricing is based on the number of instances required in the use case. Amazon Redshift offers two kinds of instance types: Compute Optimized and Storage Optimized; from which the customers can choose based on their use case.
Amazon Redshift architecture includes a cluster of nodes. One among the nodes is designated as a Leader Node and others as Compute Nodes. The Leader Node is responsible for client communication, query optimization, and assigning tasks to the Compute Node. Amazon Redshift also offers a service called Amazon Redshift spectrum which allows the customers to process data that are located in S3 and not yet added as Amazon Redshift Tables.
The Redshift cluster is scaled by adding more nodes or upgrading nodes. The newer generation nodes equipped with elastic resize features ensure that scaling can be accomplished in a matter of minutes with minimum downtime. Amazon Redshift can also scale automatically in case of a higher workload through a feature called Concurrency Scaling. Amazon Redshift users get access to free 1-hour of Concurrency Scaling credit for every 24 hours a cluster stays active.
For more information on Amazon Redshift, click here.
Methods to Copy Data from DynamoDB to Redshift
Copying data from DynamoDB to Redshift can be accomplished in 4 ways depending on the use case. Following are the ways to copy data from DynamoDB to Redshift:
- DynamoDB to Redshift: Using Redshift’s COPY Command
- DynamoDB to Redshift: Using AWS Data Pipeline
- DynamoDB to Redshift – Using DynamoDB Streams
- DynamoDB to Redshift – Using Hevo Data
Each of these 4 methods is suited for the different use cases and involves a varied range of effort. Let’s dive in.
1) DynamoDB to Redshift: Using Redshift’s COPY Command
This is by far the simplest way to copy a table from DynamoDB to Redshift. Redshift’s COPY command can accept a DynamoDB URL as one of the inputs and manage the copying process on its own. The syntax for the COPY command is as below.
copy <target_tablename> from 'dynamodb://<source_table_name>' authorization read ratio '<integer>';
For now, let’s assume you need to move product_details_v1 table from DynamoDB to Redshift (to a particular target table) named product_details_tgt. The command to move data will be as follows.
COPY product_details_v1_tgt from dynamodb://product_details_v1 credentials ‘aws_access_key_id = <access_key_id>;aws_secret_access_key=<secret_access_key> readratio 40;
The “readratio” parameter in the above command specifies the amount of provisioned capacity in the DynamoDB instance that can be used for this operation. This operation is usually a performance-intensive one and it is recommended to keep this value below 50% to avoid the source Database getting busy.
Limitations of Using Redshift’s Copy Command to Load Data from DynamoDB to Redshift
The above command may look easy, but in real life, there are multiple problems that a user needs to be careful about while doing this. A list of such critical factors that should be considered is given below.
- DynamoDB and Redshift follow different sets of rules for their table names. While DynamoDB allows for use of up to 255 characters to form the table name, Redshift limits it to 127 characters and prohibits the use of many special characters including dots and dash. Add to that Redshift table names are case insensitive.
- While copying data from DynamoDB to Redshift, Redshift tries to map between DynamoDB attribute names and Redshift column names. If there is no match for a Redshift column name, it is populated as empty or NULL depending on the value of EMPTYASNULL parameter configuration parameter in the COPY command.
- All the attribute names in DynamoDB that cannot be matched to column names in Redshift are discarded.
- At the moment, the COPY command only supports STRING and NUMBER data types in DynamoDB.
- The above method works well when the copying operation is a one-time operation. The COPY command by default always appends the new data to the existing table. Since Redshift lacks the ability to enforce Primary Key constraints, this can be a problem, if the copying operation is a recurring activity. In such cases, a Staging Table will need to be used. The idea is to copy all data to a Redshift Staging Table, use an inner join to find the rows that have the common columns, delete those columns first, and then copy from the Staging Table to the Target Table.
- The COPY command by default uses automatic compression in case the Target Table is empty and no compression encoding is specified. This requires some data from DynamoDB to be sampled. The number of rows to be sampled is specified through the value of COMPROWS parameter. Data transfer required in this operation is also part of the DynamoDB table’s provisioned capacity.
It is completely understandable if an inexperienced user finds all these intricate details a bit too overwhelming. A hassle-free ETL platform like Hevo (covered below) could provide a lifeline in such cases. Hevo provides a very easy-to-use interface that can configure data loads through simple point-and-click.
2) DynamoDB to Redshift: Using AWS Data Pipeline
AWS Data Pipeline is Amazon’s own service to execute the migration of data from one point to another point in the AWS Ecosystem. Unfortunately, it does not directly provide us with an option to copy data from DynamoDB to Redshift but gives us an option to export DynamoDB data to S3. From S3, we will need to used a COPY command to recreate the table in S3. Follow the steps below to copy data from DynamoDB to Redshift using AWS Data Pipeline:
- Create an AWS Data pipeline from the AWS Management Console and select the option “Export DynamoDB table to S3” in the source option as shown in the image below. A detailed account of how to use the AWS Data Pipeline can be found in the blog post here.
- Once the Data Pipeline completes the export, use the COPY command with the source path as the JSON file location. The COPY command is intelligent enough to autoload the table using JSON attributes. The following command can be used to accomplish the same.
In the below command, product_details_v1.json is the output of AWS Data Pipeline execution. Alternately instead of the “auto” argument, a JSON file can be specified to map the JSON attribute names to Redshift columns, in case those two are not matching.
COPY product_details_v1_tgt from s3://my_bucket/product_details_v1.json credentials ‘aws_access_key_id = <access_key_id>;aws_secret_access_key=<secret_access_key> Json = ‘auto’
3) DynamoDB to Redshift: Using DynamoDB Streams
The above 2 methods are fine if the use case requires only periodic copying of the data from DynamoDB to Redshift. There are specific use cases where real-time syncing from DynamoDB to Redshift is needed. In such cases, DynamoDB’s Streams feature can be exploited to design a streaming copy Data Pipeline.
DynamoDB Stream provides a time-ordered sequence of records that correspond to item level modification in a DynamoDB table. This item-level record of table activity can be used to recreate an item-level table activity in Redshift using a client application that can consume this stream. Amazon has designed the DynamoDB Streams to adhere to the architecture of Kinesis Streams. This means the customer just needs to create a Kinesis Firehose Delivery Stream to exploit the DynamoDB Stream data. The following are the broad set of steps involved in this method:
- Enable DynamoDB Stream in the DynamoDB console dashboard.
- Configure a Kinesis Firehose Delivery Stream to consume the DynamoDB Stream to write this data to S3.
- Implement an AWS Lambda Function to buffer the data from the Firehose Delivery Stream, batch it and apply the required transformations.
- Configure another Kinesis Data Firehose to automatically insert this data to Redshift.
Even though this method requires the user to implement custom functions, it provides unlimited scope for transforming the data before writing to Redshift.
4) DynamoDB to Redshift: Using Hevo Data
Depending on whether the use case demands a 1-time copy or continuous sync, one of the above methods can be chosen. A warning with all these methods is that there is a large number of configuration parameters and intricate details to be considered for a successful implementation.
With a ready-to-use Data Integration Platform like Hevo Data, you can easily move data from not just DynamoDB, but 100s of different data sources to Redshift. Hevo can help you accomplish this in real-time, without having to write any code.Get Started with Hevo for free
Here are the 3 simple steps you need to use to move data from DynamoDB to Redshift using Hevo:
- Connect your DynamoDB source by entering the credentials as shown in the image below.
- Configure the data that you want to load as shown in the image below.
- Configure the Redshift Data Warehouse where you want to move data.
That is it! Hevo will take care of reliably moving data from DynamoDB to Redshift with no data loss.
Hevo can handle all the limitations posed by the other methods mentioned above and provide a hassle-free data migration experience. This will in turn save the bandwidth of your precious engineering resources, who otherwise might be tied up juggling between monitoring and maintaining your data pipelines.Sign up here for a 14-day Free Trial!
The article introduced you to AWS DynamoDB and Amazon Redshift. It also provided you with 4 different methods that you can use to copy data from DynamoDB to Redshift. Since DynamoDB is usually used as a Transactional Database and Redshift as a Data Warehouse, the need to copy data from DynamoDB is very common. Depending on whether the use case demands a 1-time copy or continuous sync or fast implementation (or all), one of the above methods can be chosen. You can explore more on Hevo Data and check out why it is the most reliable and easiest method of all.Visit our Website to Explore Hevo
Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.
Share your experience of copying data from DynamoDB to Redshift in the comment section below!