Modern ETL architectures are built around fully managed cloud database services, maximizing the value of your ETL spend by paying only for what you use. There are various kinds of fully managed database services optimized for different use cases, such as online transaction processing (OLTP) and online analytical processing (OLAP). Naturally, when you use different kinds of databases, you will frequently need to migrate data between them. A use case that comes up often is moving data from your transactional database to your data warehouse. This blog post details the steps needed to migrate data from AWS DynamoDB to Redshift.
What is AWS DynamoDB?
AWS DynamoDB is a fully managed NoSQL database offered as a service by Amazon. It is a key-value and document database that stores data as records of keys and values. Each record must have a primary key, which is used to retrieve the data. It provides a proprietary querying layer that can execute queries based on the primary key and the key-value pairs of individual records. It is a highly available database service with built-in mechanisms for point-in-time backups and in-memory caching.
Being a NoSQL database, DynamoDB does not impose any structure on the data beyond the key attributes. Any non-key attribute can be added on the fly to new records without the database complaining. Likewise, non-key attributes present in earlier records are not mandatory for subsequent records.
Pricing for DynamoDB is based on the number of requests serviced and the data storage occupied. Scaling can be done on demand or based on a provisioned upper limit. DynamoDB also offers a mechanism called streams, which provides triggers for typical database changes. These triggers can be used to implement specific actions connected to database changes.
DynamoDB is built on a collection of nodes, each managing a particular set of primary keys. When a query executes, only the nodes that hold primary keys relevant to that query are activated. The provisioned capacity is always divided equally between the nodes, so if some primary keys are accessed far more frequently than others, the capacity of the whole database must be increased to cater to the higher demand on those particular keys.
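To make the cost of a hot key concrete, here is a minimal sketch (our own illustration, not an AWS API) of how evenly divided capacity forces over-provisioning of the whole table:

```python
# Illustrative model only: DynamoDB splits provisioned capacity evenly
# across its nodes, so a single hot key dictates total capacity.
def required_table_capacity(hot_key_rcu: int, partitions: int) -> int:
    """Total RCUs needed so one partition alone can serve hot_key_rcu."""
    return hot_key_rcu * partitions

# A table spread over 10 partitions where one key needs 500 RCU/s must be
# provisioned at 5,000 RCU/s overall, even if every other key is idle.
print(required_table_capacity(hot_key_rcu=500, partitions=10))
```

This is the simplified model described above; the numbers are hypothetical and chosen only to show the multiplication effect.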
What is AWS Redshift?
Amazon Redshift is a managed data warehouse offered as a service. Redshift stores data in a structured format and provides a Postgres-compatible querying layer. Redshift can scale up to 2 PB of data and allows customers to pay only for what they use. Pricing is based on the number of instances required for the use case. Redshift offers two kinds of node types, compute-optimized and storage-optimized, from which customers can choose based on their use case.
The Redshift architecture comprises a cluster of nodes. One node is designated as the leader node and the others as compute nodes. The leader node is responsible for client communication, query optimization, and assigning tasks to the compute nodes. Redshift also offers a service called Redshift Spectrum, which allows customers to query data located in S3 without first loading it into Redshift tables.
A Redshift cluster is scaled by adding more nodes or upgrading nodes. Newer-generation nodes equipped with the elastic resize feature ensure that scaling can be accomplished in a matter of minutes with minimal downtime. Redshift can also scale automatically under higher workloads through a feature called concurrency scaling. Redshift users get one free hour of concurrency scaling credit for every 24 hours a cluster stays active.
Copying data from DynamoDB to Redshift
Copying data from DynamoDB to Redshift can be accomplished in four ways, depending on the use case.
- Using the Redshift COPY command
- Using Amazon’s own service, AWS Data Pipeline
- Implementing a custom application using DynamoDB streams
- Using a quick-setup, hassle-free Data Integration Platform
Each of these methods is suited to different use cases and involves a varying amount of effort. Let’s dive in.
DynamoDB to Redshift: Using Redshift’s COPY command
This is by far the simplest way to copy a DynamoDB table to Redshift. Redshift’s COPY command can accept a DynamoDB URL as one of the inputs and manage the copying process on its own. The syntax for the COPY command is as below.
COPY <target_table_name> FROM 'dynamodb://<source_table_name>' authorization readratio <integer>;
For now, let’s assume we need to move the product_details_v1 table from DynamoDB to a target table in Redshift named product_details_tgt. The command to move the data is as follows.
COPY product_details_tgt FROM 'dynamodb://product_details_v1' credentials 'aws_access_key_id=<access_key_id>;aws_secret_access_key=<secret_access_key>' readratio 40;
The readratio parameter in the above command specifies the share of the DynamoDB table’s provisioned read capacity that this operation may consume. The copy is usually performance-intensive, and it is recommended to keep this value below 50% to avoid starving regular traffic on the source table.
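As a sketch, the COPY statement above can be assembled programmatically. The helper below is our own illustration (not an AWS or Redshift API); the table names and credential placeholders follow the example above:

```python
# Hypothetical helper that assembles the Redshift COPY statement for a
# DynamoDB source. Names and placeholders are illustrative.
def build_dynamodb_copy(target: str, source: str, access_key: str,
                        secret_key: str, readratio: int = 40) -> str:
    # Keep readratio well below 50 so the copy does not starve
    # regular traffic on the source table.
    assert 0 < readratio < 100
    return (
        f"COPY {target} FROM 'dynamodb://{source}' "
        f"credentials 'aws_access_key_id={access_key};"
        f"aws_secret_access_key={secret_key}' "
        f"readratio {readratio};"
    )

sql = build_dynamodb_copy("product_details_tgt", "product_details_v1",
                          "<access_key_id>", "<secret_access_key>")
print(sql)
```

The generated statement would then be executed against the Redshift cluster with whatever SQL client you already use.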
Limitations of Using Redshift’s Copy Command to Load Data from DynamoDB to Redshift
The above command may look easy, but in real life there are multiple problems that a user needs to be careful about. A list of the critical factors to consider is given below.
- DynamoDB and Redshift follow different rules for table names. DynamoDB allows table names of up to 255 characters, while Redshift limits them to 127 characters and prohibits many special characters, including dots and dashes. In addition, Redshift table names are case-insensitive.
- While copying, Redshift tries to map DynamoDB attribute names to Redshift column names. If there is no match for a Redshift column name, it is populated as empty or NULL depending on the value of the EMPTYASNULL configuration parameter in the COPY command.
- Any attribute names in DynamoDB that cannot be matched to column names in Redshift are discarded.
- At the moment, the COPY command only supports STRING and NUMBER data types in DynamoDB
- The above method works well when the copy is a one-time operation. By default, the COPY command always appends the new data to the existing table. Since Redshift lacks the ability to enforce primary key constraints, this becomes a problem if the copy is a recurring activity. In such cases, a staging table is needed: copy all the data into a Redshift staging table, use an inner join on the key columns to find the rows that already exist in the target, delete those rows from the target, and then insert everything from the staging table into the target table.
- By default, the COPY command applies automatic compression when the target table is empty and no compression encoding is specified. This requires sampling some data from DynamoDB; the number of rows sampled is specified by the COMPROWS parameter. The data transferred for this sampling also counts against the DynamoDB table’s provisioned capacity.
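The staging-table merge described in the list above can be sketched as a short sequence of SQL statements. The helper below is our own illustration; the table and key names are hypothetical:

```python
# Sketch of the staging-table upsert: delete the rows in the target that
# are about to be replaced, then re-insert the fresh copy from staging.
def staging_merge_statements(target: str, staging: str, key: str) -> list[str]:
    return [
        "BEGIN TRANSACTION;",
        # Remove target rows whose key matches a row in the staging table.
        f"DELETE FROM {target} USING {staging} "
        f"WHERE {target}.{key} = {staging}.{key};",
        # Insert the current version of every row from staging.
        f"INSERT INTO {target} SELECT * FROM {staging};",
        "END TRANSACTION;",
    ]

for stmt in staging_merge_statements("product_details_tgt",
                                     "product_details_stage", "product_id"):
    print(stmt)
```

Wrapping the delete and insert in one transaction keeps readers from seeing the table in a half-merged state.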
It is completely understandable if an inexperienced user finds all these intricate details a bit overwhelming. A hassle-free ETL platform like Hevo (covered below) can provide a lifeline in such cases. Hevo offers an easy-to-use interface that configures data loads through simple point-and-click.
DynamoDB to Redshift: Using AWS Data Pipeline
AWS Data Pipeline is Amazon’s own service for moving data from one point to another within the AWS ecosystem. Unfortunately, it does not directly provide an option to copy data from DynamoDB to Redshift, but it does offer an option to export DynamoDB data to S3. From S3, we will need to use a COPY command to recreate the table in Redshift. Broadly, this involves the steps below.
- Create an AWS Data Pipeline from the AWS Management Console and select the ‘Export DynamoDB table to S3’ option as the source.
- Once the pipeline completes the export, use the COPY command with the JSON file location as the source path. The COPY command is intelligent enough to auto-load the table using JSON attributes. The following command can be used to accomplish this.
COPY product_details_tgt FROM 's3://my_bucket/product_details_v1.json' credentials 'aws_access_key_id=<access_key_id>;aws_secret_access_key=<secret_access_key>' json 'auto';
In the above command, product_details_v1.json is the output of the AWS Data Pipeline execution.
Alternatively, instead of the ‘auto’ argument, a JSONPaths file can be specified to map the JSON attribute names to Redshift columns when the two do not match.
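A JSONPaths file is itself just a small JSON document listing one path per target column, in column order. Here is a minimal sketch that generates one; the attribute names are hypothetical:

```python
import json

# A JSONPaths file maps JSON attributes in the exported file to Redshift
# columns by position; it is passed to COPY in place of the 'auto' argument.
def jsonpaths_for(attributes: list[str]) -> str:
    return json.dumps({"jsonpaths": [f"$.{a}" for a in attributes]}, indent=2)

# The order here must match the column order of the Redshift target table.
print(jsonpaths_for(["product_id", "product_name", "price"]))
```

The resulting file would be uploaded to S3 and referenced in the COPY command in place of 'auto'.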
DynamoDB to Redshift – Using DynamoDB Streams
The above two methods are fine if the use case requires only periodic copying of the DynamoDB table to the Redshift table. There are specific use cases where a real-time syncing of DynamoDB and Redshift is needed. In such cases, DynamoDB’s streams feature can be exploited to design a streaming copy pipeline.
A DynamoDB stream provides a time-ordered sequence of records corresponding to item-level modifications in a DynamoDB table. This item-level record of table activity can be used to recreate the same activity in Redshift, using a client application that consumes the stream. Amazon has designed DynamoDB streams to adhere to the architecture of Kinesis streams, which means the customer can use a Kinesis Data Firehose delivery stream to exploit the DynamoDB stream data. The following are the broad steps involved in this method.
- Enable a DynamoDB stream from the DynamoDB console dashboard
- Configure a Kinesis Data Firehose delivery stream to consume the DynamoDB stream and write the data to S3
- Implement an AWS Lambda function to buffer the data from the delivery stream, batch it, and apply the required transformations
- Configure another Kinesis Data Firehose delivery stream to automatically insert this data into Redshift
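As a hedged sketch of the buffering/transformation Lambda in the steps above: Kinesis Data Firehose invokes the function with a batch of base64-encoded records and expects each one back with a recordId, a result status, and the re-encoded data. The transformation itself is illustrative; the `NewImage` attribute name assumes a DynamoDB stream record shape:

```python
import base64
import json

def handler(event, context=None):
    # Firehose transformation Lambda: decode each record, transform it,
    # and return it re-encoded with an "Ok" status.
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Illustrative transformation: keep only the new image of the item
        # and add a newline so Redshift's COPY sees one JSON object per line.
        transformed = json.dumps(payload.get("NewImage", payload)) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(transformed.encode()).decode(),
        })
    return {"records": output}
```

In practice this function would also flatten DynamoDB’s typed attribute values into plain columns; that mapping depends on your table schema.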
Even though this method requires the user to implement custom functions, it provides unlimited scope for transforming the data before writing to Redshift.
DynamoDB to Redshift – Using Hevo Data
Depending on whether the use case demands a one-time copy or a continuous sync, one of the above methods can be chosen. A caveat with all of them is the large number of configuration parameters and intricate details that must be handled for a successful implementation.
With a ready-to-use Data Integration Platform like Hevo Data, you can easily move data to Redshift from not just DynamoDB but hundreds of other data sources. Hevo can help you accomplish this in real time, without having to write any code.
Here are the 3 simple steps you need to use to move data from DynamoDB to Redshift using Hevo:
- Connect your DynamoDB source by entering the credentials
- Configure the data that you want to load
- Configure the Redshift Data Warehouse where you want to move data
That is it! Hevo will take care of reliably moving data from DynamoDB to Redshift with zero data loss.
Hevo can handle all the limitations posed by the other methods mentioned above and provide a hassle-free data migration experience. This will, in turn, save the bandwidth of your precious engineering resources, who might otherwise be tied up monitoring and maintaining your data pipelines.
Sign up for a 14-day free trial here to explore Hevo.
Since DynamoDB is usually used as a transactional database and Redshift as a data warehouse, the need to copy data from DynamoDB to Redshift is very common. Depending on whether the use case demands a one-time copy, a continuous sync, or a fast implementation (or all three), one of the above methods can be chosen.
What methods have worked for you when moving data from DynamoDB to Redshift? Let us know in the comments.