Many organizations will have a separate database (Eg: Amazon RDS) for all the online transaction needs and another warehouse (Eg: Amazon Redshift) application for all the offline analysis and large aggregation requirements. There is an obvious need to move data across these two databases (Eg: Amazon RDS to Redshift) for the following reasons:
- The online database is usually optimized for quick responses and fast writes. Running large analysis or aggregation jobs over this database will slow down the database and can have an effect on your customer experience.
- The warehouse application can have data from multiple sources and not only the transactional data. There may be third party sources or data sources from other parts of the pipeline that needs to be used for analysis or aggregation.
What the above reasons point to, is a need to move data from the transactional database to the warehouse application on a periodic basis. In this post, we will deal with moving the data between two of the most popular cloud-based transactional and warehouse applications – Amazon RDS and Amazon Redshift.
Introduction to Amazon RDS and Amazon Redshift
Amazon RDS: Amazon RDS provides a very easy to use a transactional database that frees the developer from all the headache related to database service management and keeping the database up. It allows the developer to select the desired backend and focus only on the coding part.
Amazon Redshift: Redshift offers a cloud-based data warehouse with a very clean interface and all the required APIs to query and analyze petabytes of data. It allows the developer to focus only on the analysis jobs and foget all the complexities related to managing such a reliable warehouse service. Learn more about the features of Redshift here.
Performing ETL from Amazon RDS to Redshift
There are two methods to move data from RDS to Redshift. The two ways are:
Method 1: Move data through a no-code, fully managed Data Integration Platform Hevo (explore a 14-day free trial here)
Method 2: Build custom ETL Scripts and schedule jobs to move data
This blog will focus on Method 2, its implementation and limitations. Towards the end, the blog also discusses ways to mitigate the limitations.
Moving Data from RDS to Redshift Using Custom Scripts
For the scope of this post, let us assume RDS is using MySQL as the backend.
The easiest way to do this data copy is to dump all the contents of MySQL and recreate all the tables related to this database at redshift end. Let us look deeply into the steps that are involved there.
- Export RDS table to CSV file:
The first step here is to use mysqldump to export the table into a CSV file. The problem with mysqldump command is that you can use it to export to CSV, only if you are executing the command from MySQL server machine itself. Since RDS is a managed database service, these instances usually do not have enough disk space to hold large amounts of data. To avoid this problem, we need to export the data first to a different local machine or an EC2 instance.
Mysql -B -u username -p password sourcedb -h dbhost -e "select * from source_table" -B | sed "s/'/\'/;s/\t/\",\"/g;s/^/\"/;s/$/\"/;s/\n//g" > source_table.csv
The above command selects the data from the desired table and exports it into a CSV file.
- Copying the source data files to S3:
Once the CSV is generated, we need to copy this data into an S3 bucket from where redshift can access this data. Assuming you have AWS CLI installed in our local computer this can be accomplished using the below command.
aws s3 cp source_table.csv s3://my_bucket/source_table/
- Loading data to Redshift in case of complete overwrite:
This step involves copying the source files into a redshift table using the native copy command of redshift. For doing this, login to AWS management console and navigate to Query Editor from the redshift console. Once in Query editor type the following command and execute.
copy target_table_name from ‘s3://my_bucket/source_table’ credentials access_key_id secret_access_key
Where access_key_id and secret_access_key represents the IAM credentials
- Creating temporary tale for incremental load:
The above steps to load data into Redshift is advisable only in case of a complete overwrite of a redshift table. In most cases, there is already data existing in the redshift table and there is a need to update the already existing primary keys and insert the new rows. In such cases, we first need to load the data from S3 into a temporary table and then insert it to the final destination table.
create temp table stage (like target_table_name)
Note that creating the table using the ‘like’ keyword is important here since the staging table structure should be similar to the target table structure including the distribution keys.
- Delete the rows which are already present in the target table:
begin transaction; delete from target_table_name using stage where targettable_name.primarykey = stage.primarykey;
- Insert the rows from the staging table:
insert into target_table_name select * from stage; end transaction;
The above approach works with copying data to redshift from any type of MySQL instance and not only RDS instance. The issue with using the above approach is that it requires the developer to have access to a local machine with sufficient disk memory. The whole point of using a managed database service is to avoid the problems associated with maintaining such machines. That leads us to another service that Amazon provides to accomplish the same task – AWS Data Pipeline.
Moving Data from RDS to Redshift Using AWS Data Pipeline
AWS Data Pipeline is an easy to use data migration service with built-in support for almost all of the source and target database combinations. We will now look into how we can utilize the AWS data pipeline to accomplish the same task.
As the name suggests AWS Data pipeline represents all the operations in terms of pipelines. A pipeline is a collection of tasks that can be scheduled to run at different times or periodically. A pipeline can be a set of custom tasks or built from a template that AWS provides. For this task, we will use such a template to copy the data.
- Creating a pipeline:
The first step is to log in to https://console.aws.amazon.com/datapipeline/ and clicking on Create Pipeline. Enter the pipeline name and optional description.
- Choosing a built-in template for complete overwrite of Redshift data:
After entering the pipeline name and the optional description, select ‘Build using a template’. From the templates available choose ‘Full Copy of Amazon RDS MySQL Table to Amazon Redshift’
- Providing RDS source data:
While choosing the template, information regarding the source RDS instance, staging S3 location, Redshift cluster instance and EC2 keypair names are to be provided.
- Choosing a template for an incremental update:
In case there is an already existing Redshift table and the intention is to update the table with only the changes, choose ‘Incremental Copy of an Amazon RDS MySQL Table to Amazon Redshift’ as the template.
- Selecting the run frequency:
After filling all the required information, you need to select whether to run the pipeline once or schedule it periodically. For our purpose, we should select to run the pipeline on activation.
- Activating the pipeline and monitoring the status:
The next step is to activate the pipeline by clicking ‘Activate’ and wait until the pipeline runs. AWS pipeline console lists all the pipelines and their status. Once the pipeline is in FINISHED status, you will be able to view the newly created table in redshift.
The biggest advantage of this method is that there is no need for a local machine or a separate EC2 instance for the copying operation. That said, there are some limitations for both these approaches and those are detailed in the below section.
Limitations and challenges of Custom Code Method
- The above methods’ biggest limitation is that while the copying process is in progress, the original database may get slower because of all the load. A workaround is to first create a copy of this database and then attempt the steps on that copy database.
- Another limitation is that this activity is not the most efficient one if this is going to be executed as a periodic job repeatedly. And in most cases in a large ETL pipeline, it has to be executed periodically. In those cases, it is better to use a syncing mechanism which continuously replicates to redshift by monitoring the row level changes to RDS data.
- In normal situations, there will be problems related to data type conversions while moving from RDS to Redshift in the first approach depending on the backend used by RDS. AWS data pipeline solves this problem to an extent using automatic type conversion. More on that in the next point.
- While copying data automatically to Redshift, MYSQL or RDS data types will be automatically mapped to redshift data types. If there are columns that need to be mapped to specific data types in Redshift, they should be provided in pipeline configuration against ‘RDS to Redshift conversion overrides’ parameter. The mapping rule for the commonly used data types is as follows:
MySQL Data Type Redshift Data Type TINYINT, SMALLINT, TINYINT UNSIGNED, SMALLINT UNSIGNED SMALLINT MEDIUMINT INTEGER INT, INT(size) INTEGER INT UNSIGNED, INT UNSIGNED(size) BIGINT BIGINT, BIGINT(size) BIGINT BIGINT UNSIGNED
REAL DOUBLE(size,d) DOUBLE PRECISION DECIMAL(size,d) DECIMAL(size,d) CHAR(size) VARCHAR(size*4) CHAR(size) VARCHAR(size*4) TINYTEXT VARCHAR(255*4) TEXT,MEDIUMTEXT,LONGTEXT VARCHAR(max) BOOLEAN, BOOL, TINYINT BOOLEAN
You now understand the basic way of copying data from RDS to Redshift. Even though this is not the most efficient way of accomplishing this, this method is good enough for the initial setup of the warehouse application. In the longer run, you will need a more efficient way periodically executing these copying operations
AN EASIER WAY TO ETL DATA FROM RDS TO REDSHIFT:
A Data Integration Platform like Hevo Data that works out of the box with RDS and Redshift is a much-evolved alternative to ensure you are never at a disadvantage due to the above limitations. With Hevo, you can get your data from Amazon RDS to Redshift data warehouse in a jiffy. Here are the steps to replicate RDS to Amazon Redshift using Hevo:
- Connect and configure your RDS. In this example, it is MySQL
- Select the replication mode
- (a) load selected MySQL tables (b) load data via Custom Query (c) load data through Binlog
- Configure the Redshift Data Warehouse for data load
Hevo seamlessly integrates with RDS and Redshift ensuring that you see no delay in terms of set up and implementation. Hevo will ensure that the data is available in your warehouse in real time. Hevo’s real-time streaming architecture ensures that you have accurate, latest data in your warehouse.
Before you deploy precious engineering resources at the task of building a pipeline, do explore Hevo by signing up for a 14-Day free trial here.