AWS Data Pipeline is a data movement and data processing service provided by Amazon. With it, you can move and process data according to your requirements, run pipelines on a schedule, and move data that resides on-premises. Data Pipeline provides various options to customize your resources, activities, scripts, failure handling, and more. In a pipeline, you define the sequence of data sources and destinations along with the data processing activities your business logic requires, and Data Pipeline takes care of running those activities.
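To make "defining a pipeline" concrete, here is a minimal sketch of how a schedule and default settings might be expressed as pipeline objects. The helper `to_pipeline_object` and the id `DailySchedule` are hypothetical names introduced for illustration, and the field values are illustrative assumptions. The resulting dictionaries follow the `id`/`name`/`fields` shape that boto3's `put_pipeline_definition` accepts for `pipelineObjects`; no AWS call is made here.

```python
def to_pipeline_object(obj_id, name, **fields):
    """Build one pipeline object in the id/name/fields shape.

    A value given as {"ref": "<other object id>"} becomes a refValue
    (a pointer to another pipeline object); anything else becomes a
    stringValue.
    """
    converted = []
    for key, value in fields.items():
        if isinstance(value, dict) and "ref" in value:
            converted.append({"key": key, "refValue": value["ref"]})
        else:
            converted.append({"key": key, "stringValue": str(value)})
    return {"id": obj_id, "name": name, "fields": converted}


# Hypothetical schedule: run once a day, starting at first activation.
schedule = to_pipeline_object(
    "DailySchedule",
    "DailySchedule",
    type="Schedule",
    period="1 days",
    startAt="FIRST_ACTIVATION_DATE_TIME",
)

# Default object: settings inherited by every other object in the pipeline.
default = to_pipeline_object(
    "Default",
    "Default",
    scheduleType="cron",
    failureAndRerunMode="CASCADE",
    schedule={"ref": "DailySchedule"},
)

# This list could be passed as pipelineObjects to
# boto3.client("datapipeline").put_pipeline_definition(...).
pipeline_objects = [default, schedule]
```

Data nodes (sources and destinations) and activities would be added to the same list in the same shape, each referring to the others by id.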
Steps to perform Aurora to Redshift replication using AWS Data Pipeline:
1. Select the data from Aurora.
2. Create a data pipeline to perform a full copy of data from Aurora to Redshift. For MySQL/Aurora MySQL to Redshift, AWS Data Pipeline provides a built-in template for building the pipeline. We will reuse the template and fill in the details.
Check all the preconditions and postconditions in the pipeline before activating it.
3. Once the setup is done, activate the pipeline to perform Aurora to Redshift replication.
The pipeline internally generates the following activities automatically:
- RDS to S3 Copy Activity (To stage data from Aurora)
- Redshift Table Create Activity (Create Redshift table if not present)
- Move data from S3 to Redshift
- Perform the cleanup from S3 (Staging)
4. Once the pipeline completes, check the data in Redshift.
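The four generated activities listed above run as a chain: staging, table creation, load, then cleanup. The sketch below models that chain with `dependsOn` links and derives the run order from them. The activity ids are hypothetical, and the activity types are an assumption about how the template maps each step onto Data Pipeline activity types, so check them against the definition the console generates for you.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical ids; "type" values are assumed, not taken from the template.
ACTIVITIES = {
    "RdsToS3Copy": {"type": "CopyActivity", "dependsOn": []},
    "RedshiftTableCreate": {
        "type": "ShellCommandActivity",
        "dependsOn": ["RdsToS3Copy"],
    },
    "S3ToRedshiftCopy": {
        "type": "RedshiftCopyActivity",
        "dependsOn": ["RedshiftTableCreate"],
    },
    "S3StagingCleanup": {
        "type": "ShellCommandActivity",
        "dependsOn": ["S3ToRedshiftCopy"],
    },
}


def execution_order(activities):
    """Return activity ids in an order that respects every dependsOn link."""
    # TopologicalSorter expects {node: set_of_predecessors}.
    graph = {name: set(spec["dependsOn"]) for name, spec in activities.items()}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for step, name in enumerate(execution_order(ACTIVITIES), start=1):
        print(step, name, ACTIVITIES[name]["type"])
```

Because cleanup depends on the load, the staged files in S3 are removed only after the copy into Redshift has run.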
Pros of moving data from Aurora to Redshift using AWS Data Pipeline
- AWS Data Pipeline is quite flexible, as it provides many built-in options for data handling.
- You choose the instance and cluster types the pipeline runs on, so you keep full control over its resources.
- Data Pipeline provides built-in templates in the AWS console that can be reused for similar pipeline operations.
- Condition checks and job logic are user-friendly to set up and can be adapted to your business logic.
- When the pipeline launches an EMR cluster, you can use engines other than Apache Spark, such as Pig or Hive.
Cons of migrating data from Aurora to Redshift using AWS Data Pipeline
- The biggest disadvantage of this approach is that it is not serverless: the pipeline internally launches other instances or clusters that run behind the scenes. If they are not managed properly, the approach may not be cost-effective.
- Another disadvantage, as with copying Aurora to Redshift using Glue, is that Data Pipeline is available in a limited number of regions. For the list of supported regions, refer to the AWS website.
- Job handling for complex pipelines can become very tricky, and it still requires proper development and pipeline preparation skills.
- AWS Data Pipeline sometimes returns exception messages that are not meaningful, which makes troubleshooting difficult for a developer. This area needs a lot of improvement.
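On the cost point above, one way to keep pipeline-launched resources in check is to give each of them a `terminateAfter` limit, a field that Data Pipeline resource objects such as `Ec2Resource` and `EmrCluster` support. The sketch below is a simple pre-activation check, not part of the service: the resource ids, instance types, and the helper `resources_without_timeout` are hypothetical.

```python
# Hypothetical resource definitions, written as plain dictionaries.
RESOURCES = [
    {
        "id": "Ec2Instance",
        "type": "Ec2Resource",
        "instanceType": "t1.micro",
        "terminateAfter": "2 Hours",  # resource is shut down after this time
    },
    {
        "id": "EmrClusterForTransform",
        "type": "EmrCluster",
        "masterInstanceType": "m5.xlarge",
        # No terminateAfter here: this cluster has no upper bound on runtime.
    },
]


def resources_without_timeout(resources):
    """Return ids of resources that do not declare a terminateAfter limit."""
    return [r["id"] for r in resources if "terminateAfter" not in r]


if __name__ == "__main__":
    unbounded = resources_without_timeout(RESOURCES)
    if unbounded:
        print("No terminateAfter set on:", ", ".join(unbounded))
```

Running a check like this before activation flags any instance or cluster that could keep running, and keep billing, if an activity hangs.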
Simpler Way to transfer data from Aurora to Redshift
Using the Hevo Data Integration Platform, you can seamlessly replicate data from Aurora to Redshift in 3 simple steps:
- Connect and configure your Aurora database.
- Select the replication mode: (a) load selected Aurora tables (b) load data via Custom Query (c) load data through Binlog.
- For each table in Aurora choose a table name in Redshift where it should be copied.
While you rest, Hevo takes responsibility for fetching the data and moving it to your destination warehouse. Unlike AWS Data Pipeline, Hevo provides you with an error-free, completely controlled setup to transfer data in minutes.
Aurora to Redshift replication using AWS Data Pipeline is convenient when you want full control over your resources and environment. It is a good service for people who are competent at implementing ETL solution logic. However, in our opinion, this service has not been as effective or as successful as other data movement services. It was launched quite a long time ago and is still available in only a few regions. That said, since AWS Data Pipeline supports multi-region data movement, you can select Data Pipeline in the nearest supported region and perform the data movement using that region's resources (be careful about security and compliance).