AWS Data Pipeline is a web service that helps users define automated workflows for moving and transforming data. In other words, it offers extraction, transformation, and loading of data as a service. Users need not build an elaborate ETL or ELT platform to use their data and can instead rely on the predefined configurations and templates provided by Amazon. Most operations performed by Data Pipeline use computing power other than the source and target databases, and this power comes from Amazon’s computing services such as EMR.
AWS Data Pipeline – The Need
The need for an ETL platform stems from the fact that data in a typical organization is scattered across multiple sources in multiple formats. To make any use of this data to improve the business, it needs to be cleaned and transformed into actionable forms. This is not a one-time process and needs to be repeated periodically, as the data in these sources grows with every single business activity. Traditionally, organizations built complex internal on-premise platforms to accomplish this. This meant a lot of effort was spent on developing and maintaining the platform, distracting the workforce from actually creating value from the data. This is where services like Data Pipeline come in, offering all the convenience of a complete ETL platform as a web service.
AWS Data Pipeline – Features
As mentioned earlier, AWS Data Pipeline automates workflows between different sources and targets. It supports most AWS sources as well as typical on-premise sources such as JDBC-based databases.
Data pipeline allows users to schedule these operations or chain them on the basis of success or failure of upstream tasks.
It supports comprehensive transformation operations through activities like HiveActivity, PigActivity, and SqlActivity. Custom code-based transformations are supported through HadoopActivity, which can run user-supplied code on an EMR cluster or an on-premise cluster.
Customers can choose to start an EMR cluster only when required using EmrActivity and then use a HadoopActivity to run their processing or transformation jobs.
It allows the customers to make use of their on-premise system for data sources or transformation, provided these compute resources are set up with data pipeline task runners.
It provides a very flexible pricing regime with the user only having to pay for the time when the compute resources are being used and a flat fee for periodic tasks.
It provides a very simple interface that enables the customers to set up complex workflows just through a few clicks.
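As a rough illustration of on-demand clusters and chaining, the sketch below describes an EMR cluster resource and two activities as plain Python dictionaries in the style of a pipeline definition file. The object ids, instance sizes, and S3 path are hypothetical, and the field names (`runsOn`, `dependsOn`, `terminateAfter`, `jarUri`) are assumptions that should be checked against the current AWS documentation. The helper only works out the run order implied by `dependsOn`; it does not call AWS.

```python
# Sketch only: ids, instance sizes and the S3 path are made up for illustration.
objects = [
    {
        "id": "TransientCluster",
        "type": "EmrCluster",
        "masterInstanceType": "m5.xlarge",
        "coreInstanceType": "m5.xlarge",
        "coreInstanceCount": "2",
        # Ask for the cluster to be shut down once it is no longer needed.
        "terminateAfter": "2 Hours",
    },
    {
        "id": "TransformJob",
        "type": "HadoopActivity",
        "runsOn": {"ref": "TransientCluster"},
        "jarUri": "s3://example-bucket/jars/transform.jar",
    },
    {
        "id": "PublishStep",
        "type": "ShellCommandActivity",
        "runsOn": {"ref": "TransientCluster"},
        "command": "echo transform finished",
        # Chained: runs only after TransformJob has succeeded.
        "dependsOn": {"ref": "TransformJob"},
    },
]


def execution_order(objs):
    """Order activities so each one comes after the activity it depends on."""
    activities = {o["id"]: o for o in objs if o["type"].endswith("Activity")}
    ordered, seen = [], set()

    def visit(obj_id, path=()):
        if obj_id in seen:
            return
        if obj_id in path:
            raise ValueError(f"dependency cycle at {obj_id}")
        dep = activities[obj_id].get("dependsOn")
        if dep is not None:
            visit(dep["ref"], path + (obj_id,))
        seen.add(obj_id)
        ordered.append(obj_id)

    for obj_id in activities:
        visit(obj_id)
    return ordered


print(execution_order(objects))
```

The helper handles a single `dependsOn` reference per activity, which keeps the sketch short; a definition with several upstream dependencies would need a small extension.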
AWS Data Pipeline – Core Concepts & Architecture
Conceptually, AWS Data Pipeline is organized around a pipeline definition that consists of the following components.
- Task runners – Task runners are installed on the compute machines that process the extraction, transformation, and load activities. Task runners are responsible for executing these processes as per the schedule defined in the pipeline definition.
- Data nodes – Data nodes represent the type of the data and the location from which it can be accessed by the pipeline. This includes both input and output data elements.
- Activities – Activities represent the actual work that is being performed on the data. Data pipeline supports multiple activities that can be chosen according to the workloads. Typical activities are listed below.
  - CopyActivity – Used when data needs to be copied from one data node to another.
  - EmrActivity – Activity for starting and running an EMR cluster.
  - HiveActivity – Runs a Hive query.
  - PigActivity – Runs a Pig script on an AWS EMR cluster.
  - RedshiftCopyActivity – Runs a copy operation to a Redshift table.
  - ShellCommandActivity – For executing a Linux shell command or a script.
  - SqlActivity – Runs an SQL command on supported databases. Data Pipeline supports JDBC databases, AWS RDS databases, and Redshift.
- Preconditions – These are pipeline components with conditional statements which must be true for the next pipeline activity to start. These are used for chaining pipeline activities based on custom logic.
- Resources – The resource for an AWS Data Pipeline activity is usually an EMR cluster or an EC2 instance.
- Actions – Data pipelines can be configured to execute certain actions when specific conditions are met or certain events occur. These are typically notifications or termination requests.
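To make the components above more concrete, here is a minimal sketch of a pipeline definition written as a Python dictionary: a schedule, two data nodes, one activity, a resource, a precondition, and an action. All ids, bucket paths, and the SNS topic ARN are placeholders, and the object types and field names (`S3DataNode`, `S3KeyExists`, `SnsAlarm`, `filePath`, `onFail`, and so on) are written from memory of the definition file format, so treat them as assumptions to verify. The helper checks one thing only: that every `{"ref": ...}` points at an object that exists.

```python
# Sketch only: ids, S3 paths and the topic ARN are placeholders.
definition = {
    "objects": [
        {"id": "DailySchedule", "type": "Schedule",
         "period": "1 day", "startAt": "FIRST_ACTIVATION_DATE_TIME"},
        # Data nodes: where the data is read from and written to.
        {"id": "InputData", "type": "S3DataNode",
         "schedule": {"ref": "DailySchedule"},
         "filePath": "s3://example-bucket/input/data.csv"},
        {"id": "OutputData", "type": "S3DataNode",
         "schedule": {"ref": "DailySchedule"},
         "filePath": "s3://example-bucket/output/data.csv"},
        # Precondition: must hold before the activity may start.
        {"id": "InputReady", "type": "S3KeyExists",
         "s3Key": "s3://example-bucket/input/data.csv"},
        # Resource: the machine that does the work.
        {"id": "Worker", "type": "Ec2Resource",
         "schedule": {"ref": "DailySchedule"},
         "instanceType": "t2.micro", "terminateAfter": "30 Minutes"},
        # Action: a notification sent when the activity fails.
        {"id": "FailureAlarm", "type": "SnsAlarm",
         "topicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
         "subject": "Copy failed", "message": "The daily copy did not finish."},
        # Activity: the actual work, wired to everything above.
        {"id": "DailyCopy", "type": "CopyActivity",
         "schedule": {"ref": "DailySchedule"},
         "input": {"ref": "InputData"}, "output": {"ref": "OutputData"},
         "runsOn": {"ref": "Worker"},
         "precondition": {"ref": "InputReady"},
         "onFail": {"ref": "FailureAlarm"}},
    ]
}


def find_refs(value):
    """Yield every id referenced via {"ref": ...} inside a field value."""
    if isinstance(value, dict):
        if "ref" in value:
            yield value["ref"]
    elif isinstance(value, list):
        for item in value:
            yield from find_refs(item)


def unresolved_refs(objs):
    """Return the sorted ids that are referenced but never defined."""
    ids = {o["id"] for o in objs}
    missing = set()
    for obj in objs:
        for value in obj.values():
            missing.update(r for r in find_refs(value) if r not in ids)
    return sorted(missing)


print(unresolved_refs(definition["objects"]))
```

A local check like this catches dangling references early, before the service's own validation runs.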
AWS Data Pipeline – Pros and Cons
AWS Data Pipeline offers the power of an ETL platform in the form of a web service with a comprehensive control panel. That said, it is not without its cons. The section below details the pros and cons of the service from an ETL developer’s point of view.
Pros
- Simple-to-use control panel with predefined templates for most AWS databases.
- Ability to spawn clusters and resources only when needed.
- Ability to schedule jobs to run only during specific time periods.
- Full security suite protecting data in motion and at rest. AWS’s access control mechanism allows fine-grained control over who can use what.
- Fault-tolerant architecture – Relieves users of all the activities related to system stability and recovery.
Cons
- Data Pipeline is designed for the AWS world and hence integrates well with all AWS components, but it is not the right option if you need to bring in data from third-party services.
- Working with Data Pipeline and on-premise resources can be overwhelming, with multiple installations and configurations to be managed on the compute resources.
- Data Pipeline’s way of representing preconditions and branching logic can seem complex to a beginner, and there are other tools that make complex chains easier to build. An example is a framework like Airflow.
AWS Data Pipeline Alternative
As mentioned above, AWS Data Pipeline is not without its cons and can make simple jobs seem complex if there are components outside the AWS universe. In such cases, your needs may be better served by a fully managed data integration platform like Hevo.
Hevo Data Integration Platform
With a setup time of only a few minutes, Hevo can bring data into your destination warehouse in real time. With its AI-powered fault-tolerant architecture, Hevo promises to stream your data securely with zero data loss.
AWS Data Pipeline Pricing
Data Pipeline is priced in terms of the activities and preconditions configured in the console and their frequency of execution. AWS classifies activities that are executed up to once per day as low-frequency; all activities that are executed more than once per day are high-frequency. Low-frequency activities running on AWS are charged at $0.60 per month, and those running on on-premise systems at $1.50 per month. High-frequency activities are charged at $1 per month on AWS and $2.50 per month on on-premise systems.
All resources used by the pipeline activities, such as EC2 instances, EMR clusters, and Redshift databases, are charged at their normal rates on top of the pipeline pricing. The charges mentioned above cover only the pipeline features.
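The pricing rules above reduce to a small lookup. The sketch below uses only the rates quoted in this section; the function names and the `(runs_per_day, location)` input shape are illustrative choices, and the result excludes EC2, EMR, and Redshift charges, which are billed separately.

```python
# Monthly rate per activity or precondition in USD, as quoted in this section.
RATES = {
    ("low", "aws"): 0.60,
    ("low", "on_premise"): 1.50,
    ("high", "aws"): 1.00,
    ("high", "on_premise"): 2.50,
}


def frequency_class(runs_per_day):
    """Up to once per day counts as low frequency, anything more as high."""
    return "low" if runs_per_day <= 1 else "high"


def monthly_pipeline_fee(activities):
    """Sum the monthly pipeline fee for (runs_per_day, location) pairs."""
    total = sum(RATES[(frequency_class(runs), location)]
                for runs, location in activities)
    return round(total, 2)


# A daily copy on AWS, an hourly check on AWS, and a daily on-premise export.
example = [(1, "aws"), (24, "aws"), (1, "on_premise")]
print(monthly_pipeline_fee(example))
```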
Working with AWS Data Pipelines
Working with the AWS pipeline is all about pipeline definitions. Let us look into setting up a simple AWS pipeline for copying data from RDS to Redshift. This can be done based on predefined templates from AWS, saving us quite a lot of configuration effort.
- From the AWS console, go to Data Pipeline and select ‘Create new pipeline’. This will take you to the pipeline configuration screen.
- Enter the name and description of the pipeline and choose a template. Here we choose ‘incremental copy of MySQL RDS to Redshift’. There is also an option to configure the pipeline using the Architect application for more advanced use cases.
- After selecting the template, fill in the parameters for the data nodes used in this case, starting with the parameters for the RDS MySQL instance.
- Configure the Redshift connection parameters.
- Select the schedule for the activity to run. You can either select a schedule or enable a one-time run on activation.
- The next step is to enable the logging configuration. We suggest you enable this for any kind of pipeline activity and point the logging directory to an S3 location. This can be very useful for troubleshooting later. Click ‘Activate’ and you are good to go.
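The console steps above also have a programmatic counterpart. The sketch below is not the RDS-to-Redshift template; it is a deliberately small, hypothetical pipeline that shows the create, define, activate sequence. It assumes a client object with the `create_pipeline`, `put_pipeline_definition`, and `activate_pipeline` methods of the boto3 `datapipeline` client; the pipeline name, ids, bucket path, and the `pipelineLogUri` and `scheduleType` fields on the `Default` object are assumptions to check against the AWS documentation.

```python
# Sketch only: ids, names and the S3 log location are placeholders.
pipeline_objects = [
    {"id": "Default", "name": "Default",
     "scheduleType": "ondemand",  # one-time run on activation
     "pipelineLogUri": "s3://example-bucket/pipeline-logs/",
     "role": "DataPipelineDefaultRole",
     "resourceRole": "DataPipelineDefaultResourceRole"},
    {"id": "Worker", "type": "Ec2Resource",
     "instanceType": "t2.micro", "terminateAfter": "30 Minutes"},
    {"id": "SayHello", "type": "ShellCommandActivity",
     "runsOn": {"ref": "Worker"}, "command": "echo hello"},
]


def to_api_objects(objs):
    """Convert definition-file style objects to the id/name/fields shape."""
    api_objects = []
    for obj in objs:
        fields = []
        for key, value in obj.items():
            if key in ("id", "name"):
                continue
            values = value if isinstance(value, list) else [value]
            for item in values:
                if isinstance(item, dict) and "ref" in item:
                    # References to other objects use refValue.
                    fields.append({"key": key, "refValue": item["ref"]})
                else:
                    fields.append({"key": key, "stringValue": str(item)})
        api_objects.append({"id": obj["id"],
                            "name": obj.get("name", obj["id"]),
                            "fields": fields})
    return api_objects


def deploy(client, name, unique_id, objs):
    """Create a pipeline, upload its definition, and activate it."""
    pipeline_id = client.create_pipeline(name=name, uniqueId=unique_id)["pipelineId"]
    result = client.put_pipeline_definition(
        pipelineId=pipeline_id, pipelineObjects=to_api_objects(objs))
    if result.get("errored"):
        raise RuntimeError(f"definition rejected: {result.get('validationErrors')}")
    client.activate_pipeline(pipelineId=pipeline_id)
    return pipeline_id


# With AWS credentials configured, a real client could be passed in, e.g.:
#   import boto3
#   deploy(boto3.client("datapipeline"), "hello-pipeline", "hello-1", pipeline_objects)
print(to_api_objects(pipeline_objects)[2])
```

Passing the client in as an argument keeps the conversion logic testable without network access and leaves the AWS call itself in one place.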
If your ETL involves AWS ecosystem components only, then AWS Data Pipeline is an excellent choice for implementing ETL workflows without having to maintain an ETL infrastructure of your own. That said, it is not without its quirks, and we have tried to explain the less elegant bits in the sections above.
If your use case spans beyond AWS components, or if you are looking to implement a fuss-free ETL, it may be better to use a robust data pipeline platform such as Hevo Data, which provides much more flexibility along with an enterprise-grade data migration experience.