Understanding Airflow ETL: 2 Easy Methods

• August 25th, 2021

Airflow ETL

Organizations with a separate transactional database and data warehouse typically run a large number of data engineering activities. They extract, transform, and load data from a variety of sources into their data warehouse. These sources include their own customer transactions, data from SaaS (Software as a Service) offerings, as well as third-party data that can be useful for analyzing customer behavior. Such ETL jobs are managed by ETL frameworks that help organize the jobs into directed workflow graphs, monitor them, and keep track of service level agreements.

Airflow is one such popular framework that helps in workflow management. Its excellent scheduling capabilities and graph-based execution flow make it a great choice for running ETL jobs. This post will help you learn the basics of Airflow and execute an ETL job to transfer data from Amazon S3 to Redshift.

The goal of this post is to familiarize developers with the capabilities of Airflow ETL and to get them started on their first ETL job implementation using Airflow.


Introduction to Airflow ETL

Airflow Logo
Image Source

Airflow is an open-source framework that can be deployed on on-premise servers or in the cloud. It has built-in connectors to most industry-standard source and target combinations, and it also allows you to write custom plugins for databases that are not supported out of the box. Airflow provides a Directed Acyclic Graph (DAG) view which helps in managing the task flow and serves as documentation for the multitude of jobs. It also has a rich web UI to help with monitoring and job management.

To know more about Airflow ETL, visit this link.

Methods To Perform Airflow ETL

Method 1: Using Airflow for performing ETL jobs

Making use of custom code to perform an ETL job is one such way. Airflow works on the basis of a concept called operators. Operators denote basic logical blocks in the Airflow ETL workflows; they can represent anything from the movement of a file to a complex transformation. A task is formed using one or more operators, and multiple tasks are stitched together to form Directed Acyclic Graphs (DAGs).

Method 2: Using Hevo’s no code data pipeline for performing ETL jobs

Hevo Data, an Automated No-code Data Pipeline, helps you directly transfer data from Databases, CRMs, SaaS Platforms, and a multitude of other sources to Data Warehouses, Databases, or any other destination of your choice in a completely hassle-free manner. Hevo offers end-to-end Data Management and completely automates the process of collecting your decentralized data and transforming it into an analysis-ready form. Its fault-tolerant architecture ensures high Data Quality and Data Governance for your work without having to write a single line of code.

Hevo is fully managed and completely automates the process of not only loading data from 100+ data sources (including 40+ free sources) but also enriching the data and transforming it into an analysis-ready form without any manual intervention. It provides a consistent & reliable cloud-based solution to manage data in real-time and always have analysis-ready data in your desired destination. Hevo takes care of your complex ETL processes and allows you to focus on key business needs and data analysis using a BI tool of your choice.

Download the Guide to Evaluate ETL Tools
Learn the 10 key parameters while selecting the right ETL tool for your use case.

Method 1: Using Airflow for performing ETL jobs

You would need the following before you could move on to performing an Airflow ETL job:

  • A working Airflow installation (web server and scheduler).
  • An AWS account with an S3 bucket holding the source file, along with an access key ID and secret access key that can read it.
  • An Amazon Redshift cluster and the credentials to connect to it.

Airflow works on the basis of a concept called operators. Operators denote basic logical blocks in ETL workflows and can represent anything from the movement of a file to a complex transformation. A task is formed using one or more operators, and multiple tasks are stitched together to form a Directed Acyclic Graph. Every ETL job in Airflow is a DAG.
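As a quick illustration (separate from the S3-to-Redshift job built below), a minimal DAG that stitches two tasks together could look like the sketch below. The operator choice, task names, and commands are placeholders.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# A toy DAG with two tasks; each BashOperator instance is one task.
example_dag = DAG('example_two_tasks',
                  start_date=datetime(2020, 6, 1),
                  schedule_interval='@once')

extract = BashOperator(task_id='extract', bash_command='echo "extracting"', dag=example_dag)
load = BashOperator(task_id='load', bash_command='echo "loading"', dag=example_dag)

# The >> operator sets the execution order: extract must finish before load runs.
extract >> load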

Performing an Airflow ETL job involves the following steps:

  • Step 1: Preparing the Source and Target Environments
  • Step 2: Starting the Airflow Web Server
  • Step 3: Creating a Connection to S3
  • Step 4: Creating a Redshift Connection
  • Step 5: Creating the DAG File
  • Step 6: Triggering the Job and Monitoring the Results

We will now dig deep into each of the above steps of executing an Airflow ETL job.

Step 1: Preparing the Source and Target Environments

Our input file for this exercise looks as below.

1,Alex,addr1,addr2
2,Vlad,addr1,addr2
3,Paul,addr1,addr2
4,Russ,addr1,addr2
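If the file is not already in S3, you can upload it with the AWS CLI; the file name, bucket name, and key below are placeholders.

aws s3 cp customer.csv s3://your-s3-bucket/customer/customer.csv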

You will now log in to the Redshift console and create a table to hold this data. Use the below command for this.

create table customer(
    id int,
    name varchar(100),
    address1 varchar(100),
    address2 varchar(100)
);

Step 2: Starting the Airflow Web Server

Use the below command to start the Airflow Web Server:

airflow webserver -p 8080

Once started, go to localhost:8080 to view the Airflow UI.
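The web server only serves the UI. For DAG runs to actually execute, the Airflow scheduler must also be running; in a typical installation, you would start it in a separate terminal with the command below.

airflow scheduler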

Airflow ETL
Image Source: Self

Step 3: Creating a Connection to S3

To create a connection to S3, go to the Admin menu and select ‘Connections’.

Connecting with S3
Image Source: Self

Click on ‘Create’ and select S3 as the ‘Conn Type’, as shown below.

S3 Connection Settings
Image Source: Self

In the ‘Extra’ section, add your AWS credentials below.

{"aws_access_key_id":"_your_aws_access_key_id_", "aws_secret_access_key": "_your_aws_secret_access_key_"}

Leave all fields other than ‘Conn Id’, ‘Conn Type’, and ‘Extra’ blank. We will be using the ‘Conn Id’ when we create the DAG in the following steps.
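If you prefer the command line, the same connection can also be created with the Airflow CLI. The exact syntax differs between Airflow versions; the sketch below assumes an Airflow 1.10-style CLI, and the connection ID is a placeholder.

airflow connections --add --conn_id my_s3_conn --conn_type s3 --conn_extra '{"aws_access_key_id": "_your_aws_access_key_id_", "aws_secret_access_key": "_your_aws_secret_access_key_"}'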

Step 4: Creating a Redshift Connection

Click ‘Create’ in the connections tab and add details as below. 

Creating A Redshift Connection
Image Source: Self

In the ‘Conn Type’ field, select Postgres, and fill in your Redshift cluster’s connection details: the cluster endpoint as the host, the database name as the schema, the login, the password, and the port. As above, any additional connection parameters can be supplied in the ‘Extra’ section in JSON format.
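A typical set of values might look like the sketch below. The endpoint, database name, and credentials are illustrative placeholders; 5439 is the default Redshift port.

Conn Id: redshift_conn_id
Conn Type: Postgres
Host: your-cluster.xxxxxxxxxxxx.us-east-1.redshift.amazonaws.com
Schema: dev
Login: awsuser
Password: ********
Port: 5439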

Step 5: Creating the DAG File

The DAG file will use an operator called S3ToRedshiftTransfer, which ships with Airflow in the s3_to_redshift_operator module. Place the below file in the ‘dags’ folder located in the Airflow home directory (AIRFLOW_HOME, ~/airflow by default). If this folder does not already exist, feel free to create one and place the file in there.

import datetime as dt
 
from airflow import DAG
from airflow.operators.s3_to_redshift_operator import S3ToRedshiftTransfer
 
default_args = {
  'owner': 'me',
  'start_date': dt.datetime(2020, 6, 1),
  'retries': 2,
  'retry_delay': dt.timedelta(minutes=5),
}
 
dag = DAG('redshift-etl',
  default_args=default_args,
  schedule_interval='@once'
)

transfer_redshift = S3ToRedshiftTransfer(
    task_id='transfer_redshift',
    schema='schema',
    table='table',
    s3_bucket='s3_bucket',
    s3_key='s3_key',
    redshift_conn_id='redshift_conn_id',
    aws_conn_id='aws_conn_id',
    dag=dag
)

# With a single task, there are no dependencies to declare.
transfer_redshift

The above code defines a DAG and an associated task that uses the built-in S3ToRedshiftTransfer operator. With schedule_interval='@once' and a start date of June 1, 2020, the job runs exactly once, and two retries are attempted in case it fails. Replace the placeholder strings (schema, table, s3_bucket, s3_key) with your Redshift schema and table name and your S3 bucket and key, and set redshift_conn_id and aws_conn_id to the ‘Conn Id’ values created in Steps 3 and 4. After placing this file in the ‘dags’ folder, refresh the webserver UI and you will notice the new DAG appearing as below.

DAG Files
Image Source: Self

Step 6: Triggering the Job and Monitoring the Results

To trigger the job, first toggle the DAG from ‘Off’ to ‘On’ using the switch on the left-hand side of the DAG list, then press the trigger button, the leftmost of the action buttons on the right-hand side. Once the job is triggered, switch to the ‘DAG Runs’ view and you will notice the status of the job as ‘running’.

DAG File Runs
Image Source: Self
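The same run can also be triggered from the command line. The exact command depends on your Airflow version; for an Airflow 1.10-style CLI it would be the below.

airflow trigger_dag redshift-etl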

For monitoring, there is another view called the Graph View, which shows the full graph of each DAG and the status of each task in it.

The graph view of our Airflow ETL job is as below. This view is very helpful in the case of DAGs with multiple tasks.

Graph View Of ETL Job
Image Source: Self

It shows our task in green, which means it completed successfully. Access the Redshift console again and you will find the data copied to the Redshift table.
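You can confirm this with a quick query against the table created in Step 1, for example:

select * from customer;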

And that concludes our steps to execute this simple S3 to Redshift transfer. Airflow is capable of handling much more complex DAGs and scheduling scenarios. That said, it is not without its limitations. You will now learn about some of the typical challenges in using Airflow ETL.

Challenges Involved in using Airflow as a Primary ETL Tool

  1. The above transfer works fine in the case of one-off loads. But typically the requirement is for a continuous load. In this case, a staging table and additional logic to handle duplicates will all need to be part of the DAG (a rough sketch of such a deduplication step follows this list).
  2. Airflow ETL work is based on operators. Even though there are many built-in and community-contributed operators available, support for SaaS offerings is limited in Airflow. If you are someone who uses a lot of SaaS applications for running your business, your developers will need to implement Airflow plugins to connect to them and transfer data.
  3. Even though Airflow provides a web UI, the DAG definition is still based on code or configuration.
  4. Airflow is primarily a workflow engine, and the execution of transformations happens in either the source or the target database. This means the developers need to be experts in both source and destination capabilities and must spend extra effort in maintaining the execution engines separately.
  5. Transformation operators in Airflow are limited and in most cases, developers will have to implement custom ones.
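As an illustration of the first point, a continuous load would typically land data in a staging table and then merge it into the target. A rough sketch of such a deduplication task, continuing the DAG file from Step 5, might look like the below. It assumes the S3ToRedshiftTransfer task loads into a hypothetical customer_staging table instead of customer; the table and column names are illustrative.

from airflow.operators.postgres_operator import PostgresOperator

# Merge freshly staged rows into the target table, replacing rows with the
# same primary key, then clear the staging table for the next load.
upsert_customer = PostgresOperator(
    task_id='upsert_customer',
    postgres_conn_id='redshift_conn_id',
    sql=[
        "delete from customer using customer_staging where customer.id = customer_staging.id;",
        "insert into customer select * from customer_staging;",
        "truncate customer_staging;",
    ],
    dag=dag
)

# Run the merge only after the S3 copy has finished.
transfer_redshift >> upsert_customer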

If all of those challenges seem too much to address and you want your developers to focus on your core business logic, rather than spending time on maintaining and customizing an ETL framework, a cloud-based ETL tool like Hevo can be a great option for you.

Method 2: Using Hevo’s no code data pipeline for performing ETL jobs


Hevo Data, a No-code Data Pipeline, helps you directly transfer data from MySQL and 100+ data sources to Redshift and other Data Warehouses, BI tools, or a destination of your choice in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

Using Hevo will enable you to transfer data from Amazon S3 to Redshift within minutes without the involvement of manual scripts. Unlike Airflow ETL, Hevo works completely on the cloud and the user need not maintain any infrastructure at all. Furthermore, Hevo’s 2-step solution can transfer your data from S3 to Redshift in real-time and spare you the complex task of setting up a traditional Airflow ETL solution.

With Hevo, you can execute an ETL job from S3 to Redshift in the following two easy steps.

Step 1: Set-up the Source by Configuring Amazon S3

Configuring S3 as Source in Hevo
Image Source

Step 2: Connect your Redshift Data Warehouse to Transfer Data

Configuring Amazon Redshift as Destination in Hevo
Image Source

Well, that is all! Hevo will now stream data from S3 to Redshift in real-time.

Here are more reasons to try Hevo:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Auto Schema Mapping: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Quick Setup: Hevo with its automated features, can be set up in minimal time. Moreover, with its simple and interactive UI, it is extremely easy for new customers to work on and perform operations.
  • Transformations: Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the Data Pipelines you set up. You need to edit the event object’s properties received in the transform method as a parameter to carry out the transformation. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use for aggregation.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.

With continuous real-time data movement, ETL your data seamlessly to your destination warehouse with Hevo’s easy-to-setup and No-code interface. Try our 14-day full access free trial.

Get Started with Hevo for Free

Conclusion

While Airflow is a great means to set up and manage your ETL pipeline free of cost, it is not without its limitations. Setting up such an in-house ETL solution demands technical proficiency, and you will have to build much of it from scratch if you wish to transfer your data from any source to a Data Warehouse for analysis.

Hevo is an all-in-one cloud-based ETL pipeline that will not only help you transfer data but also transform it into an analysis-ready form. Hevo’s native integration with 100+ sources (including 40+ free sources) ensures you can move your data without the need to write complex ETL scripts. Hevo’s automated data transfer, data source connectors, and pre- and post-load transformations are more advanced than Apache Airflow’s. It will make your life easier and make data migration hassle-free.

Learn more about Hevo

Share your experience of learning about Airflow ETL in the comment section below.