Organizations with a separate transactional database and data warehouse typically have a large number of data engineering activities. They extract, transform, and load data from a variety of sources to their data warehouse.

The sources include their own customer transactions, data from SAAS (Software as a Service) offerings as well as third-party data which could be beneficial for analyzing customer behavior.

This post’s goal is to familiarize developers with Airflow ETL’s capabilities and get them started on their first ETL job implementation using Airflow.

What You’ll Learn

In this blog you’ll learn:

  1. What is Airflow ETL
  2. What are DAGs and how to implement them
  3. How to create ETL jobs with Airflow
  4. Better Airflow Alternative

What You’ll Need

You would need the following before you could move on to performing an Airflow ETL job:

Introduction to Airflow ETL

Airflow Logo

Airflow is an open-source framework and can be deployed in on-premise servers or cloud servers. It has built-in connectors to most of the industry-standard source and target combinations. It also allows writing custom plugins for databases that are not supported out of the box. Airflow provides a Directed Acyclic Graph (DAG) view which helps in managing the task flow and serves as documentation for the multitude of jobs. It also has a rich web UI to help with monitoring and job management.

To know more about Airflow ETL.

Why Use Airflow for ETL?

Airflow is good for ETL due to its ease of building, managing, and scaling data pipelines. In other words, it provides flexibility, and integration with complementary tools and gives one full control over the data workflows. Key features of Airflow that make it suitable for ETL are:

  1. Apache Airflow pipelines are defined in Python, allowing for dynamic pipeline generation. This allows for writing code that instantiates pipelines dynamically.
  2. It provides many plug-and-play operators that are ready to execute your tasks on Google Cloud Platform, Amazon Web Services, Microsoft Azure, and many other third-party services.
  3. Anyone with Python knowledge can deploy a workflow without limiting the scope of your pipelines; you can use it to build ML models, transfer data, manage your infrastructure, and more.
Migrate Your Data seamlessly for analysis!

Are you facing challenges migrating your data? Migrating your data can become seamless with Hevo’s no-code intuitive platform. With Hevo, you can:

  1. Automate Data Extraction: Effortlessly pull data from 150+ connectors(and other 60+ free sources).
  2. Transform Data effortlessly: Use Hevo’s drag-and-drop feature to transform data with just a few clicks.
  3. Seamless Data Loading: Quickly load your transformed data into your desired destination.

Try Hevo and join a growing community of 2000+ data professionals who rely on us for seamless and efficient migrations.

Get Started with Hevo for Free

Key Features Required to Design a DAG (Directed Acyclic Graph) for Airflow ETL Workflows

When crafting airflow ETL workflows, a major requirement is DAG, which can significantly enhance the speed and dependability of your data pipelines.

The features required are as follows,

  • Data Processing
    • Batch processing: It is often more efficient to process data in simultaneous batches rather than handling one dataset at a time in case of huge data sets. A practical approach involves dividing the dataset into x number of files and leveraging dynamically mapped tasks to create x instances for them. Such parallel processing ensures that the entire data is processed in a time frame equal to processing a smaller dataset.
    • Data Formats: Using efficient data formats like Avro and Parquet for processing data ensures optimized performance and significant speed to your ETL pipeline.
  • Handling and monitoring errors
    • Retries and Backoff: In Airflow, you have the freedom to define “retries” parameters for your tasks. In case of a task failure, Airflow will make necessary attempts to execute it based on the specified number of retries. In addition, you can set a “retry_delay” parameter to ensure the time Airflow should wait between these retry attempts. When tasks are allowed to retry after a delay, you can avoid immediate failures and enable subsequent successful executions once the required data is accessible.
    • Alerts: Airflow provides built-in email alert functionality, which helps you in setting up alerts to notify you when a task or DAG fails. As an alternative, you can also integrate with external tools like PagerDuty or Slack to incorporate notifier classes into your DAGs.
  • Environment and Infrastructure
    • Scaling: Using scalable tools like Celery for task execution enables you to expand your Airflow setup to handle larger datasets. Also, Airflow’s auto-scaling capability dynamically adjusts to your environment’s compute resources based on your requirements, helps you scale down excess workloads, and optimizes cost-effectiveness.
    • Resources: Allocate appropriate resources for your Airflow setup. Regularly monitoring resources can prevent over- and underutilization. 

Method 1: Using Airflow for performing ETL jobs

Apache Airflow data orchestration is crucial in providing a flexible and scalable platform for designing, scheduling, and monitoring data workflows. Some of the Airflow workflow automation strategies are: use of Directed Acyclic Graphs, Modular task definition, scheduling strategies, etc.

Airflow data transformation can be used to manage and schedule tasks by integrating with external tools to perform actual data transformations.

Airflow works on the basis of a concept called operators. Operators denote basic logical blocks in the ETL workflows. It could be anything from the movement of a file to complex transformations.

A task is formed using one or more operators. Multiple tasks are stitched together to form Directed Acyclic Graphs. Every ETL job is a DAG for airflow. 

Performing an Airflow ETL job involves the following steps:

We will now dig deep into each of the above steps of executing an Airflow ETL job using an Airflow ETL example as follows.

Solve your data replication problems with Hevo’s reliable, no-code, automated pipelines with 150+ connectors.
Get your free trial right away!

Step 1: Preparing the Source and Target Environments

Our input file for this exercise looks as below.

1, Alex,addr1,addr2
2,Vlad,addr1,addr2
3,Paul ,addr1,addr2
4,Russ,addr1,addr2

You will now login to Redshift console and create a table to hold this table. Use the below command for this.

create table customer(
id int,
name varchar(100),
address1 varchar(100),
address2 varchar(100)
)

Step 2: Starting the Airflow Web Server

Use the below command to start the Airflow Web Server

airflow webserver -p 8080

Once started, you can access the UI at localhost:8080.

Go to localhost:8080 to view the airflow UI.

Airflow ETL

Step 3: Creating a Connection to S3

To create a connection to S3, go to the Admin tab, and select connections.

Connecting with S3

Click on create and select S3 in the ‘conn type’ as below.

S3 Connection Settings

In the ‘Extra’ section, add your AWS credentials below.

{"aws_access_key_id":"_your_aws_access_key_id_", "aws_secret_access_key": "_your_aws_secret_access_key_"}

Leave all sections other than ‘conn id’ and ‘conn type’ blank. We will be using the ‘conn id’ when we create DAG in the following steps.

Integrate Amazon S3 to Redshift
Integrate Amazon S3 to BigQuery
Integrate Amazon S3 to Databricks

Step 4: Creating a Redshift Connection

Click ‘Create’ in the connections tab and add details as below. 

Creating A Redshift Connection

As above, in the Extras section add the credentials in JSON format. In the ‘conn type’ section use Postgres.

Step 5: Creating the DAG File

The DAG file will use an operator called s3_to_redshift_operator. Place the below file in the ‘dags’ folder located in the airflow installation directory. If this folder does not already exist, feel free to create one and place the file in there. 

import datetime as dt
 
from airflow import DAG
from airflow.operators.s3_to_redshift_operator import S3ToRedshiftTransfer
 
default_args = {
  'owner': 'me',
  'start_date': dt.datetime(2020, 6, 1),
  'retries': 2,
  'retry_delay': dt.timedelta(minutes=5),
}
 
dag = DAG('redshift-etl',
  default_args=default_args,
  schedule_interval='@once'
)

transfer_redshift= S3ToRedshiftTransfer(
    task_id='transfer_redshift',
    schema='schema',
    table= 'table',
    s3_bucket='s3_bucket',
    redshift_conn_id = 'redshift_conn_id',
    aws_conn_id = 'aws_conn_id',
    default_args= 'default_args'
    dag=dag
)

transfer_redshift

The above code defines a DAG and an associated task that uses the default s3_to_redshift_operator. The above code is implemented to run once on a 1-6-2020. It also specifies two retries in case the job fails. After placing this file in the ‘dags’ folder, refresh the webserver UI and you will notice the new DAG appearing as below.

DAG Files

Step 6: Triggering the Job and Monitoring the Results

To trigger the job, use the far left button on the right-hand side of the DAG list. Also, make the OFF button on the left-hand side ON. Once the run button is pressed, switch to the DAG runs view in Admin section and you will notice the status of the job as ‘running’.

DAG File Runs

For monitoring, there is another view called graph view, which shows the full graph of each DAG and status of each task in the graph.

The graph view of our Airflow ETL job is as below.  This view is very helpful in case of dags with multiple tasks. 

Graph View Of ETL Job

It shows our task as green, which means successfully completed. Access the Redshift console again and you will find the data copied to Redshift table.

And that concludes our steps to execute this simple S3 to Redshift transfer. Airflow is capable of handling much more complex DAGs and scheduling scenarios. That said, it is not without its limitations. You will now learn about some of the typical challenges in using Airflow ETL.

Challenges Involved in using Airflow as a Primary ETL Tool

  1. The above transfer works fine in case of one-off loads. But typically the requirement is for a continuous load. In this case, a staging table and additional logic to handle duplicates will all need to be part of the DAG. 
  2. Airflow ETL work is based on operators. Even though there are many built-in and community-based operators available, support for SAAS offerings is limited in airflow. If you are someone who uses a lot of SAAS applications for running your business, your developers will need to implement airflow plugins to connect to them and transfer data. 
  3. Even though airflow provides a web UI, the DAG definition is still based on code or configuration.
  4. Airflow ETL is primarily a workflow engine and the execution of transformation happens in either source or target database. This means the developers need to be an expert in both source and destination capabilities and should spend extra effort in maintaining the execution engines separately.
  5. Transformation operators in Airflow ETL are limited and in most cases, developers will have to implement custom ones. 

If all of those challenges seem too much to address and you want your developers to focus on your core business logic, rather than spending time on maintaining and customizing an ETL framework, a cloud-based ETL tool like Hevo can be a great option for you.

Method 2: Better Airflow Alternative- Hevo

With Hevo, you can execute an ETL job from S3 to Redshift in the following two easy steps.

Step 1: Set-up the Source by Configuring Amazon S3

Configuring S3 as Source in Hevo

Step 2: Connect your Redshift Data Warehouse to Transfer Data

Configure Redshift destination

Well, that is all! Hevo will now stream data from S3 to Redshift in real-time.

Best Practices for Using Airflow in ETL

A few best practices that you can follow while using Apache Airflow in ETL are:

  1. Modularize Your Code: Break the ETL tasks into smaller, modular, reusable parts.
  2. Use Airflow’s Features: Airflow provides rich monitoring, retry, and alerting features that keep your pipelines robust.
  3. Test Thoroughly: Test your DAGs before deployment.
  4. Document Your Workflows: Clear documentation may help others, mainly future you, understand your pipelines better.

Use Cases of Airflow ETL

You can use Apache Airflow to schedule the following:

  1. ETL pipelines that extract data from multiple sources and run Spark jobs or other data transformations
  2. Automate data collection from various sources, like APIs or databases.
  3. Schedule and manage complex data transformation tasks.
  4. Machine learning model training
  5. Automated generation of reports
  6. Backups and other DevOps tasks

Conclusion

  • While Airflow ETL is a great means to set up and manage your ETL pipeline free of cost, it is not without its limitations.
  • However, setting up an in-house ETL solution demands technical proficiency.
  • Furthermore, you will have to build an in-house solution from scratch if you wish to transfer your data from any source to a Data Warehouse for analysis.
  • The Airflow ETL tool handles ETL workflow scheduling, orchestration, and monitoring aspects.

Frequently Asked Questions

1. What is Airflow Data Tool?

Airflow is an open-source tool used for orchestrating complex workflows and data pipelines. It allows users to define, schedule, and monitor workflows programmatically, enabling automation of data processing tasks.

2. Is Airflow a Data Pipeline?

Yes, Airflow can be used to manage and orchestrate data pipelines.

3. What is ETL Flow?

ETL (Extract, Transform, Load) flow refers to the process of extracting data from various sources, transforming it into a suitable format, and then loading it into a target database or data warehouse for analysis or reporting.

Vivek Sinha
Director of Product Management, Hevo Data

Vivek Sinha is a seasoned product leader with over 10 years of expertise in revolutionizing real-time analytics and cloud-native technologies. He specializes in enhancing Apache Pinot, focusing on query processing and data mutability. Vivek is renowned for his strategic vision and ability to deliver cutting-edge solutions that empower businesses to harness the full potential of their data.