Airflow ETL – Easy Steps to Execute an ETL Job

on Tutorials • May 26th, 2020

Organizations with a separate transactional database and data warehouse typically have a large number of data engineering activities. They extract, transform, and load data from a variety of sources into their data warehouse. The sources include their own customer transactions, data from SaaS (Software as a Service) offerings, as well as third-party data that can be useful for analyzing customer behavior. Such ETL jobs are managed by ETL frameworks that help organize the jobs into directed workflow graphs, monitor them, and keep track of service level agreements.

Airflow is one such popular framework that helps in workflow management. Its excellent scheduling capabilities and graph-based execution flow make it a great choice for running ETL jobs. This post will help you learn the basics of Airflow and execute an ETL job to transfer data from Amazon S3 to Redshift.

The goal of this post is to familiarize developers with the capabilities of Airflow and to get them started on their first ETL job implementation using Airflow.


Understanding Airflow ETL


Airflow is an open-source framework that can be deployed on on-premise servers or in the cloud. It has built-in connectors to most of the industry-standard source and target combinations, and it also allows you to write custom plugins for databases that are not supported out of the box. Airflow provides a directed acyclic graph (DAG) view that helps in managing the task flow and serves as documentation for the multitude of jobs. It also has a rich web UI to help with monitoring and job management. Know more here.

Methods To Perform An Airflow ETL Job

Two popular methods that can be used to perform an Airflow ETL job are:

Method 1: Using Airflow as Primary ETL Tool

This method involves using Airflow's own operators and custom code to build the ETL workflow as a directed acyclic graph (DAG) of tasks. The detailed steps are covered later in this post.

Method 2: Execute an ETL job using a No-code Data Pipeline Platform, Hevo

Hevo Data provides a hassle-free, fully managed solution using its No-Code Data Pipelines. It helps you move your data effortlessly without any manual intervention. Hevo’s pre-built integrations take full charge of the data transfer process, allowing you to focus on key business activities.

Explore by signing up for a 14-day free trial.

Prerequisites

You would need the following before you could move on to performing an Airflow ETL job:

  • Python 3.6 or above. You can click here to download Python, in case you do not have it installed already.
  • Airflow installed and configured in the system. In case you do not have it installed already, you can follow this guide to set it up.
  • An AWS account with permissions for S3 and Redshift.

Methods to Perform an Airflow ETL Job

Let us now look at each of these two methods in detail.

Method 1: Using Airflow as a Primary ETL Tool

Airflow works on the basis of a concept called operators. Operators denote basic logical blocks in an ETL workflow: an operator could do anything from moving a file to running a complex transformation. A task is an instance of an operator, and multiple tasks are stitched together to form a directed acyclic graph (DAG). In Airflow, every ETL job is a DAG.
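As a minimal illustration of how operators, tasks, and DAGs fit together (this toy example uses the built-in BashOperator and is not part of the S3 to Redshift job; the task names are made up for illustration), a two-task DAG could look like this:

import datetime as dt

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# A DAG groups tasks and defines when they run
dag = DAG('toy_etl',
          start_date=dt.datetime(2020, 6, 1),
          schedule_interval='@daily')

# Each task is an instance of an operator
extract = BashOperator(task_id='extract',
                       bash_command='echo "extracting data"',
                       dag=dag)

load = BashOperator(task_id='load',
                    bash_command='echo "loading data"',
                    dag=dag)

# Stitch the tasks together: extract must finish before load starts
extract >> load

Here the >> operator sets the dependency between the two tasks, which is what turns a collection of tasks into a directed acyclic graph.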

Performing an Airflow ETL job involves the following steps:

  • Step 1: Preparing the Source and Target Environments
  • Step 2: Starting the Airflow Web Server
  • Step 3: Creating a Connection to S3
  • Step 4: Creating a Redshift Connection
  • Step 5: Creating the DAG File
  • Step 6: Triggering the Job and Monitoring the Results

We will now dig deep into each of the above steps of executing an Airflow ETL job.

Step 1: Preparing the Source and Target Environments

Our input file for this exercise looks as below.

1,Alex,addr1,addr2
2,Vlad,addr1,addr2
3,Paul,addr1,addr2
4,Russ,addr1,addr2

You will now log in to the Redshift console and create a table to hold this data. Use the below command for this.

create table customer(
  id int,
  name varchar(100),
  address1 varchar(100),
  address2 varchar(100)
);
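The source side of this exercise is an S3 bucket containing the above file. As a rough sketch (the bucket name, key, and local file name below are placeholders, not values prescribed by this tutorial), you could upload the file to S3 with boto3:

import boto3

# Assumes AWS credentials are already configured locally (e.g. via 'aws configure')
s3 = boto3.client('s3')

# Placeholder local file, bucket, and key; replace with your own
s3.upload_file('customer.csv', 'my-etl-bucket', 'data/customer.csv')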

Step 2: Starting the Airflow Web Server

Use the below command to start the Airflow web server.

airflow webserver -p 8080
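Note that the web server only serves the UI. For DAGs to actually be scheduled and executed, the Airflow scheduler must also be running, typically in a separate terminal:

airflow scheduler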

Once started, you can access the Airflow UI at localhost:8080.

Airflow ETL

Step 3: Creating a Connection to S3

To create a connection to S3, go to the Admin tab and select ‘Connections’.

Connecting with S3

Click ‘Create’ and select ‘S3’ as the ‘conn type’, as shown below.

S3 Connection Settings

In the ‘Extra’ section, add your AWS credentials below.

{"aws_access_key_id":"_your_aws_access_key_id_", "aws_secret_access_key": "_your_aws_secret_access_key_"}

Leave all other fields blank except the ‘conn id’, ‘conn type’, and the ‘Extra’ section described above. We will be using this ‘conn id’ when we create the DAG in the following steps.
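If you prefer the command line, the same connection can also be created with the Airflow CLI. The exact flags vary between Airflow versions; on the 1.10.x series used in this post, a command along these lines should work (the conn id ‘my_s3_conn’ is just a placeholder):

airflow connections --add --conn_id my_s3_conn --conn_type S3 --conn_extra '{"aws_access_key_id": "_your_aws_access_key_id_", "aws_secret_access_key": "_your_aws_secret_access_key_"}'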

Step 4: Creating a Redshift Connection

Click ‘Create’ in the connections tab and add details as below. 

Creating A Redshift Connection

As above, fill in the connection details. Use ‘Postgres’ as the ‘conn type’, since Redshift is accessed over the Postgres protocol, and add any additional credentials in JSON format in the ‘Extra’ section.
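For illustration, a typical Redshift connection filled into Airflow’s Postgres connection form would look something like the following (every value below is a placeholder, not part of this tutorial; 5439 is Redshift’s default port):

Conn Id:   redshift_conn_id
Conn Type: Postgres
Host:      my-cluster.xxxxxxxxxxxx.us-east-1.redshift.amazonaws.com
Schema:    dev
Login:     awsuser
Password:  _your_password_
Port:      5439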

Step 5: Creating the DAG File

The DAG file will use an operator called S3ToRedshiftTransfer from the s3_to_redshift_operator module. Place the below file in the ‘dags’ folder located in the Airflow home directory (~/airflow by default). If this folder does not already exist, create it and place the file there.

import datetime as dt
 
from airflow import DAG
from airflow.operators.s3_to_redshift_operator import S3ToRedshiftTransfer
 
default_args = {
  'owner': 'me',
  'start_date': dt.datetime(2020, 6, 1),
  'retries': 2,
  'retry_delay': dt.timedelta(minutes=5),
}
 
dag = DAG('redshift-etl',
  default_args=default_args,
  schedule_interval='@once'
)

transfer_redshift = S3ToRedshiftTransfer(
    task_id='transfer_redshift',
    schema='schema',                      # Redshift schema, e.g. 'public'
    table='table',                        # target table created in Step 1
    s3_bucket='s3_bucket',                # S3 bucket holding the input file
    s3_key='s3_key',                      # S3 key/prefix of the input file (required by the operator)
    redshift_conn_id='redshift_conn_id',  # conn id of the Redshift connection from Step 4
    aws_conn_id='aws_conn_id',            # conn id of the S3 connection from Step 3
    dag=dag
)

# A single task needs no dependency wiring

The above code defines a DAG and an associated task that uses the built-in S3ToRedshiftTransfer operator. With schedule_interval set to '@once' and a start date of June 1, 2020, the job runs exactly once. It also specifies two retries in case the job fails. After placing this file in the ‘dags’ folder, refresh the web server UI and you will notice the new DAG appearing as below.

DAG Files
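Before triggering the job from the UI, you can optionally execute the task once from the command line to confirm that the connections and parameters are correct. On Airflow 1.10.x this is done with the test subcommand, which takes the DAG id, task id, and an execution date:

airflow test redshift-etl transfer_redshift 2020-06-01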

Step 6: Triggering the Job and Monitoring the Results

To trigger the job, use the leftmost button (‘Trigger Dag’) in the links on the right-hand side of the DAG list, and switch the toggle on the left-hand side from ‘Off’ to ‘On’. Once the job is triggered, switch to the DAG Runs view in the Admin section and you will notice the status of the job as ‘running’.

DAG File Runs
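The same run can also be kicked off from the command line, which is handy for scripting (the DAG id below is the one defined in Step 5):

airflow trigger_dag redshift-etl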

For monitoring, there is another view called the Graph View, which shows the full graph of each DAG and the status of each task in the graph.

The graph view of our ETL job is shown below. This view is very helpful in the case of DAGs with multiple tasks.

Graph View Of ETL Job

It shows our task in green, which means it completed successfully. Access the Redshift console again and you will find the data copied to the Redshift table.

And that concludes our steps to execute this simple S3 to Redshift transfer. Airflow is capable of handling much more complex DAGs and scheduling scenarios. That said, it is not without its limitations. You will now learn about some of the typical challenges of using Airflow as your primary ETL tool.

Challenges Involved in using Airflow as a Primary ETL Tool

  1. The above transfer works fine for one-off loads, but the typical requirement is a continuous load. In that case, a staging table and additional logic to handle duplicates need to become part of the DAG, as in the sketch after this list.
  2. Airflow works based on operators. Even though there are many built-in and community-contributed operators available, support for SaaS offerings is limited in Airflow. If you use a lot of SaaS applications to run your business, your developers will need to implement Airflow plugins to connect to them and transfer data.
  3. Even though Airflow provides a web UI, the DAG definition is still based on code or configuration.
  4. Airflow is primarily a workflow engine, and the actual transformations execute in either the source or the target database. This means developers need to be experts in both source and destination capabilities and must spend extra effort maintaining the execution engines separately.
  5. Transformation operators in Airflow are limited, and in most cases developers will have to implement custom ones.
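As a rough sketch of the staging-table pattern mentioned in the first point (table names, bucket, and credentials below are placeholders; Redshift has no native upsert, so a delete-then-insert inside a transaction is a common workaround):

begin;

-- Load the new batch into a staging table with the same structure
create temp table customer_staging (like customer);
copy customer_staging
from 's3://my-etl-bucket/data/customer.csv'
credentials 'aws_access_key_id=_your_key_;aws_secret_access_key=_your_secret_'
csv;

-- Remove rows that are about to be replaced, then insert the new batch
delete from customer using customer_staging
where customer.id = customer_staging.id;
insert into customer select * from customer_staging;

commit;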

If all of those challenges seem too much to address and you want your developers to focus on your core business logic, rather than spending time on maintaining and customizing an ETL framework, a cloud-based ETL tool like Hevo can be a great option for you.

Method 2: Execute an ETL job using a No-code Data Pipeline Platform, Hevo

Using Hevo, you can transfer data from Amazon S3 to Redshift within minutes without writing manual scripts. Unlike Airflow ETL, Hevo is completely cloud-based and you do not need to maintain any infrastructure at all. With Hevo, you can execute an ETL job from S3 to Redshift in two easy steps.

Step 1: Set-up the Source by Configuring Amazon S3

Connecting S3 data source

Step 2: Connect your Redshift Data Warehouse to Transfer Data

Connecting Redshift destination

Well, that is all! Hevo will now stream data from S3 to Redshift in real-time. Sign up for a risk-free 14-day free trial here to give Hevo a whirl!

More Reasons to Try Hevo:

  1. Minimal Setup Time: Hevo can be set up on a visual interface in no time. This ensures that your ETL projects come to life instantly.
  2. 100s of Data Integrations: Hevo can seamlessly ETL data from a wide array of data sources – databases, cloud applications, analytics applications, file storage, and so on. Explore the complete integration list here.
  3. Robust Data Transformations: Hevo can easily transform the data both before and after loading it into the warehouse. This ensures that the data is always ready for analysis.
  4. Automatic Schema Mapping: Once you have connected your data source, Hevo automatically detects the complete schema of the incoming data and maps it to the destination tables. This rids you of any additional schema mapping hassles when dealing with large sets of data.
  5. Incremental Data Load: Hevo allows you to transfer only the data that has changed or been modified since the last sync. This ensures efficient utilization of bandwidth at both your source and destination systems.
  6. Dedicated Support: Hevo provides support over email, call, and chat, 24x7, ensuring that you always have a reliable partner to solve your ETL troubles in the hour of need.

While Airflow ETL is a great means to set up and manage your ETL pipeline free of cost, it is not without its limitations. If you are looking for a seamless way to set up your data pipeline infrastructure, do try out Hevo by signing up for a 14-day free trial here.
