Understanding Apache Airflow: 4 Critical Aspects

• February 7th, 2022

Data plays a crucial role in improving and managing the operations of any organization. However, adopting data-centric processes can be challenging, as it requires coordinating tasks across heterogeneous systems and integrating them for product deployment. To address these challenges, in 2014 the engineers at Airbnb developed Airflow, an open-source framework that let them write and schedule workflows and, through a built-in web interface, monitor them. After the project's success, it was adopted by the Apache Software Foundation, first as an incubator project in 2016 and later as a top-level project in 2019.

In this blog, we aim to provide a comprehensive introduction to Airflow that covers everything from installation to its use cases.

Prerequisites

  • Understanding of automation.

Understanding Apache Airflow


Apache Airflow is an open-source workflow management platform developed for scheduling and monitoring data pipelines. It is written in Python, and workflows are created through Python scripts, following the "configuration as code" principle. Using Python lets developers import libraries and classes to help them build their workflows.

Key Features of Airflow

  • Easy to Use: Anyone familiar with the Python programming language can deploy a data pipeline. Airflow places no restrictions on pipeline scope, so it can be used to build machine learning models, manage infrastructure, transfer data, and more.
  • Pure Python: Users create data pipelines with standard Python features, including datetime formats for scheduling and loops for generating tasks dynamically. This lets users build data pipelines as flexibly as possible.
  • Useful UI: A robust, modern web application lets users monitor, schedule, and manage their data pipelines. It provides full visibility into the status and logs of completed and ongoing tasks at all times, with no old, cron-like interfaces to learn.
  • Robust Integrations: Airflow ships with many ready-to-use operators for Google Cloud Platform, Amazon Web Services, and a variety of other third-party services. This makes it easy to apply Airflow to existing infrastructure and extend it to next-generation technologies.

Simplify Data Analysis with Hevo’s No-code Data Pipeline

A fully managed No-code Data Pipeline platform like Hevo helps you integrate data from 100+ data sources (including 40+ Free Data Sources) to a destination of your choice in real-time in an effortless manner. Hevo provides support for PostgreSQL as both a source and a destination. With its minimal learning curve, Hevo can be set up in just a few minutes, allowing users to load data without having to compromise performance. Its strong integration with umpteen sources gives users the flexibility to bring in data of different kinds in a smooth fashion without having to write a single line of code.

GET STARTED WITH HEVO FOR FREE

Check Out Some of the Cool Features of Hevo:

  • Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
  • Real-Time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
  • Transformations: Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the Data Pipelines you set up. You need to edit the event object’s properties received in the transform method as a parameter to carry out the transformation. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use.
  • Connectors: Hevo supports 100+ Integrations from sources to SaaS platforms, files, databases, analytics, and BI tools. It supports various destinations including Amazon Redshift, Firebolt, Snowflake Data Warehouses; Databricks, Amazon S3 Data Lakes, MySQL, SQL Server, TokuDB, DynamoDB, PostgreSQL databases to name a few.  
  • 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
  • Scalable Infrastructure: Hevo has in-built integrations for 100+ sources, that can help you scale your data infrastructure as required.
  • 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.

Simplify your Data Analysis with Hevo today! 

SIGN UP HERE FOR A 14-DAY FREE TRIAL!

Introducing Apache Airflow in a Python Environment

Installation

  • Step 1: Install it from PyPI using pip as follows:
pip install apache-airflow
  • Step 2: Initialize the metadata database as follows (on Airflow 2.x, the equivalent command is airflow db init):
airflow initdb
  • Step 3: Start the Web Server; the default port is 8080. You can use the following command:
airflow webserver -p 8080
  • Step 4: Start the scheduler to finish this step as follows:
airflow scheduler

Sub-Packages

The apache-airflow package on PyPI is a basic package that installs only what is needed to get started.

However, it performs conditional imports of some operators that require extra dependencies, so sub-packages (extras) are installed depending on the user's requirements.

Following is the list of a few subpackages available:

Subpackage    Command
all           pip install apache-airflow[all]
s3            pip install apache-airflow[s3]
gcp_api       pip install apache-airflow[gcp_api]
mysql         pip install apache-airflow[mysql]
postgres      pip install apache-airflow[postgres]
hdfs          pip install apache-airflow[hdfs]
slack         pip install apache-airflow[slack]
hive          pip install apache-airflow[hive]
password      pip install apache-airflow[password]
rabbitmq      pip install apache-airflow[rabbitmq]

Architecture and Core Concepts

Apache Airflow enables users to build and run workflows. A workflow is defined as a DAG (Directed Acyclic Graph). The following section sheds some light on how DAGs work.

DAGs

A DAG, or Directed Acyclic Graph, is a collection of tasks the user wishes to run in Airflow. It specifies the dependencies between tasks and, therefore, the order in which to execute them.

[Image: two example task graphs - an acyclic graph (top) with a clear execution order, and a cyclic graph (bottom) in which task 2 and task 3 depend on each other]

This type of graph is called a directed acyclic graph (DAG) because it has directed edges and no loops or cycles. The acyclic property is important because it protects us from circular dependencies (as seen in the diagram above, where task 2 depends on task 3 and vice versa). Such circular dependencies become troublesome when the graph is executed: task 2 can only run after task 3 has finished, and task 3 can only run after task 2 has completed. This logical inconsistency causes a deadlock in which neither task 2 nor task 3 can run, preventing the graph from being executed.

In other words, cycles inhibit task execution. In the acyclic graph (top), there is a clear path through the three tasks, whereas in the cyclic graph (bottom) the interdependency between tasks 2 and 3 leaves no clear execution route. Airflow relies on the acyclic property of DAGs to resolve and execute these task graphs efficiently.
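
To make this concrete, here is a minimal sketch of a DAG definition, assuming Airflow 2.x import paths; the DAG ID, schedule, and commands are illustrative:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A DAG has a unique ID, a start date, and a schedule.
with DAG(
    dag_id="example_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
) as dag:
    # Three tasks; each operator instance becomes a node in the graph.
    task_1 = BashOperator(task_id="task_1", bash_command="echo 'extract'")
    task_2 = BashOperator(task_id="task_2", bash_command="echo 'transform'")
    task_3 = BashOperator(task_id="task_3", bash_command="echo 'load'")

    # The >> operator declares the directed edges: 1 -> 2 -> 3, with no cycle.
    task_1 >> task_2 >> task_3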

Operators

An operator represents a single task and determines what actually runs when the DAG executes. The DAG ensures that operators run in the correct order; aside from those dependencies, operators generally run independently and may even run on two completely different machines. Airflow only loads operators that are assigned to a DAG. Two commonly used operators are covered here:

  1. BashOperator
  2. PythonOperator

1. BashOperator

Use the BashOperator to execute commands in a Bash shell. 

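A minimal sketch, assuming Airflow 2.x import paths; the DAG and task IDs are illustrative:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="bash_example", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    # Runs a single shell command in a Bash subprocess when the task executes.
    print_date = BashOperator(
        task_id="print_date",
        bash_command="date",
    )
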
Templating

You can use Jinja templates to parameterize the bash_command argument.

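For instance, a sketch assuming Airflow 2.x; the task ID and parameter names are illustrative:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="bash_templating", start_date=datetime(2022, 1, 1), schedule_interval="@daily") as dag:
    # bash_command is a templated field: {{ ds }} renders to the run's logical date,
    # and values under params are available as {{ params.<name> }}.
    templated_echo = BashOperator(
        task_id="templated_echo",
        bash_command="echo 'Run date: {{ ds }}, greeting: {{ params.greeting }}'",
        params={"greeting": "hello"},
    )
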
Troubleshooting 

Jinja template not found: add a space after the script name when calling a Bash script directly with the bash_command argument. Airflow treats a value ending in .sh or .bash as the path to a Jinja template file and tries to render it, which fails if the template engine cannot find the file; the trailing space prevents this lookup.

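A sketch of both variants, assuming Airflow 2.x; the script path is hypothetical:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="bash_script_example", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    # Fails at render time with "TemplateNotFound": the value ends in ".sh",
    # so Airflow tries to load and render it as a Jinja template file.
    broken = BashOperator(
        task_id="run_script_broken",
        bash_command="/home/airflow/scripts/test.sh",
    )

    # Works: the trailing space stops the template lookup, and the script path
    # is passed to Bash as-is.
    fixed = BashOperator(
        task_id="run_script_fixed",
        bash_command="/home/airflow/scripts/test.sh ",  # note the trailing space
    )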

2. PythonOperator

Use the PythonOperator to execute Python callables.

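A minimal sketch, assuming Airflow 2.x import paths; the callable and IDs are illustrative:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    # The callable holds the task's logic; its return value is pushed to XCom.
    print("Hello from the PythonOperator")
    return "done"


with DAG(dag_id="python_example", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    hello_task = PythonOperator(
        task_id="say_hello",
        python_callable=say_hello,
    )
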
Passing in arguments

To pass additional arguments to the Python callable, you can use the op_args and op_kwargs arguments.

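For example, a sketch assuming Airflow 2.x; the callable and values are illustrative:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def greet(name, greeting="Hello"):
    print(f"{greeting}, {name}!")


with DAG(dag_id="python_args_example", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    # op_args are passed to the callable positionally, op_kwargs as keyword arguments.
    greet_task = PythonOperator(
        task_id="greet",
        python_callable=greet,
        op_args=["Airflow"],
        op_kwargs={"greeting": "Hi"},
    )
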
Templating

When you set the provide_context argument to True, it passes in an additional set of keyword arguments: one for each of the Jinja template variables and a templates_dict argument. The templates_dict argument is templated, so each value in the dictionary is evaluated as a Jinja template.
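
A minimal sketch of the idea; on Airflow 1.10 you would add provide_context=True as described above, while on Airflow 2.x the context is matched to the callable's arguments automatically (IDs and values are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def report(ds, templates_dict, **kwargs):
    # "ds" is one of the Jinja template variables passed in as keyword arguments;
    # every value in templates_dict has already been rendered as a Jinja template.
    print(f"Run date: {ds}, rendered value: {templates_dict['run_info']}")


with DAG(dag_id="templates_dict_example", start_date=datetime(2022, 1, 1), schedule_interval="@daily") as dag:
    report_task = PythonOperator(
        task_id="report",
        python_callable=report,
        templates_dict={"run_info": "the execution date is {{ ds }}"},
    )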

Tasks

Once an operator is instantiated, it is referred to as a task. Instantiation means supplying specific values when calling the operator, and the parameterized task becomes a node in the DAG. A task instance represents a specific run of a task and carries an indicative state such as "running", "success", or "failed".

Hooks

Hooks are interfaces to external platforms and databases such as S3 (Simple Storage Service), MySQL, HDFS, and Hive. They act as building blocks for operators and keep connection information and authentication code out of data pipelines, centralized in Airflow's metadata database.
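
As an illustration, here is a sketch of a hook used inside a Python callable; it assumes the apache-airflow-providers-postgres package is installed and that a Connection with the hypothetical ID my_postgres has been configured in Airflow:

from airflow.providers.postgres.hooks.postgres import PostgresHook


def count_orders():
    # The hook reads credentials from the "my_postgres" Connection stored in
    # Airflow's metadata database, so no secrets appear in the pipeline code.
    hook = PostgresHook(postgres_conn_id="my_postgres")
    rows = hook.get_records("SELECT COUNT(*) FROM orders")  # "orders" is an illustrative table
    print(f"Order count: {rows[0][0]}")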

Airflow in Clouds

Here are a few instances where this Airflow tool can be leveraged seamlessly on Cloud platforms:

  1. AWS
  2. GCP
  3. Azure

1. AWS

It can be deployed in AWS using services such as EFS or S3 (Elastic File System and Simple Storage Service) for storage and Amazon RDS (Relational Database Service) for its metadata database. Airflow also provides various AWS-specific hooks and operators that let you integrate with different services within the AWS cloud platform.
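
For example, a sketch using the S3 hook; it assumes the apache-airflow-providers-amazon package is installed, an AWS Connection (here the default aws_default) exists, and the bucket name is illustrative:

# Assumes: pip install apache-airflow-providers-amazon
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def list_s3_keys():
    # Credentials come from the "aws_default" Connection in Airflow's metadata database.
    hook = S3Hook(aws_conn_id="aws_default")
    keys = hook.list_keys(bucket_name="my-example-bucket")  # illustrative bucket
    print(keys)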

2. Google Cloud Platform (GCP)

Airflow provides many GCP-specific hooks and operators that enable users to integrate with different services in the Google Cloud Platform. These are installed with the apache-airflow-providers-google package.
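
For instance, a sketch using one of the GCS transfer operators; it assumes the provider package is installed, and the DAG ID and bucket names are illustrative:

# Assumes: pip install apache-airflow-providers-google
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

with DAG(dag_id="gcs_copy_example", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    # Copies matching objects from one GCS bucket to another.
    copy_files = GCSToGCSOperator(
        task_id="copy_files",
        source_bucket="my-source-bucket",
        source_object="data/*.csv",
        destination_bucket="my-destination-bucket",
    )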

3. Azure

It can be deployed in Azure using services such as Azure File Storage or Blob Storage for storing files and Azure SQL Database for its metadata database. Airflow provides several Azure-specific hooks and operators that let users integrate with different services within the Azure cloud platform.

Use Cases

Following are real-world examples of how Apache Airflow has helped businesses reach their goals.

  1. Adobe
  2. Plarium
  3. Adyen

1. Adobe


Adobe is a software company famously known for multimedia and creativity products such as Acrobat Reader and Photoshop. Adobe Experience Platform uses Apache Airflow's plugin interface to write custom operators, and its orchestration service uses Airflow's execution engine to schedule and execute various data pipelines. Airflow's highly comprehensive web UI provides data pipeline-related insights.

2. Plarium


Plarium is a gaming platform that offers over 20 games, including Vikings: War of Clans, the Stormfall franchise, and Raid: Shadow Legends. Building a cross-platform gaming platform requires sophisticated workflow orchestration for tasks related to game development. Airflow comes with a plethora of useful built-in features, including integrations with other tools, and its DAG model helps users avoid mistakes and follow general patterns while creating data pipelines. As a result, Plarium was able to simplify the process of building more complex workflows.

3. Adyen


Adyen is a financial technology platform that provides end-to-end payments, revenue protection, and finance management in a single solution. As the number of users and teams grew, producing huge amounts of data, the company ran into several issues, one of which was scheduling and executing multiple ETL tasks simultaneously. Airflow's existing operators made it easier to write ETL DAGs.

Conclusion

Throughout the blog, we learned that Apache Airflow has a simple user interface and provides versatile Python scripting, making it ideal for data management. We also observed the power of DAGs to work efficiently and how they can be deployed and integrated on various cloud platforms. Moreover, we discovered that because of its reliable features, many large companies rely on it for orchestrating critical data pipelines.

Extracting complex data from a diverse set of data sources to carry out an insightful analysis can be challenging, and this is where Hevo saves the day! Hevo offers a faster way to move data from 100+ Data Sources including Databases or SaaS applications into a destination of your choice or a Data Warehouse to be visualized in a BI tool. Hevo is fully automated and hence does not require you to code.

VISIT OUR WEBSITE TO EXPLORE HEVO

Want to take Hevo for a spin? SIGN UP and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.
