Running Airflow Locally: 3 Easy Steps

March 2nd, 2022

Organizations spend a significant amount of money managing their data. They use various tools and platforms to keep data flowing smoothly so that their business activities run efficiently. Data Pipelines now power the modern applications used by companies. Apache Airflow is one of the most widely used workflow management platforms for maintaining and managing Data Pipelines.

Developers use Airflow to create workflows and execute tasks related to Data Pipelines. As the number of backend tasks increases, it is necessary to make sure that enough resources are available for other tasks. Running Airflow Locally helps Developers create workflows and schedule and maintain tasks.

Running Airflow Locally allows Developers to test and create scalable applications using Python scripts. In this article, you will learn about the need for using Airflow and the steps for Running Airflow Locally. You will also read how it helps companies manage Data Pipelines and build scalable solutions.

What is Apache Airflow?

Apache Airflow is a popular platform that was created by the open-source community. At its core, Apache Airflow is a workflow management platform that was designed primarily for managing workflows in data pipelines. 

The project was initially launched at Airbnb back in 2014, as the company’s workflows began to get increasingly complex. With the help of Airflow, Airbnb was able to create and optimize its workflows, while monitoring them through a centralized interface. 

Airflow joined the Apache Incubator in 2016 and became a top-level Apache project in 2019. The platform itself is written in Python, and Python scripts are used to create workflows.

Key Features of Apache Airflow

Today, many companies around the globe use Airflow to manage their ML engineering, data, and software engineering pipelines. Apache Airflow was created around four core principles, which are also its main features:

  • Scalable: Airflow has a modular architecture, so it can be scaled according to the needs of the organization. It uses a message queue to orchestrate an arbitrary number of workers.
  • Elegant: Airflow pipelines aim to be explicit and lean. Parameterization is built into the core through Jinja templating.
  • Extensible: Airflow lets you define your own operators and extend its libraries to better suit your environment. Essentially, you can write your own operators or integrations.
  • Dynamic: Since all Airflow pipelines are written in Python, you can also create dynamic pipelines. This means writing code that generates pipelines programmatically.

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps you load data from any data source, such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services, and simplifies the ETL process. It supports 100+ data sources (including 30+ free data sources), and setup is a 3-step process: select the data source, provide valid credentials, and choose the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without you having to write a single line of code.

Its completely automated pipeline delivers data in real time, without any loss, from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  1. Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  2. Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  3. Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  4. Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  5. Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  6. Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
  7. Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.

Need to Use Airflow

There are several reasons why you would want to use Airflow. Some of the main ones are listed below:

  • Great for extracting data: Airflow has a ton of integrations that you can use to optimize and run data engineering tasks.
  • Switch out cron jobs: It’s quite hard to monitor cron jobs. With Airflow, instead of manually connecting over ssh to figure out why a job failed, you can see whether or not your code ran in a visual UI. When a task fails, Airflow can send you a notification.
  • Crawling data: You can also write tasks that crawl data from the internet at regular intervals and write it to your database. For instance, you can schedule periodic runs to track your competitor’s pricing or to download and store all comments on your Facebook page.
  • Completely transform data: If you want to further process your data, you can also connect with external services using Airflow.

Steps for Running Airflow Locally

Now that you have a basic understanding of Airflow, this section walks through the steps for Running Airflow Locally. First, there are a few prerequisites that you need to know about. For starters, you’re going to need some basic knowledge of Python, Bash, and Git.

More importantly, you’re going to need a decent understanding of some basic Airflow concepts. Then, there are the system requirements: you need Python 3 and pip. The Heroku CLI is only required if you later deploy the project to Heroku; none of the steps below use it. For Running Airflow Locally on your machine, follow the steps listed below:

Step 1: Creating a Local Project

The first thing is to set up the local project. You need to open the terminal, and then establish the directory where your project is going to be set up. For instance, it’s best if you store it in the home directory, just for the sake of simplicity.

So, open up the terminal and run the following commands:

$ cd ~ # or ~/dev, ~/code, etc
$ mkdir localairflow
$ cd localairflow

Step 2: Installing the Dependencies

Now that you have specified the location, the next thing you need is a Python virtual environment to manage the project and its dependencies. There are several tools you can use, such as Anaconda or Virtualenv, or any other environment manager you prefer.

To create an environment with Anaconda, use this:

$ conda create -n airflow python=3.9
$ conda activate airflow

Or, if you are using Virtualenv, use the following command:

$ pip install virtualenv
$ python3 -m virtualenv venv
$ source venv/bin/activate

By the way, if you haven’t yet installed Airflow, you can do this with the following command:

pip install -U apache-airflow
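
Before moving on, it can help to confirm that the install landed in the environment you just activated. The following is a minimal sketch using only the Python standard library; the helper names in_virtualenv and installed_version are hypothetical and not part of Airflow. Note that the prefix comparison detects venv and Virtualenv environments but is expected to report False inside a Conda environment.

```python
import sys
from importlib import metadata
from typing import Optional


def in_virtualenv() -> bool:
    """Return True when running inside a venv/virtualenv environment."""
    # Inside a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the base interpreter.
    return sys.prefix != sys.base_prefix


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python:", sys.version.split()[0])
    print("Inside a virtual environment:", in_virtualenv())
    # "apache-airflow" is the distribution name used by the pip command above.
    print("apache-airflow:", installed_version("apache-airflow") or "not installed")
```

If the last line prints "not installed", the pip command most likely ran against a different Python interpreter than the one in the active environment.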

Now, you need to create a home folder for Airflow and tell Airflow where it is. To do this, run the following:

$ mkdir -p ~/airflow/dags
$ export AIRFLOW_HOME=~/airflow
$ export PATH=$PATH:~/.local/bin

Once you run these commands, the ~/airflow folder exists with a dags subfolder inside it, and the AIRFLOW_HOME environment variable points to it.
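
To see which directory Airflow will treat as its home, you can mirror the lookup in a few lines of Python. This is only a sketch of the convention, assuming Airflow falls back to ~/airflow when the AIRFLOW_HOME variable is unset and looks for DAG files in a dags subfolder; the function names are hypothetical.

```python
import os
from pathlib import Path


def airflow_home(env=None) -> Path:
    """Resolve the Airflow home directory from an environment mapping."""
    env = os.environ if env is None else env
    # Assumption: ~/airflow is used when AIRFLOW_HOME is unset or empty.
    raw = env.get("AIRFLOW_HOME") or "~/airflow"
    return Path(raw).expanduser()


def dags_folder(env=None) -> Path:
    """Assumed default location of DAG files: dags/ under the home folder."""
    return airflow_home(env) / "dags"


if __name__ == "__main__":
    home = airflow_home()
    print("Airflow home:", home, "(exists)" if home.is_dir() else "(missing)")
    print("DAGs folder: ", dags_folder())
```

Running it in the same terminal where you exported the variable shows whether the export took effect.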

Step 3: Running Airflow Locally

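Now, you need to initialize Airflow’s metadata database. After that, you can tweak the generated configuration file as you see fit. On Airflow 2.x, initialize the database as shown here (Airflow 1.10 used airflow initdb, and releases from 2.7 onward prefer airflow db migrate):

$ cd ~/airflow
$ airflow db init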

If you want Airflow to store elements like the airflow.cfg file, the log files, and the database somewhere else, point the AIRFLOW_HOME environment variable at that location, for example a subdirectory of the main project folder. Set it before initializing Airflow, because those files are created under whatever AIRFLOW_HOME points to. Here’s how to do that:

$ export AIRFLOW_HOME=~/airflow/airflow_subdir

Now that you have everything set up, you are ready for Running Airflow Locally. Start the scheduler in one terminal window and the webserver in a new terminal window, using the commands below.

# Activate airflow env if needed (in each terminal)
$ conda activate airflow
# Terminal 1: scheduler
$ airflow scheduler
# Terminal 2: webserver
$ airflow webserver

Airflow is now running locally, and the web UI is available at localhost:8080.
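
If the page does not load, a quick scripted check can tell you whether anything is answering on that port. The sketch below assumes an Airflow 2.x webserver, which exposes a /health endpoint returning JSON with a status per component; the function name webserver_health is hypothetical, and the exact shape of the payload is an assumption, so the code reads it defensively.

```python
import json
import urllib.request


def webserver_health(base_url="http://localhost:8080", timeout=3.0):
    """Return the parsed /health payload, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts and HTTP error
        # statuses (urllib.error.URLError subclasses OSError); ValueError
        # covers a response body that is not valid JSON.
        return None


if __name__ == "__main__":
    health = webserver_health()
    if health is None:
        print("Webserver not reachable on localhost:8080")
    else:
        # Assumed payload shape: {"component": {"status": "..."}, ...}
        for component, info in health.items():
            status = info.get("status") if isinstance(info, dict) else info
            print(component, "->", status)
```

A None result usually means the webserver is not running or is listening on a different port.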

You will have the option to trigger a few of the DAGs for Running Airflow Locally. For those who don’t know, DAG stands for Directed Acyclic Graph. Essentially, it describes a graph whose edges have a direction and never form a cycle.

It’s essentially a pipeline in which each node is a task and each edge is a dependency between two tasks.
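
The "directed" and "acyclic" properties can be illustrated without Airflow at all. The following sketch uses the graphlib module from the Python standard library (Python 3.9 or later) on a hypothetical four-task pipeline; the task names are made up, and this is not how Airflow stores or schedules DAGs internally.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each key is a task, each value is the set of
# tasks that must finish before it can start (its upstream tasks).
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"clean", "aggregate"},
}


def execution_order(graph):
    """Return one valid run order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(graph).static_order())


def is_acyclic(graph):
    """Return True when the dependency graph contains no cycle."""
    try:
        execution_order(graph)
        return True
    except CycleError:
        return False


if __name__ == "__main__":
    print(execution_order(pipeline))
    # Two tasks that wait on each other can never start, so this is not a DAG.
    print(is_acyclic({"a": {"b"}, "b": {"a"}}))
```

Because every task only points back at tasks that run before it, a valid order always exists; a cycle would leave the tasks waiting on each other forever.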

When you are Running Airflow Locally without the scheduler, unpausing a DAG will not cause any of its tasks to run. Start the scheduler with the airflow scheduler command in its own terminal window, and the unpaused DAGs will be picked up.

That’s it! You have completed installation for Running Airflow Locally.

Conclusion

In this article, you read about Apache Airflow, its key features, and why it is used for workflow management of Data Pipelines. You also learned the steps for Running Airflow Locally on localhost. When you first set up Airflow, it uses an SQLite database by default. However, since SQLite does not support running tasks in parallel, you may outgrow it relatively quickly. If you want, you can always connect Airflow to another database, such as PostgreSQL or MySQL. While Airflow is great for creating pipelines, it’s generally not a suitable solution for non-technical teams.
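
As a rough illustration of switching databases, the sketch below builds a PostgreSQL connection URI and the environment variables that would point Airflow at it. It rests on several assumptions: Airflow 2.3 or later (older 2.x releases read AIRFLOW__CORE__SQL_ALCHEMY_CONN instead), a PostgreSQL server you have already created, and placeholder credentials. The function names are hypothetical.

```python
from urllib.parse import quote_plus


def postgres_conn_uri(user, password, host, database, port=5432):
    """Build a SQLAlchemy-style PostgreSQL URI with encoded credentials."""
    # Percent-encode the credentials so characters such as "@" or "/"
    # cannot break the URI structure.
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


def airflow_db_env(conn_uri):
    """Environment variables that point Airflow at the given database.

    Assumption: Airflow 2.3+ reads the connection from the [database]
    section; LocalExecutor is chosen so tasks can run in parallel.
    """
    return {
        "AIRFLOW__DATABASE__SQL_ALCHEMY_CONN": conn_uri,
        "AIRFLOW__CORE__EXECUTOR": "LocalExecutor",
    }


if __name__ == "__main__":
    # Placeholder credentials for illustration only.
    uri = postgres_conn_uri("airflow", "p@ss/word", "localhost", "airflow")
    for name, value in airflow_db_env(uri).items():
        print(f"export {name}='{value}'")
```

After exporting these variables, the metadata database needs to be initialized again so the tables are created in the new database.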

Companies need to analyze their business data stored in multiple data sources. The data needs to be loaded to the Data Warehouse to get a holistic view of the data. Hevo Data is a No-code Data Pipeline solution that helps to transfer data from 100+ data sources to desired Data Warehouse. It fully automates the process of transforming and transferring data to a destination without writing a single line of code.

Want to take Hevo for a spin? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Share your experience of learning about Running Airflow Locally in the comments section below!
