Organizations spend a significant amount of money managing their data. They use various tools and platforms to keep data flowing smoothly and efficiently through their business activities. Data Pipelines now power the modern applications companies rely on, and Apache Airflow is one of the most widely used workflow management platforms for building and maintaining them.

Developers use Airflow to create workflows and execute tasks related to Data Pipelines. As the number of backend tasks grows, it is necessary to make sure that enough resources remain available for other work. Running Airflow Locally helps Developers create workflows and schedule and maintain their tasks.

Running Airflow Locally allows Developers to test and create scalable applications using Python scripts. In this article, you will learn about the need for using Airflow and the steps for Running Airflow Locally. You will also read about how to start Airflow and how it helps companies manage Data Pipelines and build scalable solutions.

What is Apache Airflow?

Apache Airflow Logo

Apache Airflow is a popular platform maintained by the open-source community. At its core, it is a workflow management platform designed primarily for authoring, scheduling, and monitoring the workflows in Data Pipelines.

The project was initially launched at Airbnb back in 2014, as the company’s workflows began to get increasingly complex. With the help of Airflow, Airbnb was able to create and optimize its workflows, while monitoring them through a centralized interface. 

After being open-sourced, Airflow joined the Apache Incubator and eventually graduated to a top-level Apache project. The platform itself is coded entirely in Python, and Python scripts are used to create workflows.

Accomplish seamless Data Migration with Hevo!

Looking for the best ETL tools to connect your data sources? Rest assured, Hevo’s no-code platform helps streamline your ETL process. Try Hevo and equip your team to: 

  1. Integrate data from 150+ sources (60+ free sources).
  2. Simplify data mapping with an intuitive, user-friendly interface.
  3. Instantly load and sync your transformed data into your desired destination.

Choose Hevo for a seamless experience and see why Industry leaders like Meesho say: “Bringing in Hevo was a boon.”

Get Started with Hevo for Free

Key Features of Apache Airflow

Today, many companies around the globe use Airflow to manage their ML engineering, data, and software engineering pipelines. Apache Airflow was created around four core principles, and some of its main features are listed below:

  • Scalable: Airflow’s modular architecture means it can be scaled according to the needs of the organization. It uses a message queue to orchestrate an arbitrary number of workers.
  • Elegant: Airflow pipelines are meant to be explicit and lean. Parametrization is built into their core with the help of Jinja templating.
  • Extensible: Airflow lets you define your own operators and extend its libraries to better suit your environment. Essentially, you can write your own operators or integrations.
  • Dynamic: Since all Airflow pipelines are written in Python, you can also create dynamic pipelines. This means writing code that generates pipelines on the fly, as shown in the sketch just after this list.
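
To make the Dynamic principle concrete, here is a minimal sketch of how a single Python file can generate several similar DAGs in a loop. It assumes an Airflow 2.x installation, and the source names and DAG IDs (ingest_orders and so on) are purely illustrative:

# A minimal sketch of dynamic DAG generation (assumes Airflow 2.x).
# The source names and DAG IDs below are hypothetical examples.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

for source in ["orders", "customers", "payments"]:
    with DAG(
        dag_id=f"ingest_{source}",          # one DAG per source
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        BashOperator(
            task_id=f"extract_{source}",
            bash_command=f"echo 'extracting {source}'",
        )

    # Expose each generated DAG at module level so Airflow can discover it
    globals()[f"ingest_{source}"] = dag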

To know more about Apache Airflow, click here.

Need to Use Airflow

There are several reasons why you would want to use Airflow. Some of the main ones are listed below:

  • Great for extracting data: Airflow has a ton of integrations that you can use to run and optimize Data Engineering tasks.
  • Switch out cron jobs: Cron jobs are quite hard to monitor. Instead of manually SSH-ing into a machine to figure out why one failed, you can see whether or not your code ran through a visual UI, and Airflow can notify you when a scheduled job fails (a sketch of such a DAG follows this list).
  • Crawling data: You can also write tasks that crawl data from the internet at specific intervals and write it to your database. For instance, you can set periodic intervals to track your competitor’s pricing or to download and store all the comments on your Facebook page.
  • Completely transform data: If you want to further process your data, you can also connect to external services using Airflow.
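
As an illustration of the cron-replacement point above, here is a hedged sketch of a DAG that runs a shell script on a cron schedule, retries it, and emails you if it still fails. It assumes Airflow 2.x with SMTP configured; the schedule, script path, and email address are placeholder values:

# A sketch of replacing a cron job with a scheduled DAG (assumes Airflow 2.x).
# The schedule, script path, and email address are placeholder values.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_report",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 6 * * *",   # same cron syntax as the crontab entry it replaces
    catchup=False,
    default_args={
        "retries": 2,                         # retry before alerting
        "retry_delay": timedelta(minutes=5),
        "email": ["you@example.com"],         # requires SMTP settings in airflow.cfg
        "email_on_failure": True,
    },
) as dag:
    BashOperator(
        task_id="run_report_script",
        # Trailing space stops Airflow from treating the .sh path as a Jinja template file
        bash_command="bash /path/to/nightly_report.sh ",
    )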

Steps for Running Airflow Locally

Now that you have a basic understanding of Airflow, this section walks you through the steps for Running Airflow Locally. There are a few prerequisites that you need to know about. For starters, you’re going to need some basic knowledge of Python, Bash, and Git.

More importantly, you’re going to need a decent understanding of some basic Airflow concepts. Then there are the system requirements: you will need Python 3 and pip installed on your machine. With those in place, here are the steps for Running Airflow Locally:


Step 1: Creating a Local Project

The first thing to do is set up the local project. Open the terminal and create the directory where your project is going to live. For the sake of simplicity, it’s easiest to store it in your home directory.

So, open up the terminal and run the following commands:

$ cd ~ # or ~/dev, ~/code, etc
$ mkdir localairflow
$ cd localairflow

Step 2: Installing the Dependencies

Now that you have specified the location, the next thing you need is a Python virtual environment for managing the project and its dependencies. There are several tools you can use, such as Anaconda or Virtualenv; if you prefer another one, that’s also an option.

To create an environment with Anaconda, use this:

conda create -n airflow python=3.9
conda activate airflow

Or, if you are using Virtualenv, use the following command:

$ pip install virtualenv
$ python3 -m virtualenv venv
$ source venv/bin/activate

By the way, if you haven’t yet installed Airflow, you can do this with the following command:

pip install -U apache-airflow

Now, you need to create a home folder for Airflow and point Airflow to it by setting the AIRFLOW_HOME environment variable. To do this, run the following:

mkdir -p ~/airflow/dags
export AIRFLOW_HOME=~/airflow
export PATH=$PATH:~/.local/bin

Once you run these commands, an airflow folder (with a dags subdirectory) is created in your home directory, and AIRFLOW_HOME tells Airflow to use it as its home.

Step 3: Running Airflow Locally

Now, you need to initialize Airflow’s metadata database, after which you will have the option of tweaking the generated configuration file as you see fit. On Airflow 2.x, here’s what you need to do (older 1.x releases used airflow initdb instead):

cd ~/airflow
airflow db init

If you want to control where Airflow stores elements like the airflow.cfg file, the log files, and other artifacts while Running Airflow Locally, you can also point its home at a subdirectory of the main project folder. Here’s how to do that:

$ export AIRFLOW_HOME=~/airflow/airflow_subdir

Now that you have everything set up, you are ready for Running Airflow Locally. Start the scheduler, then execute the webserver in a new terminal window. The commands are given below.

# Activate the airflow env if needed
conda activate airflow
# Start the scheduler in one terminal
airflow scheduler
# Then start the webserver in a second terminal
airflow webserver

Now you are Running Airflow Locally, and the web UI is available at localhost:8080. (On Airflow 2.x, you may first need to create a login user with the airflow users create command before you can sign in.)

You will have the option to trigger a few of the example DAGs while Running Airflow Locally. For those who don’t know, DAG stands for directed acyclic graph. Essentially, it describes a graph whose edges have a direction and which contains no cycles.

It’s essentially a pipeline with nodes, as shown below:

Diagram of an example DAG pipeline
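
To connect the diagram to code, here is a minimal sketch of such a pipeline: three tasks form the nodes, and the >> operator draws the directed edges between them. The DAG and task names are illustrative, and Airflow 2.x is assumed:

# A minimal DAG with three task nodes and directed edges (assumes Airflow 2.x).
# The DAG ID and task names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_local_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,            # trigger it manually from the UI
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extract'")
    transform = BashOperator(task_id="transform", bash_command="echo 'transform'")
    load = BashOperator(task_id="load", bash_command="echo 'load'")

    # The >> operator defines the directed edges: extract -> transform -> load
    extract >> transform >> load

Save a file like this under ~/airflow/dags (your AIRFLOW_HOME dags folder), and it will show up in the UI, where you can unpause and trigger it.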

When you are Running Airflow Locally without the scheduler, unpausing a DAG will not get any of its tasks to run. But since you have already launched the scheduler, as shown above, that shouldn’t be a problem.

That’s it! You have completed the setup for Running Airflow Locally.

Conclusion

In this article, you read about Apache Airflow, its key features, and why it is used for workflow management of Data Pipelines. You also learned the steps for Running Airflow Locally on localhost. When you first set up Airflow, it starts with an SQLite Database. However, since SQLite doesn’t allow for parallelization, you may outgrow it relatively quickly; if you want, you can always connect Airflow to another Database such as PostgreSQL or MySQL. While Airflow is great for creating pipelines, it’s generally not a suitable solution for non-technical teams.

Companies need to analyze their business data stored in multiple data sources, and that data needs to be loaded into a Data Warehouse to get a holistic view of it. Hevo Data is a No-code Data Pipeline solution that helps transfer data from 150+ data sources to the desired Data Warehouse. It fully automates the process of transforming and transferring data to a destination without writing a single line of code.

FAQ on Running Airflow Locally

Can I use Airflow locally?

Yes, you can install and run Apache Airflow on your local machine using pip. This setup allows you to develop and test workflows locally before deploying them to a production environment.

Can you run Airflow without Docker?

Yes, you can run Airflow without Docker by installing it directly on your system using pip. This requires setting up the necessary dependencies and configuring a database for Airflow to use.

Can I use Airflow for free?

Yes, Apache Airflow is an open-source project and is free to use. You can download, install, and run it without any licensing costs.

How do you run an Airflow container?

Run an Airflow container by pulling the official Apache Airflow Docker image and executing it with docker run. For a more complex, multi-container setup, including the web server, scheduler, and database, use Docker Compose.

Does Airflow need a database?

Airflow requires a database to store metadata about DAG runs, task instances, and configurations. Supported databases include SQLite (for testing), MySQL, and PostgreSQL.

Najam Ahmed
Technical Content Writer, Hevo Data

Najam specializes in leveraging data analytics to provide deep insights and solutions. With over eight years of experience in the data industry, he brings a profound understanding of data integration and analysis to every piece of content he creates.
