Organizations spend a significant amount of money managing their data. They use various tools and platforms to maintain a streamlined data flow and keep their business activities running efficiently. Data Pipelines now power the modern applications that companies rely on, and Apache Airflow is one of the most widely used workflow management platforms for building and maintaining them.
Developers use Airflow to create workflows and execute tasks related to Data Pipelines. As the number of backend tasks increases, it is necessary to make sure that enough resources remain available for the other tasks. Running Airflow Locally helps Developers create workflows and schedule and maintain their tasks.
Running Airflow Locally also allows Developers to test and create scalable applications using Python scripts. In this article, you will learn about the need for using Airflow and the steps for Running Airflow Locally. You will also read how to start Airflow and how it helps companies manage Data Pipelines and build scalable solutions.
What is Apache Airflow?
Apache Airflow is a popular open-source platform. At its core, it is a workflow management platform designed primarily for authoring, scheduling, and monitoring the workflows in Data Pipelines.
The project was initially launched at Airbnb back in 2014, as the company’s workflows began to get increasingly complex. With the help of Airflow, Airbnb was able to create and optimize its workflows, while monitoring them through a centralized interface.
Airflow was later donated to the Apache Software Foundation, where it passed through the Incubator program and became a top-level project. The platform itself is coded entirely in Python, and workflows are created as Python scripts.
Looking for the best ETL tools to connect your data sources? Rest assured, Hevo’s no-code platform helps streamline your ETL process. Try Hevo and equip your team to:
- Integrate data from 150+ sources (60+ free sources).
- Simplify data mapping with an intuitive, user-friendly interface.
- Instantly load and sync your transformed data into your desired destination.
Choose Hevo for a seamless experience and see why industry leaders like Meesho say, “Bringing in Hevo was a boon.”
Get Started with Hevo for Free
Key Features of Apache Airflow
Today, many companies around the globe use Airflow to manage their ML engineering, data, and software engineering pipelines. Apache Airflow was created around four core principles, which are listed below:
- Scalable: The modular architecture used by Airflow means that it can be scaled according to the needs of the organization. It uses a message queue to orchestrate an arbitrary number of workers.
- Elegant: Airflow pipelines are meant to be explicit and lean. Parameterization is built into its core using Jinja templating.
- Extensible: Airflow lets you define operators and extend its libraries to better suit your environment. Essentially, you can write your own operators or integrations.
- Dynamic: Since all Airflow pipelines are written in Python, you can also create dynamic pipelines, i.e., code that generates DAGs on the fly (a sketch follows this list).
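For example, here is a minimal sketch of what dynamic pipeline generation can look like, assuming Airflow 2.x; the table names and dag_ids are hypothetical placeholders used purely for illustration:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical list of source tables; one DAG is generated per table.
for table in ["orders", "customers", "payments"]:
    with DAG(
        dag_id=f"export_{table}",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        BashOperator(
            task_id=f"export_{table}",
            bash_command=f"echo 'exporting {table}'",  # placeholder task
        )
    # Expose each generated DAG at module level so the scheduler can discover it.
    globals()[f"export_{table}"] = dag
One file in the dags folder can therefore produce as many pipelines as your loop yields, which is the kind of flexibility a static cron setup can't offer.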
You can also take a look at how you can easily build data pipelines with Apache Airflow.
Need for Using Airflow
There are several different reasons why you would want to use Airflow. Here are some of the main reasons listed below:
- Great for extracting data: Airflow has a ton of integrations that you can use to optimize and run data engineering tasks.
- Switch out cron jobs: Cron jobs are quite hard to monitor. Instead of manually SSHing into a machine to figure out why one failed, you can see whether your code ran through a visual UI, and Airflow will send you a notification when a job fails (see the sketch after this list).
- Crawling data: You can also write tasks that crawl data from the internet at specific intervals and write it to your database. For instance, you can set periodic intervals to track your competitor's pricing or to download and store all comments on your Facebook page.
- Completely transform data: If you want to process your data further, you can also connect to external services using Airflow.
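As an illustration of the cron-replacement point above, the following is a hedged sketch of a DAG that takes over a nightly cron job, assuming Airflow 2.x; the script path and alert address are hypothetical, and the email alert only fires if SMTP is configured in airflow.cfg:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 2,                        # retry instead of failing silently
    "retry_delay": timedelta(minutes=5),
    "email": ["alerts@example.com"],     # hypothetical alert address
    "email_on_failure": True,            # requires SMTP settings in airflow.cfg
}

with DAG(
    dag_id="nightly_report",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",       # same cron syntax as the old crontab entry
    catchup=False,
    default_args=default_args,
) as dag:
    BashOperator(
        task_id="run_report",
        bash_command="python /opt/scripts/report.py",  # hypothetical script
    )
Compared with a crontab entry, you get retries, failure notifications, and a UI showing every past run of the job.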
Steps for Running Airflow Locally
Now that you have a basic understanding of Airflow, this section covers the steps for Running Airflow Locally. There are a few prerequisites you need to know about: for starters, you're going to need some basic knowledge of Python, Bash, and Git.
More importantly, you're going to need a decent understanding of some basic Airflow concepts. Then there are the system requirements: you need Python 3 and pip, and the Heroku CLI only if you later plan to deploy to Heroku. For Running Airflow Locally on your machine, the steps are listed below:
Step 1: Creating a Local Project
The first thing to do is set up the local project. Open the terminal and navigate to the directory where your project will live. For the sake of simplicity, it's easiest to keep it in your home directory.
So, open up the terminal and run the following commands:
$ cd ~ # or ~/dev, ~/code, etc
$ mkdir localairflow
$ cd localairflow
- The command cd ~ (or cd ~/dev, cd ~/code, etc.) changes the current directory to the user's home directory or a chosen subdirectory, which is where they want to work.
- The command mkdir localairflow creates a new directory named localairflow within the current directory.
- The command cd localairflow navigates into the newly created localairflow directory.
- These commands are typically used to set up a new workspace or project directory for organizing files related to a project, such as an Airflow setup.
- The sequence ensures that the user is in the correct location to start working on their project.
Step 2: Installing the Dependencies
Now that you have specified the location, the next thing you need is a Python virtual environment for managing the project and its dependencies. There are several tools you can use, such as Anaconda or Virtualenv; if you prefer another one, that's also an option.
To create an environment with Anaconda, use this:
conda create -n airflow python=3.9
conda activate airflow
Or, if you are using Virtualenv, use the following command:
$ pip install virtualenv
$ python3 -m venv venv
$ source venv/bin/activate
By the way, if you haven’t yet installed Airflow, you can do this with the following command:
pip install -U apache-airflow
Now, you need to create a home folder for Airflow and tell Airflow where it is. To do this, run the following:
mkdir -p ~/airflow/dags
export AIRFLOW_HOME=~/airflow
export PATH=$PATH:~/.local/bin
- The command mkdir -p ~/airflow/dags creates the Airflow directory structure, including the dags subdirectory, if it doesn't already exist.
- The command export AIRFLOW_HOME=~/airflow sets the AIRFLOW_HOME environment variable, which tells Airflow which directory to use as its home.
- The command export PATH=$PATH:~/.local/bin appends the ~/.local/bin directory to the system's PATH, allowing the user to run scripts or executables located there from any location in the terminal.
- These commands are typically used when configuring an environment for Apache Airflow, making it easier to manage workflows.
- Using environment variables streamlines references to directories and ensures that the necessary binaries are accessible on the command line.
Once you run these commands, the ~/airflow folder (with its dags subdirectory) is created and Airflow knows to treat it as its home directory.
Step 3: Running Airflow Locally
Now, you need to initialize Airflow's metadata database, after which you can tweak the generated configuration file as you see fit. To initialize it, here's what you need to do:
cd ~/airflow
airflow db init
(On older Airflow 1.10.x installations, the equivalent command is airflow initdb.)
If you want to set an environment variable to define where Airflow stores elements like the airflow.cfg file, the log files, and other things while Running Airflow Locally, you can also point it at a subdirectory of the main project folder. Here's how to do that:
$ export AIRFLOW_HOME=~/airflow/airflow_subdir
Now that you have everything set up, you are ready for Running Airflow Locally. Start the scheduler and the webserver, each in its own terminal window, with the commands given below.
# Activate the airflow env in each terminal if needed
conda activate airflow
airflow scheduler
# ...and in a second terminal window:
airflow webserver
Now you are Running Airflow Locally on localhost:8080.
You will have the option to trigger a few of the example DAGs while Running Airflow Locally. For those who don't know, DAG simply stands for directed acyclic graph. Essentially, it describes a graph whose edges have a direction and which contains no cycles.
When you are Running Airflow Locally without the scheduler, unpausing a DAG will not get any of its tasks to run. But since you already launched a scheduler above, that shouldn't be a problem.
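To verify the setup end to end, you can drop a small DAG file into the dags folder and trigger it from the UI. Below is a minimal sketch, assuming Airflow 2.x; the file name and dag_id are purely illustrative:
# Save as ~/airflow/dags/hello_local.py (hypothetical file name).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_local_airflow",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,   # no schedule: trigger it manually from the UI
    catchup=False,
) as dag:
    BashOperator(
        task_id="say_hello",
        bash_command="echo 'Airflow is running locally'",
    )
After saving the file, the DAG should appear in the UI at localhost:8080 after a short parsing delay, and you can unpause and trigger it manually.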
That’s it! You have completed the setup for Running Airflow Locally.
Benefits of Running Airflow Locally
- Quick Setup and Testing: Easily test workflows without relying on external infrastructure.
- Cost-Effective: No cloud or server costs while developing and experimenting.
- Faster Debugging: Troubleshoot and refine workflows directly in your local environment.
- Customization Flexibility: Modify configurations and plugins without restrictions.
- Offline Accessibility: Work on your workflows even without an active internet connection.
Conclusion
In this article, you read about Apache Airflow, its key features, and why it is used for workflow management of Data Pipelines. You also learnt the steps for Running Airflow Locally on localhost. When you first set up Airflow, it uses an SQLite database. However, since SQLite doesn't allow for parallelization, you may outgrow it relatively quickly; if you want, you can always connect Airflow to another database. While Airflow is great for creating pipelines, it's generally not a suitable solution for non-technical teams.
Companies need to analyze their business data stored in multiple data sources. The data needs to be loaded to the Data Warehouse to get a holistic view of the data. Hevo Data is a No-code Data Pipeline solution that helps to transfer data from 150+ data sources to desired Data Warehouse. It fully automates the process of transforming and transferring data to a destination without writing a single line of code.
Sign up for a 14-day free trial and simplify your data integration process. Check out the pricing details to understand which plan fulfills all your business needs.
FAQ on Running Airflow Locally
1. Can I use Airflow locally?
Yes, you can install and run Apache Airflow on your local machine using pip. This setup allows you to develop and test workflows locally before deploying them to a production environment.
2. Can you run Airflow without Docker?
Yes, you can run Airflow without Docker by installing it directly on your system using pip. This requires setting up the necessary dependencies and configuring a database for Airflow to use.
3. Can I use Airflow for free?
Yes, Apache Airflow is an open-source project and is free to use. You can download, install, and run it without any licensing costs.
4. How do you run an Airflow container?
Run an Airflow container by pulling the official Apache Airflow Docker image and executing it with docker run. For a more complex, multi-container setup, including the web server, scheduler, and database, use Docker Compose.
5. Does Airflow need a database?
Airflow requires a database to store metadata about DAG runs, task instances, and configurations. Supported databases include SQLite (for testing), MySQL, and PostgreSQL.
Najam specializes in leveraging data analytics to provide deep insights and solutions. With over eight years of experience in the data industry, he brings a profound understanding of data integration and analysis to every piece of content he creates.