Airflow is a community-built platform for programmatically authoring, scheduling, and monitoring workflows. It’s essentially a more capable version of cron: it can not only schedule jobs but also run tasks in parallel, manage and monitor individual tasks, and interface with other platforms and tools such as Google Cloud and StatsD.
It’s rapidly gaining traction in data engineering and ETL workflow coordination; in a nutshell, it automates scripts to complete tasks. Docker is a containerization technology that packages your application and all of its dependencies into a Docker container, ensuring that your program runs consistently in any environment.
Running Airflow in Docker is much easier than running it natively on Windows, because Docker saves the time otherwise needed to install the dependencies required for running data pipelines.
In this tutorial article, you will walk through the process of running Airflow in Docker with a detailed explanation. Before diving into the process, it helps to understand Airflow and Docker separately.
What is Docker?
Docker is a popular open-source platform that allows software programs to operate in a portable and uniform environment. Docker uses containers to create segregated user-space environments that share file and system resources at the operating system level. Containerization uses a fraction of the resources of a typical server or virtual machine.
Main Advantages of Docker
- Portability: Docker guarantees that your apps run the same way in any environment, because each program and all of its dependencies are packaged inside the Docker container.
- Fast Deployment: Docker can reduce deployment time to seconds. This is because it spawns a container for each process and does not start an operating system.
- Scalability: Docker scales faster and more reliably than virtual machines (as well as traditional servers, which lack a considerable degree of scalability of any kind). Docker’s scalability is critical if you’re a company that wants to handle tens of thousands or hundreds of thousands of users with your apps.
- Isolation: A Docker container that hosts one of your apps also includes any supporting software the application requires. If other containers hold apps that need different versions of the same supporting software, that is not a problem, because the containers are completely self-contained.
- Performance: Containers make it possible to allocate a host server’s limited resources more efficiently. Indirectly, this translates to greater performance for containerized programs, especially as the demand on the server increases and resource distribution optimization becomes more critical.
Looking for the best ETL tools to connect your data sources? Rest assured, Hevo’s no-code platform helps streamline your ETL process. Try Hevo and equip your team to:
- Integrate data from 150+ sources (60+ free sources).
- Simplify data mapping with an intuitive, user-friendly interface.
- Instantly load and sync your transformed data into your desired destination.
Choose Hevo for a seamless experience and see why industry leaders like Meesho say, “Bringing in Hevo was a boon.”
What is Apache Airflow?
Apache Airflow is a data pipeline management system developed by Airbnb. In 2014, Airbnb launched it as an open-source project to aid in the management of the company’s batch data pipelines.
It has since grown into one of the most widely used open-source workflow management tools in data engineering. Because Apache Airflow is written in Python, it offers a lot of flexibility and reliability. Workflow management tasks like job tracking and platform configuration are easier thanks to its intuitive and powerful user interface. And because workflows are defined as code, users can run whatever code they wish at each phase of the process.
Airflow can be used for almost any batch data pipeline, and there are several documented use cases, the most popular of which are Big Data projects. According to Airflow’s GitHub repository, some of the most common use cases are:
- Creating a Data Studio dashboard using Airflow and Google BigQuery.
- Airflow is being used to help construct and manage a data lake on AWS.
- Airflow is being used to help upgrade production systems while reducing downtime.
How is it different compared to Cron?
Apache Airflow has replaced cron for many teams because of a number of factors:
- Building a relationship between jobs in cron is a pain, whereas it’s as simple as writing Python code in Airflow.
- Cron needs outside assistance to log, track, and handle tasks. Airflow offers a user interface for tracking and monitoring workflow execution.
- Cron jobs cannot be rerun unless they are explicitly re-specified, whereas Airflow keeps track of all completed jobs.
- Another distinction is that Airflow is easily extendable, whereas Cron is not.
Running Airflow in Docker
Completing this tutorial requires some knowledge of Airflow concepts. Here are some key terms used in Airflow.
DAG (Directed Acyclic Graph)
Workflows are represented by Directed Acyclic Graphs, which are essentially the tasks to run along with their dependencies. Tasks are represented by vertices, and dependencies by edges. The graph is acyclic because a workflow cannot loop back on itself; it has to reach an end. Airflow provides a Python class for creating DAGs: you simply instantiate an object of airflow.models.dag.DAG, as in the sketch below.
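The following sketch is illustrative only: the DAG id, schedule, and start date are placeholder values, not anything Airflow requires.

from datetime import datetime
from airflow import DAG

# Illustrative DAG object: runs once a day; id and dates are placeholders.
example_dag = DAG(
    dag_id='example_workflow',
    schedule_interval='@daily',
    start_date=datetime(2022, 1, 1),
    catchup=False,  # do not backfill runs before the start date
)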
Operators
So far you have seen that DAGs describe the workflow for running Airflow in Docker. What about the tasks themselves? This is where operators come in: operators define the tasks to be executed. Airflow provides a wide range of operators, including:
- PythonOperator
- EmailOperator
- JdbcOperator
- OracleOperator
If none of these fits your needs, Airflow also lets you write a custom operator, so you can still easily create, schedule, and monitor your tasks; see the sketch below.
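As an illustration only (the class name and message are made up, and this sketch targets Airflow 2.x; older 1.10-era images such as puckel/docker-airflow may additionally require the apply_defaults decorator), a custom operator is just a subclass of BaseOperator that implements execute():

from airflow.models import BaseOperator

class HelloOperator(BaseOperator):
    """Hypothetical custom operator that logs a greeting."""

    def __init__(self, name, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        # execute() is the method Airflow calls when the task runs.
        message = 'Hello, {}!'.format(self.name)
        self.log.info(message)
        return message

You would then use it like any built-in operator, for example HelloOperator(task_id='greet', name='Airflow', dag=dag).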
Now you know the fundamentals of Airflow and you can start running Airflow in Docker.
Docker Setup
Docker setup needs to be done carefully for running Airflow in Docker. First, install Docker and Docker Compose. In this article, you will use the puckel/docker-airflow repository, an automated Docker build of Airflow; this pre-made image makes it much easier to run Airflow DAGs in Docker. If you want more information about the repository, you can check it out on Puckel’s GitHub. To obtain the Docker image, run the following command:
docker pull puckel/docker-airflow
Because you are using the ready-made Puckel container, you don’t need to write a docker-compose file yourself. Docker Compose lets you run multiple containers, and it needs a YAML file that configures your application’s services for running Airflow in Docker; in this case, the docker-compose-CeleryExecutor.yml file contains the configuration for the webserver, scheduler, worker, and so on. You can now run the container with the following command:
docker run -d -p 8080:8080 puckel/docker-airflow webserver
The Puckel image uses the SequentialExecutor by default if you don’t specify an executor type. For other executors, you need to use the corresponding Compose file, for example:
docker-compose -f docker-compose-CeleryExecutor.yml up -d
You can also start the container with some example DAGs preloaded:
docker run -d -p 8080:8080 -e LOAD_EX=y puckel/docker-airflow
After the container starts, Airflow will be running on your localhost. You can check it by simply going to http://localhost:8080/admin/.
Create your own DAG
You have already seen the example DAGs on your localhost. DAGs are central to running Airflow in Docker. A DAG must be defined in a Python file, which has several components, including the DAG definition, operators, and their relationships. After creating the file, you need to add it to the DAGs folder in the Airflow directory. If you can’t find the DAGs folder, check the dags_folder setting in the airflow.cfg file, which is located in the Airflow home folder. You can make a simple DAG that schedules a task to print some sample text every day at 8:00 a.m.:
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def print_firstdag():
    # The returned string is visible in the task logs.
    return 'My First DAG from HevoData!'

# Run every day at 08:00; catchup=False skips backfilling runs before the start date.
dag = DAG('first_dag', description='HevoData Dag',
          schedule_interval='0 8 * * *',
          start_date=datetime(2022, 2, 24), catchup=False)

print_operator = PythonOperator(task_id='first_task',
                                python_callable=print_firstdag, dag=dag)

print_operator
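The DAG above has a single task, so the bare print_operator reference on the last line is enough. If you later add more tasks, you express their order with the >> operator; the second task below is hypothetical and shown only to illustrate dependencies:

# Hypothetical second task, reusing the callable and DAG defined above.
second_operator = PythonOperator(task_id='second_task',
                                 python_callable=print_firstdag, dag=dag)

# first_task must finish before second_task starts.
print_operator >> second_operator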
After creating the DAG, you just need to run it in the Docker container. For this, you link your local machine with the container by passing the path of your DAGs directory. For the sake of simplicity, the command below uses the default path, /home/user/airflow/dags, for running Airflow in Docker.
docker run -d -p 8080:8080 -v /home/user/airflow/dags:/usr/local/airflow/dags puckel/docker-airflow webserver
Now you have linked your container and local machine, but you still don’t know the name of the container to run. Don’t worry, it is just a simple command:
docker ps
After noting the container name, substitute it into the command below:
docker exec -ti <container name> bash
This starts a command line inside your Docker container.
In the Airflow UI, the new DAG is created in a paused state by default, so you need to switch it on for it to run. You’ll use the UI this time because it’s more convenient when running Airflow in Docker.
That’s it: your DAG is now scheduled. You can also run it immediately by triggering it from the UI. This concludes the steps involved in running Airflow in Docker seamlessly.
Real-life Use Cases of Running Airflow in Docker
- Data Pipelines for Analytics: Running Airflow in Docker helps organizations set up and manage scalable data pipelines that automate the process of extracting, transforming, and loading data for analytics.
- Development and Testing Environments: Developers use Docker to quickly spin up isolated Airflow environments for testing and debugging without affecting their production systems.
- Continuous Integration and Deployment (CI/CD): Airflow in Docker is commonly used to automate deployment pipelines, ensuring that code is automatically tested, built, and deployed in a consistent and reproducible environment.
- Multi-Environment Setup: Companies use Docker to run multiple Airflow instances in different environments (development, staging, production) with ease, all using the same underlying infrastructure.
- Cloud Migration: Businesses migrating to the cloud can run Airflow in Docker containers, enabling a consistent and portable setup across on-premises and cloud environments.
Conclusion
To sum up, running Airflow in Docker relieves you of the burden of managing, maintaining, and deploying all of Airflow’s dependencies. To run Airflow in Docker, you download Docker and Docker Compose, start your container, and then create your own DAGs and schedule or trigger their tasks. Now you can create your own DAGs and run them in Docker.
Extracting complex data from a diverse set of data sources to carry out an insightful analysis can be challenging, and this is where Hevo saves the day! Hevo offers a faster way to move data from 100+ Data Sources including Databases or SaaS applications into a destination of your choice or a Data Warehouse to be visualized in a BI tool. Hevo is fully automated and hence does not require you to code.
Hevo Data, a No-code Data Pipeline provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of Desired Destinations, with a few clicks. Hevo Data with its strong integration with 150+ sources (including 60+ free sources) allows you to not only export data from your desired data sources & load it to the destination of your choice, but also transform & enrich your data to make it analysis-ready so that you can focus on your key business needs and perform insightful analysis using BI tools.
Want to take Hevo for a spin?
Sign up for a 14-day free trial and simplify your data integration process. Check out the pricing details to understand which plan fulfills all your business needs.
Frequently Asked Questions
1. How to run a Docker container in Airflow?
You can run a Docker container in Airflow by using the DockerOperator, which allows you to execute Docker containers as part of your Airflow tasks. Simply specify the image and other container settings in your Airflow DAG.
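As a rough sketch only (the image, command, and task id are placeholders, and the import path assumes the apache-airflow-providers-docker package on Airflow 2.x), a DockerOperator task might look like this:

from airflow.providers.docker.operators.docker import DockerOperator

# Hypothetical task that runs a short-lived container as part of a DAG
# (assumes a DAG object named `dag` is already defined elsewhere).
run_in_container = DockerOperator(
    task_id='run_in_container',
    image='python:3.9-slim',
    command='python -c "print(42)"',
    docker_url='unix://var/run/docker.sock',  # default local Docker socket
    dag=dag,
)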
2. Can you run Airflow locally?
Yes, you can run Airflow locally by setting up a local Airflow instance on your computer. You can use tools like Docker or install it directly to run and test your workflows on your local machine.
3. Does Airflow run on Linux?
Yes, Airflow runs on Linux. It’s compatible with Linux distributions and is commonly deployed on Linux servers or virtual machines for production use.
Subhan Hagverdiyev is an expert in data integration and analytics, with a robust understanding of data processes and advanced analytical methods. His extensive hands-on experience allows him to tackle complex data challenges head-on, delivering actionable insights that drive critical business decisions. Subhan simplifies intricate data systems, ensuring clients benefit from optimized support and innovative, data-driven solutions.