Airflow is a community-built platform for authoring, scheduling, and monitoring processes programmatically. It’s essentially a more capable version of cron in that it can not only schedule jobs, but also perform processes in parallel, manage/monitor individual tasks, and interface with other platforms/tools like Google Cloud and StatsD.
It’s rapidly gaining traction in data engineering and ETL workflow coordination. In a nutshell, it automates the scripts that make up a workflow. Docker is a containerization technology that packages your application and all of its dependencies in a Docker container, so that your program runs the same way in any environment.
Running Airflow in Docker is much easier than running it on Windows without Docker, because Docker saves the time you would otherwise spend installing the dependencies that your data pipelines require.
In this tutorial, you will walk through the process of running Airflow in Docker step by step. Before diving into the process, you will first look at Airflow and Docker separately.
Table of Contents
- What is Docker?
- What is Apache Airflow?
- Running Airflow in Docker
What is Docker?
Docker is a popular open-source platform that allows software programs to operate in a portable and uniform environment. Docker uses containers to create segregated user-space environments that share file and system resources at the operating system level. Containerization uses a fraction of the resources of a typical server or virtual machine.
Main Advantages of Docker
- Portability: With Docker, your apps run the same way in any environment. This advantage comes from the fact that each program and its dependencies are packaged together in the Docker container.
- Fast Deployment: Docker can reduce deployment time to seconds. This is because it spawns a container for each process and does not start an operating system.
- Scalability: Docker scales faster and more reliably than virtual machines or traditional servers, which are much harder to scale. Docker’s scalability is critical if you’re a company that wants to serve tens or hundreds of thousands of users with your apps.
- Isolation: Any supporting software that your application requires is also included in a Docker container that hosts one of your apps. It’s not an issue if other Docker containers include apps that require different versions of the same supporting software because the Docker containers are completely self-contained.
- Performance: Containers make it possible to allocate a host server’s limited resources more efficiently. Indirectly, this translates to greater performance for containerized programs, especially as the demand on the server increases and resource distribution optimization becomes more critical.
What is Apache Airflow?
Apache Airflow is a data pipeline management system developed by Airbnb. In 2014, Airbnb launched it as an open-source project to aid in the management of the company’s batch data pipelines.
It has since grown into one of the most widely used open-source workflow management tools in data engineering. Because Apache Airflow is written in Python, it offers a lot of flexibility. Workflow management tasks like job tracking and platform configuration are easier because of its intuitive and powerful user interface. Because workflows are defined in code, users can write any code they wish to run at each step of the process.
Airflow can be used for almost any batch data pipeline, and there are several documented use cases, the most popular of which are Big Data projects. Some of the use cases listed in Airflow’s GitHub repository are:
- Creating a Data Studio dashboard using Airflow and Google BigQuery.
- Airflow is being used to help construct and manage a data lake on AWS.
- Airflow is being used to help with the upgrading of production while reducing downtime.
How is it different compared to Cron?
Apache Airflow has replaced cron in many workflows for a number of reasons:
- Building a relationship between jobs in cron is a pain, whereas it’s as simple as writing Python code in Airflow.
- Cron needs outside assistance to log, track, and handle tasks. Airflow offers a user interface for tracking and monitoring workflow execution.
- Cron tasks cannot be re-run unless you explicitly schedule them again. Airflow keeps track of all completed jobs.
- Another distinction is that Airflow is easily extendable, whereas Cron is not.
Running Airflow in Docker
To complete this tutorial, you need some knowledge of Airflow concepts. Here are some key terms that are used in Airflow.
DAG (Directed Acyclic Graph)
Workflows are represented by Directed Acyclic Graphs, which are essentially the tasks to run along with their dependencies. The tasks are represented by vertices, whereas the dependencies are represented by edges. The graph is called acyclic because it contains no loops: no task can depend, directly or indirectly, on itself, so the workflow always has an end. Airflow has a Python class for creating DAGs; you just need to instantiate an object from airflow.models.dag.DAG.
You have seen so far that DAGs describe the workflow. What about the tasks? This is where operators come in: operators define the tasks to be executed. A wide range of operators is available in Airflow, such as the PythonOperator used later in this tutorial.
If none of the built-in operators fits your needs, you can also write a custom operator, which Airflow will schedule and monitor like any other task.
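To make the idea of vertices, edges, and acyclicity concrete, here is a minimal sketch that uses only Python's standard library (graphlib, Python 3.9+) rather than Airflow itself. The task names are hypothetical and are not part of the original tutorial; the sketch only illustrates how a dependency graph yields a run order and why a cycle makes that impossible.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names (assumptions for illustration only).
# Each key is a task; its value is the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def execution_order(deps):
    """Return one valid order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def has_cycle(deps):
    """Return True if the dependency graph contains a cycle."""
    try:
        execution_order(deps)
    except CycleError:
        return True
    return False


if __name__ == "__main__":
    print(execution_order(dependencies))
    # Making "extract" depend on "notify" closes a loop, so no order exists.
    cyclic = dict(dependencies, extract={"notify"})
    print(has_cycle(cyclic))
```

In the cyclic variant no task could ever start first, which is the situation the acyclic requirement rules out.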
Now you know the fundamentals of Airflow and you can start running Airflow in Docker.
Docker needs to be set up carefully before you can run Airflow in it. First, install Docker and Docker Compose. This article uses the puckel/docker-airflow repository, which provides an automated Docker build of Airflow, so you can use a pre-made image instead of building one yourself. If you want more information about this repo, you can check the Puckel repository. To obtain this Docker image, run the following command:
docker pull puckel/docker-airflow
Because you are using the ready-made puckel image, you don’t need to write a docker-compose file yourself. Docker Compose lets you run multiple containers together, and it reads a YAML file that configures your application’s services. In this case, for example, the docker-compose-CeleryExecutor.yml file contains the configuration of the webserver, scheduler, worker, and so on. Using the following command, you can now run the container:
docker run -d -p 8080:8080 puckel/docker-airflow webserver
Puckel will use SequentialExecutor by default if you don’t specify an executor type. For other executors, use the corresponding compose file, for example:
docker-compose -f docker-compose-CeleryExecutor.yml up -d
You can also start the container with some example DAGs loaded:
docker run -d -p 8080:8080 -e LOAD_EX=y puckel/docker-airflow
After you start the container, Airflow will be running on your localhost. You can check it by simply going to http://localhost:8080/admin/ in your browser.
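If you prefer to check from a script rather than a browser, the following is a minimal sketch using only Python's standard library. The function name webserver_is_up and the timeout value are assumptions made for illustration; the URL is the one given above.

```python
import http.client
import urllib.request


def webserver_is_up(url, timeout=5.0):
    """Return True if an HTTP GET to `url` succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    # URL taken from the tutorial; adjust the port if you mapped a different one.
    print(webserver_is_up("http://localhost:8080/admin/"))
```

The webserver can take a little while to start, so a False result right after starting the container does not necessarily mean something is wrong.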
Create your own DAG
You have already seen the example DAGs on your localhost. A DAG must be defined in a Python file, which has several components, including the DAG definition, the operators, and the relationships between them. After creating the file, you need to add it to the DAG folder in the Airflow directory. If you can’t find the DAG folder, you can look up its location in the airflow.cfg file, which is located in the Airflow home folder. The following simple DAG schedules a task that returns some sample text every day at 8:00 a.m.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def print_firstdag():
    # The returned string is recorded in the task log.
    return 'My First DAG from HevoData!'


# '0 8 * * *' means every day at 8:00 a.m.
dag = DAG('first_dag', description='HevoData Dag',
          schedule_interval='0 8 * * *',
          start_date=datetime(2022, 2, 24), catchup=False)

print_operator = PythonOperator(task_id='first_task',
                                python_callable=print_firstdag, dag=dag)

print_operator
After creating the DAG, you need to run it in a Docker container. To do this, you link a directory on your local machine with the container by giving the path of the DAGs directory on your machine. For the sake of simplicity, this example uses the default path, /home/user/airflow/dags.
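The schedule string '0 8 * * *' is a cron expression with five fields: minute, hour, day of month, month, and day of week. As a rough illustration of how such an expression maps to run times, here is a deliberately simplified sketch in plain Python. It is not Airflow's scheduler: the helper names are made up for this example, and it only understands '*' and single integers in each field.

```python
from datetime import datetime, timedelta


def _matches(field, value):
    """A field matches if it is '*' or equals the value (simplified cron)."""
    return field == "*" or int(field) == value


def next_run(expression, after):
    """Return the first minute strictly after `after` that matches `expression`.

    Simplified: each of the five fields must be '*' or a single integer.
    """
    minute, hour, day, month, weekday = expression.split()
    candidate = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    for _ in range(366 * 24 * 60):  # search at most one year, minute by minute
        cron_weekday = (candidate.weekday() + 1) % 7  # cron counts Sunday as 0
        if (_matches(minute, candidate.minute)
                and _matches(hour, candidate.hour)
                and _matches(day, candidate.day)
                and _matches(month, candidate.month)
                and _matches(weekday, cron_weekday)):
            return candidate
        candidate += timedelta(minutes=1)
    raise ValueError("no matching time found within one year")


if __name__ == "__main__":
    print(next_run("0 8 * * *", datetime(2022, 2, 24, 9, 30)))
```

Real cron expressions also allow ranges, lists, and step values, which this sketch leaves out on purpose.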
docker run -d -p 8080:8080 -v /home/user/airflow/dags:/usr/local/airflow/dags puckel/docker-airflow webserver
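If you start the container from a script, it can help to assemble the command as an argument list instead of a single string. The sketch below builds the same docker run command shown above; the function name docker_run_command and its parameters are assumptions for illustration, and the code only constructs and prints the command rather than executing it.

```python
import shlex


def docker_run_command(image, host_dags_dir,
                       container_dags_dir="/usr/local/airflow/dags",
                       port=8080, load_examples=False):
    """Build the argument list for `docker run` with a mounted DAGs folder."""
    args = [
        "docker", "run", "-d",
        "-p", f"{port}:8080",                          # host port : container port
        "-v", f"{host_dags_dir}:{container_dags_dir}",  # host path : container path
    ]
    if load_examples:
        args += ["-e", "LOAD_EX=y"]  # same switch as the example-DAGs command
    args += [image, "webserver"]
    return args


if __name__ == "__main__":
    command = docker_run_command("puckel/docker-airflow", "/home/user/airflow/dags")
    # shlex.join renders the list as a shell-safe string for display.
    print(shlex.join(command))
```

To actually launch the container, you could pass the list to subprocess.run(command, check=True), which avoids shell quoting problems when the DAGs path contains spaces.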
Now you have linked your container and local machine, but you still don’t know the name of the running container. You can look it up by listing the running containers with the docker ps command.
Once you have the name, substitute it in the command below:
docker exec -ti <container name> bash
This starts a command line inside your Docker container.
By default, a new DAG is created in a paused state, so you need to activate it before it can run. The UI is the most convenient way to do this: turn the DAG on in the Airflow UI.
That’s all: your DAG is now scheduled. You can also run it immediately by triggering it from the UI. This concludes the steps involved in running Airflow in Docker.
To sum up, running Airflow in Docker relieves you of the burden of managing, maintaining, and deploying all of the Airflow dependencies. To run Airflow in Docker, you install Docker and Docker Compose and start your container; after that, you can create your own DAG and either schedule its tasks or trigger it manually.