Data Engineering Pipelines play a vital role in managing the flow of business data for companies. Organizations spend a significant amount of money on developing and managing Data Pipelines, so they can get streamlined data to accomplish their daily activities with ease. Many tools in the market help companies manage their messaging and task distributions for scalability.
Apache Airflow is a workflow management platform that helps companies orchestrate their Data Pipeline tasks and save time. As data volumes and the number of backend tasks grow, scaling up and making resources available for all the tasks becomes a necessity. The Airflow Celery Executor makes it easier for developers to build a scalable application by distributing tasks across multiple machines. With the help of Airflow Celery, companies can queue many tasks for execution and have them picked up by whichever worker is free.
Apache Airflow Celery Executor is easy to use as it uses Python scripts and DAGs for scheduling, monitoring, and executing tasks. In this article, you will learn about Airflow Celery Executor and how it helps in scaling out and easily distributing tasks. Also, you will read about the architecture of the Airflow Celery Executor and how its process executes the tasks.
Introduction to Apache Airflow
Apache Airflow is an open-source workflow management platform to programmatically author, schedule, and monitor workflows. In Airflow, workflows are DAGs (Directed Acyclic Graphs) of tasks and are widely used to schedule and manage Data Pipelines. Airflow is written in Python, and workflows are created as Python scripts, so developers can use Python libraries when building them.
The Airflow workflow engine executes complex Data Pipeline jobs, ensuring that each task runs on time and in the correct order and gets the resources it requires. Airflow can run DAGs on a defined schedule or in response to external event triggers.
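To make the idea of "correct order" concrete, here is a minimal sketch that uses only the Python standard library, not the Airflow API. The task names and the `execution_order` helper are hypothetical; the point is only that a DAG of dependencies determines an order in which every task runs after the tasks it depends on.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each key is a task, each value is the set of
# upstream tasks that must finish before it can start.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def execution_order(deps):
    """Return one valid run order in which every task follows its upstream tasks."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)

# A cycle is not a valid DAG, so no execution order exists for it.
try:
    execution_order({"a": {"b"}, "b": {"a"}})
    cycle_rejected = False
except CycleError:
    cycle_rejected = True
```

In Airflow itself the same dependencies would be declared between operators inside a DAG definition; this sketch only illustrates why the graph must be acyclic.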
Key Features of Apache Airflow
Some of the main features of Apache Airflow are listed below.
- Open Source: Airflow is free to use and has a lot of active users and an interactive community. Its resources are easily available over the web.
- Robust Integrations: Airflow comes with ready-to-use operators and many plug-and-play integrations that let users run tasks on Google Cloud Platform, Amazon Web Services (AWS), Microsoft Azure, etc.
- Easy to Use: Airflow is written in Python, which means users can create workflows using Python scripts and import libraries that make the job easier.
- Interactive UI: Airflow also offers a web application that allows users to monitor the real-time status of running tasks and DAGs. Users can schedule and manage their workflows with ease.
Hevo Data, a No-code Data Pipeline, helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 150+ data sources (including 30+ free data sources) and involves a 3-step process: selecting the data source, providing valid credentials, and choosing the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.
Its completely automated pipeline delivers data in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.
Check out why Hevo is the Best:
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
- Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Introduction to the Airflow Celery Executor
Celery is a task queue that helps users scale and that integrates with other languages. It comes with the tools and support you need to run such a system in production.
Executors in Airflow are the mechanism by which task instances are run. Airflow comes with various executors, but one of the most widely used is the Celery Executor, which scales out by distributing the workload to multiple Celery workers that can run on different machines.
CeleryExecutor distributes tasks to its workers with the help of messages. The scheduler adds a message to the queue, and the Celery broker delivers that message to a Celery worker, which executes the task. If the assigned worker goes down due to a failure, Celery adapts and assigns the task to another worker.
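The hand-off to another worker can be pictured with a small standard-library sketch. This is a deliberately simplified model, not Celery's actual redelivery mechanism: the `dispatch` function, the worker names, and the round-robin choice of worker are all assumptions made for illustration.

```python
from collections import deque


def dispatch(tasks, workers):
    """Deliver each queued task to a live worker, requeueing it if the worker is down.

    `workers` maps a hypothetical worker name to True (alive) or False (down).
    Returns a dict mapping each task to the worker that ran it.
    """
    if not any(workers.values()):
        raise RuntimeError("no live workers available")

    queue = deque(tasks)        # stands in for the broker's message queue
    names = list(workers)
    assignments = {}
    turn = 0
    while queue:
        task = queue.popleft()
        worker = names[turn % len(names)]   # simple round-robin delivery
        turn += 1
        if workers[worker]:
            assignments[task] = worker      # a live worker runs the task
        else:
            queue.append(task)              # message returns to the queue
    return assignments


result = dispatch(["t1", "t2", "t3"], {"worker-a": True, "worker-b": False})
print(result)
```

Here every task ends up on `worker-a`, because messages first offered to the failed `worker-b` go back onto the queue and are delivered again.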
Airflow Celery Executor Setup
To set up the Airflow Celery Executor, you first need to set up a Celery backend using a message broker service such as RabbitMQ or Redis. After that, you need to change the airflow.cfg file so that the executor parameter points to CeleryExecutor and enter all the required configurations for it.
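As a rough illustration of the airflow.cfg change, the sketch below uses Python's `configparser` to produce the relevant fragment. The `executor`, `broker_url`, and `result_backend` keys are standard Airflow settings, but the host names, credentials, and the `build_celery_config` helper are placeholders assumed for this example; substitute the values for your own broker and database.

```python
import configparser
import io


def build_celery_config(broker_url, result_backend):
    """Build the airflow.cfg settings that switch Airflow to the CeleryExecutor."""
    # interpolation=None keeps any '%' characters in URLs from being interpreted.
    config = configparser.ConfigParser(interpolation=None)
    config["core"] = {"executor": "CeleryExecutor"}
    config["celery"] = {
        "broker_url": broker_url,
        "result_backend": result_backend,
    }
    return config


# Placeholder connection strings (assumptions, not real endpoints).
config = build_celery_config(
    broker_url="redis://redis-host:6379/0",
    result_backend="db+postgresql://user:password@db-host/airflow",
)

buffer = io.StringIO()
config.write(buffer)
fragment = buffer.getvalue()
print(fragment)
```

The printed fragment shows the `[core]` and `[celery]` entries to merge into an existing airflow.cfg; the same configuration must be present on every machine in the cluster.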
The prerequisites for setting up the Airflow Celery Executor are listed below:
- Airflow is installed on the local machine and properly configured.
- Airflow configuration should be homogeneous across the cluster.
- Before executing any Operators, the workers need to have the dependencies of those Operators met in that context, for example by having the required Python libraries installed and importable.
- The workers should have access to the DAGs directory – DAGS_FOLDER.
To start the Airflow Celery worker, use the command given below.
airflow celery worker
The service is now started and ready to receive tasks. As soon as a task is sent in its direction, the worker starts executing it.
To stop the worker, use the command given below.
airflow celery stop
You can also use Flower, a GUI web application built on top of Celery that allows users to monitor the workers. To install the Flower Python library, use the command given below in the terminal.
pip install 'apache-airflow[celery]'
You can start the Flower web server by using the command given below.
airflow celery flower
Airflow Celery Executor Architecture
The architecture of the Airflow Celery Executor consists of several components, listed below:
- Workers: They execute the tasks assigned to them by Airflow Celery.
- Scheduler: It is responsible for adding the necessary tasks to the queue.
- Database: It contains all the information related to the status of tasks, DAGs, Variables, connections, etc.
- Web Server: The HTTP server provides access to information related to DAG and task status.
- Celery: Queue mechanism.
- Broker: This component of the Celery queue stores commands for execution.
- Result Backend: It stores the status of all completed commands.
Now that you have understood the different components of the Airflow Celery Executor architecture, let’s look at how these components communicate with each other during the execution of a task.
- Web server –> Workers: It fetches all the task execution logs.
- Web server –> DAG files: It reveals the DAG structure.
- Web server –> Database: It fetches the status of the tasks.
- Workers –> DAG files: It reveals the DAG structure and executes the tasks.
- Workers –> Database: Workers get and store information about connection configuration, variables, and XCOM.
- Workers –> Celery’s result backend: It saves the status of tasks.
- Workers –> Celery’s broker: It stores all the commands for execution.
- Scheduler –> DAG files: It reveals the DAG structure and executes the tasks.
- Scheduler –> Database: It stores information on DAG runs and related tasks.
- Scheduler –> Celery’s result backend: It gets information about the status of completed tasks.
- Scheduler –> Celery’s broker: It puts the commands to be executed.
Task Execution Process of Airflow Celery
In this section, you will learn how tasks are executed by the Airflow Celery Executor. Initially, when a task begins to execute, two main processes are running:
- SchedulerProcess: It processes the tasks and runs them with the help of the CeleryExecutor.
- WorkerProcess: It observes the queue, waiting for new tasks to appear. It also has a WorkerChildProcess that waits for new tasks.
Two databases are also involved while executing a task:
- QueueBroker: It stores the commands for execution.
- ResultBackend: It stores the status of completed commands.
During execution, two other processes are created. These processes are listed below:
- RawTaskProcess: It is the process with user code.
- LocalTaskJobProcess: The logic for this process is described by the LocalTaskJob. Its job is to monitor the status of RawTaskProcess. All the new processes are started using TaskRunner.
The steps below describe how tasks are executed using the above processes and how the data and instructions flow between them.
- SchedulerProcess processes the tasks, and when it finds a task that needs to be executed, it sends it to the QueueBroker to be added to the queue.
- Simultaneously, the SchedulerProcess starts periodically querying the ResultBackend for the status of the task. When the QueueBroker becomes aware of the task, it sends the information about it to one WorkerProcess.
- When WorkerProcess receives the information, it assigns a single task to one WorkerChildProcess.
- After receiving a task, WorkerChildProcess executes the handling function for the task i.e. execute_command(). This creates the new process LocalTaskJobProcess.
- The logic of LocalTaskJobProcess is described by LocalTaskJob. It starts a new process, RawTaskProcess, using TaskRunner and monitors it.
- Both processes, RawTaskProcess and LocalTaskJobProcess, stop when they finish their work.
- Then, WorkerChildProcess notifies the main process, WorkerProcess, about the completion of the task and its availability for subsequent tasks.
- WorkerProcess then saves the status information back into the ResultBackend.
- Now, whenever SchedulerProcess requests the status from the ResultBackend, it receives the status information of the task.
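The flow described in the steps above can be sketched with standard-library threads and a queue. This is only a toy model under stated assumptions: an in-memory `queue.Queue` stands in for the QueueBroker, a dictionary stands in for the ResultBackend, threads stand in for worker processes, and the body of `execute_command` is a placeholder for user code. It does not use Airflow or Celery, and the scheduler here reads the results once at the end instead of polling periodically.

```python
import queue
import threading

queue_broker = queue.Queue()      # stands in for the QueueBroker
result_backend = {}               # stands in for the ResultBackend
backend_lock = threading.Lock()


def execute_command(task_id):
    """Placeholder for the user code that RawTaskProcess would run."""
    return f"{task_id} done"


def worker_process():
    """Wait for tasks on the queue, run each one, and save its status."""
    while True:
        task_id = queue_broker.get()
        if task_id is None:       # sentinel: no more work for this worker
            break
        try:
            execute_command(task_id)
            state = "success"
        except Exception:
            state = "failed"
        with backend_lock:
            result_backend[task_id] = state


def scheduler_process(task_ids, worker_count=2):
    """Queue the tasks, let the workers drain the queue, then read the statuses."""
    workers = [threading.Thread(target=worker_process) for _ in range(worker_count)]
    for worker in workers:
        worker.start()
    for task_id in task_ids:
        queue_broker.put(task_id)
    for _ in workers:
        queue_broker.put(None)    # one sentinel per worker, queued after all tasks
    for worker in workers:
        worker.join()
    with backend_lock:
        return dict(result_backend)


statuses = scheduler_process(["extract", "transform", "load"])
print(statuses)
```

Because the queue is first-in, first-out and the sentinels are added last, every task is picked up by some worker before the workers shut down, whichever worker happens to take it.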
In this article, you learned about Apache Airflow, Airflow Celery Executor, and how it helps companies achieve scalability. Also, you read about the architecture of Airflow Celery Executor and its task execution process. Airflow Celery uses message brokers to receive tasks and distributes those on multiple machines for higher performance.
Companies need to analyze their business data stored in multiple data sources. The data needs to be loaded to the Data Warehouse to get a holistic view of the data. Hevo Data is a No-code Data Pipeline solution that helps to transfer data from 150+ sources to the desired Data Warehouse. It fully automates the process of transforming and transferring data to a destination without writing a single line of code.
Want to take Hevo for a spin? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand.
Share your experience of learning about Airflow Celery Executor in the comments section below!