Airflow Parallelism 101: A Comprehensive Guide

In this blog, you’ll focus more on the Implementation of Apache Airflow Parallelism. If you’re new to Apache Airflow, the world of Executors can be confusing. Even if you’re a seasoned user in charge of 20+ DAGs, knowing which Executor is best for your use case at any given time isn’t easy – especially as the OSS project (and its utilities) evolves.

This guide will accomplish three goals: Executors should be contextualized with general Apache Airflow fundamentals. And giving some information about the two most popular Executors: Celery, Kubernetes, and finally the Apache Airflow Parallelism.

Table of Contents

Introduction to Apache Airflow

Apache Airflow is an open-source Batch-Oriented pipeline-building framework for developing and monitoring data workflows. Airbnb founded Apache Airflow in 2014 to address big data and complex Data Pipeline issues. Using a built-in web interface, they wrote and scheduled processes as well as monitored workflow execution. Because of its growing popularity, the Apache Software Foundation adopted the Apache Airflow project.

By leveraging some standard Python framework features, such as data time format for task scheduling, Apache Airflow enables users to efficiently build scheduled Data Pipelines. It also includes a slew of building blocks that enable users to connect the various technologies found in today’s technological landscapes.

Another useful feature of Apache Airflow is its Backfilling Capability, which allows users to easily reprocess previously processed data. This feature can also be used to recompute any dataset after modifying the code.

Key Features of Apache Airflow

Dynamic: Airflow pipelines are written in Python and can be generated dynamically. This allows for the development of code that dynamically instantiates pipelines.
Extensible: You can easily define your operators and executors, and you can extend the library to fit the level of abstraction that works best for your environment.
Elegant: Short and to the point, airflow pipelines. The powerful Jinja templating engine, which is built into the core of Airflow, is used to parameterize your scripts.
Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. The airflow is ready to continue expanding indefinitely.

Tired of dealing with Airflow’s complex DAG management and parallelism issues? Simplify your data pipeline with Hevo’s automated ETL platform. Hevo handles concurrent data processing effortlessly, freeing you from the hassle of managing parallel workflows.

Why Consider Hevo

No-Code Setup: Easily configure data pipelines without writing any code.
Real-Time Sync: Enjoy continuous data updates for accurate analysis.
Reliable & Scalable: Handle growing data volumes with Hevo’s robust infrastructure.

Experience smooth scalability and streamlined data operations—no Airflow headaches required.

Get Started with Hevo for Free

Apache Airflow Parallelism

Numerous parameters influence the performance of Apache Airflow. Tuning these settings can affect DAG parsing and Task Scheduling Performance, Apache Airflow Parallelism in your Airflow environment, and other factors. Here are a few

Configuration of the Environment
- Airflow Parallelism: Core Settings
- Airflow Parallelism: Configuration of the Scheduler

Apache Airflow Parallelism has so many knobs at different levels because, as an agnostic orchestrator, it is used for a wide range of use cases. Apache Airflow administrators or DevOps engineers may adjust scaling parameters at the environment level to ensure that their supporting infrastructure is not overloaded, whereas DAG authors may adjust scaling parameters at the DAG or task level to ensure that their pipelines do not overwhelm external systems.

Configuration of the Environment

Environment-level settings affect your entire Airflow environment (all DAGs). They all have default values that can be overridden by modifying your airflow.cfg file or setting the appropriate environment variable. In general, the Apache Airflow Parallelism contains all of the default values. In the Apache Airflow UI, navigate to Admin > Configurations to view the current values for an existing Apache Airflow environment.

Airflow Parallelism: Core Settings

The number of processes that can run concurrently and for how long are controlled by the core settings. The environment variables associated with all parameters in this section are formatted as AIRFLOW_CORE_PARAMETER_NAME.

Parallelism: This is the maximum number of tasks that can run at the same time in a single Airflow environment. If this setting is set to 32, for example, no more than 32 tasks can run concurrently across all DAGs. Consider this “maximum active tasks anywhere” If you notice that tasks are being held in the queue for extended periods, this is a value you should consider increasing. This is set to 32 by default.
max_active_tasks_per_dag: This setting (formerly dag_concurrency) determines the maximum number of tasks that can be scheduled at once, per DAG.
- Use this setting to prevent anyone DAG from taking up too many of the available slots from parallelism or your pools, which helps DAGs be good neighbors to one another. It is set to 16 by default.
- If you increase the number of resources available to Airflow (such as Celery workers or Kubernetes resources) and find that tasks are still not running as expected, you may need to increase both parallelism and max_active_tasks_per _dag.
max_active_runs_per_dag: The maximum number of active DAG Runs (per DAG) that the Airflow Scheduler can create at any given time is determined by this setting. A DAG Run in Airflow represents an instantiation of a DAG in time, similar to how a task instance represents an instantiation of a task.
- This parameter is most important when Airflow needs to catch up on missed DAG runs, also known as backfilling. When configuring this parameter, think about how you want to handle these scenarios. It is set to 16 by default.
dag_file_processor_timeout: The default is 50 seconds. This is the maximum amount of time a DagFileProcessor, which processes a DAG file, can run before it times out.
dagbag_import_timeout: This is the time in seconds that the dagbag can import DAG objects before timing out, which must be less than the value set for dag file processor timeout. If your DAG processing logs show timeouts, or if your DAG does not appear in the list of DAGs or the import errors, try increasing this value. You can also try increasing this value if your tasks aren’t executing because workers are required to fill the dagbag when tasks execute. It is set to 30 seconds by default.

Airflow Parallelism: Configuration of the Scheduler

The scheduler settings determine how the scheduler parses DAG files and generates DAG runs. For all parameters in this section, the associated environment variables are formatted as AIRFLOW_SCHEDULER_PARAMETER_NAME.

min_file_process_interval: Every min file process interval seconds, each DAG file is parsed. After this interval, DAG updates are reflected. A low value here will increase the CPU usage of the scheduler. If you have dynamic DAGs that were created by complex code, you should increase this value to avoid negative scheduler performance impacts. It is set to 30 seconds by default.
dag_dir_list_interval: In seconds, this is how frequently the DAGs directory is scanned for new files. The lower the value, the faster new DAGs are processed, but the greater your CPU usage. This is set to 300 seconds by default (5 minutes).
parsing_processes: (previously max threads) To parse DAGs, the scheduler can run multiple processes in parallel. This option specifies how many processes can run concurrently. Setting a value of 2x your available vCPUs is recommended. If you have a large number of DAGs, increasing this value can help you serialize them more efficiently.
- Please keep in mind that if you have multiple schedulers running, this value will apply to each of them. This value is set to 2 by default.
file_parsing_sort_mode: This specifies how the scheduler will list and sort DAG files to determine the parsing order. Set one of the following values: modified time, randomly seeded by the host, or alphabetical. The modified time setting is the default.
scheduler_heartbeat_sec: This option specifies how frequently the scheduler should run (in seconds) to initiate new tasks.
max_dagruns_to_create_per_loop: This is the maximum number of DAGs that can be created per scheduler loop. You can use this option to free up resources for task scheduling by lowering the value. The timer is set to 10 seconds by default.
max_tis_per_query: The batch size of queries to the metastore in the main scheduling loop is changed by this parameter. If the value is greater, you can process more tis per query, but your query may become overly complex, causing a performance bottleneck. 512 queries are the default value.
- This option specifies how frequently the scheduler should run (in seconds) to initiate new tasks.

Integrate Kafka to Redshift

Get a Demo Try it

Integrate Kafka to BigQuery

Get a Demo Try it

Integrate Kafka to Snowflake

Get a Demo Try it

Scaling and Executors

There are a few more settings to consider when scaling your Apache Airflow Parallelism environment, depending on which executor you use.

Celery Executor
Kubernetes Executor

Celery Executor

Standing workers are used by the Celery Executor to complete tasks. Scaling with the Celery executor entails deciding on the number and size of workers available to Apache Airflow. The greater the number of workers available in your environment, or the larger the size of your workers, the greater your capacity to run tasks concurrently.

You can also tune your worker concurrency (environment variable: AIRFLOW_CELERY_WORKER_CONCURRENCY), which determines how many tasks each Celery worker can run at once. The Celery Executor will run a maximum of 16 tasks concurrently by default. If you increase worker concurrency, you may need to allocate more CPU and/or memory to your workers.

Kubernetes Executor

For each task, the Kubernetes Executor starts a pod in a Kubernetes cluster. Because each task runs in its own pod, resources can be specified at the task level.

When tuning performance with the Kubernetes Executor, keep your Kubernetes cluster’s supporting infrastructure in mind. Many users will enable auto-scaling on their cluster to take advantage of Kubernetes’ elasticity.

You can also adjust the worker pods creation batch size (environment variable: AIRFLOW_KUBERNETES_WORKER_PODS_CREATION_BATCH_SIZE), which controls how many pods can be created per scheduler loop. The default is 1, but most users will want to increase this number for better performance, especially if you have multiple tasks running at the same time. The maximum value you can set is determined by the tolerance of your Kubernetes cluster.

Interested in mastering the Airflow Kubernetes Operator? Read our comprehensive guide to see how this tool can optimize your data pipelines within a Kubernetes setup.

Conclusion

This guide has bought you to accomplish three goals: Executors should be contextualized with general Apache Airflow fundamentals. And gave some information about the two most popular Executors: Celery, Kubernetes, and finally the Apache Airflow Parallelism.

Apache Airflow’s rich web interface allows you to easily monitor pipeline run results and debug any failures that occur. Because of its dynamic nature and flexibility, Apache Airflow has benefited many businesses today.

Companies need to analyze their business data stored in multiple data sources. The data needs to be loaded to the Data Warehouse to get a holistic view of the data. Hevo Data is a No-code Data Pipeline solution that helps to transfer data from 150+ sources to desired Data Warehouse. It fully automates the process of transforming and transferring data to a destination without writing a single line of code.

Frequently Asked Questions

1. What is parallelism in Airflow?

Parallelism in Airflow refers to the ability to execute multiple tasks simultaneously within the same workflow (DAG) or across multiple workflows (DAGs).

2. How many DAGs can run in parallel in Airflow?

The number of DAGs that can run in parallel in Airflow is influenced by the global parallelism setting and the resources available (such as the number of worker nodes).

3. What is concurrency in Airflow?

Concurrency in Airflow refers to the maximum number of task instances that are allowed to run simultaneously.

Davor DSouza Research Analyst, Hevo Data

Davor DSouza is a data analyst with a passion for using data to solve real-world problems. His experience with data integration and infrastructure, combined with his Master's in Machine Learning, equips him to bridge the gap between theory and practical application. He enjoys diving deep into data and emerging with clear and actionable insights.

Introduction to Apache Airflow

Key Features of Apache Airflow

Apache Airflow Parallelism

Configuration of the Environment

Airflow Parallelism: Core Settings

Airflow Parallelism: Configuration of the Scheduler

Scaling and Executors

Celery Executor

Kubernetes Executor

Conclusion

Frequently Asked Questions

1. What is parallelism in Airflow?

2. How many DAGs can run in parallel in Airflow?

3. What is concurrency in Airflow?

Related Articles

Optimize your data integration with Hevo!

Related articles