In this blog, you’ll focus more on the Implementation of Apache Airflow Parallelism. If you’re new to Apache Airflow, the world of Executors can be confusing. Even if you’re a seasoned user in charge of 20+ DAGs, knowing which Executor is best for your use case at any given time isn’t easy – especially as the OSS project (and its utilities) evolves.
This guide will accomplish three goals: Executors should be contextualized with general Apache Airflow fundamentals. And giving some information about the two most popular Executors: Celery, Kubernetes, and finally the Apache Airflow Parallelism.
Table of Contents
Introduction to Apache Airflow
Apache Airflow is an open-source Batch-Oriented pipeline-building framework for developing and monitoring data workflows. Airbnb founded Apache Airflow in 2014 to address big data and complex Data Pipeline issues. Using a built-in web interface, they wrote and scheduled processes as well as monitored workflow execution. Because of its growing popularity, the Apache Software Foundation adopted the Apache Airflow project.
By leveraging some standard Python framework features, such as data time format for task scheduling, Apache Airflow enables users to efficiently build scheduled Data Pipelines. It also includes a slew of building blocks that enable users to connect the various technologies found in today’s technological landscapes.
Another useful feature of Apache Airflow is its Backfilling Capability, which allows users to easily reprocess previously processed data. This feature can also be used to recompute any dataset after modifying the code.
Key Features of Apache Airflow
- Dynamic: Airflow pipelines are written in Python and can be generated dynamically. This allows for the development of code that dynamically instantiates pipelines.
- Extensible: You can easily define your operators and executors, and you can extend the library to fit the level of abstraction that works best for your environment.
- Elegant: Short and to the point, airflow pipelines. The powerful Jinja templating engine, which is built into the core of Airflow, is used to parameterize your scripts.
- Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. The airflow is ready to continue expanding indefinitely.
Hevo Data, a No-code Data Pipeline helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 100+ data sources (including 30+ free data sources) like Asana and is a 3-step process by just selecting the data source, providing valid credentials, and choosing the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.
GET STARTED WITH HEVO FOR FREE[/hevoButton]
Its completely automated pipeline offers data to be delivered in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensure that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.
Check out why Hevo is the Best:
SIGN UP HERE FOR A 14-DAY FREE TRIAL
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
- Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Apache Airflow Parallelism
Numerous parameters influence the performance of Apache Airflow. Tuning these settings can affect DAG parsing and Task Scheduling Performance, Apache Airflow Parallelism in your Airflow environment, and other factors. Here are a few
Apache Airflow Parallelism has so many knobs at different levels because, as an agnostic orchestrator, it is used for a wide range of use cases. Apache Airflow administrators or DevOps engineers may adjust scaling parameters at the environment level to ensure that their supporting infrastructure is not overloaded, whereas DAG authors may adjust scaling parameters at the DAG or task level to ensure that their pipelines do not overwhelm external systems.
Configuration of the Environment
Environment-level settings affect your entire Airflow environment (all DAGs). They all have default values that can be overridden by modifying your airflow.cfg file or setting the appropriate environment variable. In general, the Apache Airflow Parallelism contains all of the default values. In the Apache Airflow UI, navigate to Admin > Configurations to view the current values for an existing Apache Airflow environment.
Airflow Parallelism: Core Settings
The number of processes that can run concurrently and for how long are controlled by the core settings. The environment variables associated with all parameters in this section are formatted as AIRFLOW_CORE_PARAMETER_NAME.
- Parallelism: This is the maximum number of tasks that can run at the same time in a single Airflow environment. If this setting is set to 32, for example, no more than 32 tasks can run concurrently across all DAGs. Consider this “maximum active tasks anywhere” If you notice that tasks are being held in the queue for extended periods, this is a value you should consider increasing. This is set to 32 by default.
- max_active_tasks_per_dag: This setting (formerly dag_concurrency) determines the maximum number of tasks that can be scheduled at once, per DAG.
- Use this setting to prevent anyone DAG from taking up too many of the available slots from parallelism or your pools, which helps DAGs be good neighbors to one another. It is set to 16 by default.
- If you increase the number of resources available to Airflow (such as Celery workers or Kubernetes resources) and find that tasks are still not running as expected, you may need to increase both parallelism and max_active_tasks_per _dag.
- max_active_runs_per_dag: The maximum number of active DAG Runs (per DAG) that the Airflow Scheduler can create at any given time is determined by this setting. A DAG Run in Airflow represents an instantiation of a DAG in time, similar to how a task instance represents an instantiation of a task.
- This parameter is most important when Airflow needs to catch up on missed DAG runs, also known as backfilling. When configuring this parameter, think about how you want to handle these scenarios. It is set to 16 by default.
- dag_file_processor_timeout: The default is 50 seconds. This is the maximum amount of time a DagFileProcessor, which processes a DAG file, can run before it times out.
- dagbag_import_timeout: This is the time in seconds that the dagbag can import DAG objects before timing out, which must be less than the value set for dag file processor timeout. If your DAG processing logs show timeouts, or if your DAG does not appear in the list of DAGs or the import errors, try increasing this value. You can also try increasing this value if your tasks aren’t executing because workers are required to fill the dagbag when tasks execute. It is set to 30 seconds by default.
Airflow Parallelism: Configuration of the Scheduler
The scheduler settings determine how the scheduler parses DAG files and generates DAG runs. For all parameters in this section, the associated environment variables are formatted as AIRFLOW_SCHEDULER_PARAMETER_NAME.
- min_file_process_interval: Every min file process interval seconds, each DAG file is parsed. After this interval, DAG updates are reflected. A low value here will increase the CPU usage of the scheduler. If you have dynamic DAGs that were created by complex code, you should increase this value to avoid negative scheduler performance impacts. It is set to 30 seconds by default.
- dag_dir_list_interval: In seconds, this is how frequently the DAGs directory is scanned for new files. The lower the value, the faster new DAGs are processed, but the greater your CPU usage. This is set to 300 seconds by default (5 minutes).
- parsing_processes: (previously max threads) To parse DAGs, the scheduler can run multiple processes in parallel. This option specifies how many processes can run concurrently. Setting a value of 2x your available vCPUs is recommended. If you have a large number of DAGs, increasing this value can help you serialize them more efficiently.
- Please keep in mind that if you have multiple schedulers running, this value will apply to each of them. This value is set to 2 by default.
- file_parsing_sort_mode: This specifies how the scheduler will list and sort DAG files to determine the parsing order. Set one of the following values: modified time, randomly seeded by the host, or alphabetical. The modified time setting is the default.
- scheduler_heartbeat_sec: This option specifies how frequently the scheduler should run (in seconds) to initiate new tasks.
- max_dagruns_to_create_per_loop: This is the maximum number of DAGs that can be created per scheduler loop. You can use this option to free up resources for task scheduling by lowering the value. The timer is set to 10 seconds by default.
- max_tis_per_query: The batch size of queries to the metastore in the main scheduling loop is changed by this parameter. If the value is greater, you can process more tis per query, but your query may become overly complex, causing a performance bottleneck. 512 queries are the default value.
- This option specifies how frequently the scheduler should run (in seconds) to initiate new tasks.
Scaling and Executors
There are a few more settings to consider when scaling your Apache Airflow Parallelism environment, depending on which executor you use.
Standing workers are used by the Celery Executor to complete tasks. Scaling with the Celery executor entails deciding on the number and size of workers available to Apache Airflow. The greater the number of workers available in your environment, or the larger the size of your workers, the greater your capacity to run tasks concurrently.
You can also tune your worker concurrency (environment variable: AIRFLOW_CELERY_WORKER_CONCURRENCY), which determines how many tasks each Celery worker can run at once. The Celery Executor will run a maximum of 16 tasks concurrently by default. If you increase worker concurrency, you may need to allocate more CPU and/or memory to your workers.
For each task, the Kubernetes Executor starts a pod in a Kubernetes cluster. Because each task runs in its own pod, resources can be specified at the task level.
When tuning performance with the Kubernetes Executor, keep your Kubernetes cluster’s supporting infrastructure in mind. Many users will enable auto-scaling on their cluster to take advantage of Kubernetes’ elasticity.
You can also adjust the worker pods creation batch size (environment variable: AIRFLOW_KUBERNETES_WORKER_PODS_CREATION_BATCH_SIZE), which controls how many pods can be created per scheduler loop. The default is 1, but most users will want to increase this number for better performance, especially if you have multiple tasks running at the same time. The maximum value you can set is determined by the tolerance of your Kubernetes cluster.
This guide has bought you to accomplish three goals: Executors should be contextualized with general Apache Airflow fundamentals. And gave some information about the two most popular Executors: Celery, Kubernetes, and finally the Apache Airflow Parallelism.
Apache Airflow’s rich web interface allows you to easily monitor pipeline run results and debug any failures that occur. Because of its dynamic nature and flexibility, Apache Airflow has benefited many businesses today.
Visit our Website to Explore Hevo
Companies need to analyze their business data stored in multiple data sources. The data needs to be loaded to the Data Warehouse to get a holistic view of the data. Hevo Data is a No-code Data Pipeline solution that helps to transfer data from 100+ sources to desired Data Warehouse. It fully automates the process of transforming and transferring data to a destination without writing a single line of code.
Want to take Hevo for a spin? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand.
Share your experience on learning the Apache Airflow Parallelism: A Comprehensive Guide in the comments section below!