Apache Airflow is a popular workflow management tool that provides functionality for organizing, scheduling, and monitoring workflows. It is open source, which makes it free to use and gives it strong community support. The platform is used to orchestrate workflows, and it is one of the tools most widely used and recommended by data engineers.
Airflow lets you visualize workflows and data pipelines, schedule Airflow jobs, check the current status of workflows, inspect the logs related to each workflow, and review the full code behind every workflow. It is a highly scalable, distributed solution that can connect to a wide variety of sources, which makes it flexible. These features help solve complex workflow and data pipeline problems efficiently.
One of its main features is the ability to schedule Airflow jobs. Airflow provides scheduling functions with parameters that can be adjusted to your requirements. This article is a guide to using the scheduler to schedule Airflow jobs.
What is Airflow?
Apache Airflow is a workflow management engine that specializes in executing complex data pipelines. Its main duty is to ensure that all the steps of a workflow are executed in an order that is defined before execution, and that every task gets the resources it needs to run efficiently.
Apache Airflow is an open-source, free-to-use platform for scheduling and executing complex workflows. Being open source, it has broad community support for designing workflow architectures, and it is considered one of the most powerful workflow management tools currently on the market.
Airflow makes use of DAGs, which stands for directed acyclic graphs. DAGs are used to structure workflows, and the main idea behind Airflow is that every workflow can be represented as code, as the minimal sketch below shows.
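To make that concrete, here is a minimal, hypothetical sketch of a workflow expressed as code: two placeholder tasks wired into a DAG, with the >> operator declaring that one must run before the other (the DAG id, task names, and commands are made up for illustration).

import datetime as dt

from airflow import DAG
from airflow.operators.bash import BashOperator

# A DAG is just a Python object that groups tasks and their dependencies.
with DAG(
    dag_id="hello_workflow",                 # hypothetical DAG id
    start_date=dt.datetime(2022, 2, 22),
    schedule_interval="@daily",
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting'")
    load = BashOperator(task_id="load", bash_command="echo 'loading'")

    # The >> operator defines an edge of the graph: extract runs before load.
    extract >> load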
Key features of Airflow
- Pure Python: Create your workflows using standard Python features, such as datetime objects for scheduling and loops to generate tasks dynamically (see the sketch after this list). This gives you complete flexibility when creating workflows.
- Useful UI: A robust and modern web application that lets you track, schedule, and manage your workflows. There’s no need to get your hands dirty with old cron-style interfaces. You always have complete visibility into the status of completed and ongoing tasks, as well as their logs.
- Robust Integrations: Many plug-and-play operators are available in Airflow, ready to execute your tasks on Google Cloud Platform, Amazon Web Services, Microsoft Azure, and a variety of other third-party services. As a result, Airflow is simple to integrate into existing infrastructure and extend to next-generation technologies.
- Easy to Use: A workflow can be deployed by anyone who knows Python. Apache Airflow has no restrictions on the types of pipelines you can create; you can use it to create machine learning models, transfer data, manage your infrastructure, and more.
- Open Source: Whenever you want to share an improvement, you can simply open a PR. That’s all there is to it; no stumbling blocks, no lengthy procedures. Many people use Airflow and are willing to share their experiences.
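To illustrate the “Pure Python” point above, the sketch below uses an ordinary Python loop to generate several similar tasks; the DAG id, table names, and commands are hypothetical placeholders.

import datetime as dt

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dynamic_tasks_example",          # hypothetical DAG id
    start_date=dt.datetime(2022, 2, 22),
    schedule_interval="@daily",
) as dag:
    # Generate one export task per table using an ordinary Python loop.
    for table in ["users", "orders", "events"]:
        BashOperator(
            task_id=f"export_{table}",
            bash_command=f"echo 'exporting {table}'",
        )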
Hevo eliminates the need for complex orchestration by offering a no-code, fully automated platform for data integration and pipeline management. With Hevo, you can bypass the steep learning curve of tools like Airflow and focus on deriving value from your data.
Why Hevo is a Better Alternative to Airflow
- Faster Time to Value: With pre-built connectors and automated processes, Hevo accelerates your data integration workflows compared to the lengthy setup of Airflow jobs.
- No-Code Simplicity: Unlike Airflow, Hevo lets you set up and manage pipelines without writing a single line of code.
- End-to-End Automation: Hevo handles data extraction, transformation, and loading seamlessly—no manual intervention needed.
- Real-Time Data Sync: Hevo ensures real-time data movement and transformations, which is challenging to implement with Airflow.
- Reliability Built-In: Advanced error handling and monitoring ensure your pipelines run smoothly without constant oversight.
GET STARTED WITH HEVO FOR FREE
How to Schedule Airflow Jobs?
The Airflow job scheduler is a tool that monitors the DAGs in Airflow and triggers those whose dependencies have been met. Once started, the scheduler runs continuously, monitoring and syncing the DAG folder.
In a production environment, the Airflow job scheduler is designed to run in the background as a persistent service. To start it, execute the airflow scheduler command; it will use the configuration specified in airflow.cfg.
The Airflow job scheduler runs jobs with a schedule_interval AFTER the start date, at the END of the period.
For example, if you run a DAG with a schedule_interval of one day, the run stamped 2022-02-22 will be triggered soon after 2022-02-22 23:59, that is, once the day it covers has ended. In other words, an Airflow job instance is started once the period it covers has ended.
To start the Airflow job scheduler, simply run the command:
airflow scheduler
DAG Runs
A DAG run is the instantiation of a DAG as an object, and it is the unit used for Airflow job scheduling.
A DAG may or may not have a schedule, and this schedule determines how DAG runs are created. schedule_interval is a DAG argument that accepts either a cron expression as a str or a datetime.timedelta object. The snippet below uses a cron preset; a timedelta example follows it.
dag = DAG(
    dag_id="02_daily_schedule",
    schedule_interval="@daily",
    start_date=dt.datetime(2022, 2, 22),
    …
)
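If you prefer a frequency-based interval over a cron expression, a minimal sketch using datetime.timedelta might look like the following (the dag_id is hypothetical); each run starts once its three-day period has ended.

import datetime as dt

from airflow import DAG

dag = DAG(
    dag_id="03_timedelta_schedule",              # hypothetical DAG id
    schedule_interval=dt.timedelta(days=3),      # run every three days
    start_date=dt.datetime(2022, 2, 22),
)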
DAG parameters
Alternatively, you can use one of these cron presets:
preset | meaning | cron |
---|---|---|
None | Don’t schedule; use exclusively for “externally triggered” DAGs | |
@once | Schedule once and only once | |
@hourly | Run once an hour at the beginning of the hour | 0 * * * * |
@daily | Run once a day at midnight | 0 0 * * * |
@weekly | Run once a week at midnight on Sunday morning | 0 0 * * 0 |
@monthly | Run once a month at midnight on the first day of the month | 0 0 1 * * |
@yearly | Run once a year at midnight of January 1 | 0 0 1 1 * |
Note: Use schedule_interval=None and not schedule_interval='None' when you don’t want to schedule your DAG.
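The presets are simply shorthand for the equivalent cron strings. As a rough illustration (the dag_ids below are hypothetical), the following two DAGs are scheduled identically:

import datetime as dt

from airflow import DAG

# "@weekly" and "0 0 * * 0" both mean midnight every Sunday.
weekly_preset = DAG(
    dag_id="weekly_with_preset",        # hypothetical DAG id
    schedule_interval="@weekly",
    start_date=dt.datetime(2022, 2, 22),
)

weekly_cron = DAG(
    dag_id="weekly_with_cron",          # hypothetical DAG id
    schedule_interval="0 0 * * 0",
    start_date=dt.datetime(2022, 2, 22),
)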
Your DAG will be instantiated once per schedule interval, and a DAG run entry will be created for each of those schedules.
Each DAG run has a state associated with it: running, failed, or success. This state helps the scheduler determine whether a task should be evaluated for submission.
Without this metadata at the DAG run level, the Airflow job scheduler would have far more work to do to figure out which tasks should be triggered, and it would slow to a crawl. It would also add processing overhead whenever the shape of a DAG changes, for example when new tasks are added.
DAG example
Let us create a DAG and walk through its parameters.
import datetime as dt
from pathlib import Path
import pandas as pd
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
dag = DAG(
    dag_id="01_unscheduled",
    start_date=dt.datetime(2022, 2, 22),
    schedule_interval=None,  # no schedule: this DAG runs only when triggered externally
)

# Download the raw events from the (local) events API into /data/events.json.
fetch_events = BashOperator(
    task_id="fetch_events",
    bash_command=(
        "mkdir -p /data && "
        "curl -o /data/events.json "
        "http://localhost:5000/events"
    ),
    dag=dag,
)

def _calculate_stats(input_path, output_path):
    """Calculates event statistics."""
    events = pd.read_json(input_path)
    stats = events.groupby(["date", "user"]).size().reset_index()
    Path(output_path).parent.mkdir(exist_ok=True)
    stats.to_csv(output_path, index=False)

# Aggregate the downloaded events into per-date, per-user counts.
calculate_stats = PythonOperator(
    task_id="calculate_stats",
    python_callable=_calculate_stats,
    op_kwargs={
        "input_path": "/data/events.json",
        "output_path": "/data/stats.csv",
    },
    dag=dag,
)

# fetch_events must finish before calculate_stats starts.
fetch_events >> calculate_stats
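Note that because this example sets schedule_interval=None, the scheduler will never trigger it on its own; you would start it manually from the Airflow UI or with the airflow dags trigger command, or give it one of the schedule values discussed above.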
Conclusion
Airflow is a popular open-source workflow management tool. It provides many features for managing and visualizing workflows, and one of the most prominent is the ability to schedule Airflow jobs using the Airflow job scheduler. This article walked through creating a DAG and scheduling Airflow jobs, along with the parameters you can use.
Airflow is a trusted tool that many companies use, thanks in part to it being open source. However, creating pipelines, installing them on your systems, and monitoring them is difficult in Airflow because it is entirely code-driven and requires considerable expertise to run properly. This is where a platform that builds data pipelines without any code helps: an automated data pipeline can replace fully manual solutions to reduce effort and maximize efficiency, and this is where Hevo comes into the picture. Hevo Data is a no-code data pipeline platform with 150+ pre-built integrations that you can choose from.
Sign up for a 14-day free trial and see the difference!
Share your experience of learning about the Airflow job scheduler in the comments section below.
Arsalan is a research analyst at Hevo and a data science enthusiast with over two years of experience in the field. He completed his B.tech in computer science with a specialization in Artificial Intelligence and finds joy in sharing the knowledge acquired with data practitioners. His interest in data analysis and architecture drives him to write nearly a hundred articles on various topics related to the data industry.