Apache Airflow is a popular workflow management tool that provides functionality for organizing, scheduling, and generating workflows. Being open source, it is free to use and backed by an active community. The platform is used to orchestrate workflows and is one of the tools most widely used and recommended by data engineers.
Airflow lets you visualize workflows and data pipelines, schedule Airflow Jobs, check the current status of workflows, inspect the logs related to each workflow, and view the underlying workflow code in detail. It is a highly scalable, distributed solution that can connect to a wide variety of sources, which makes it flexible. Together, these features help solve complex workflow and data pipeline problems efficiently.
One of its main features is the ability to schedule Airflow Jobs. Airflow provides functions whose parameters can be adjusted to your requirements. This article is a guide to using the scheduler to schedule Airflow Jobs.
What is Airflow?
Apache Airflow is a workflow management engine that specializes in executing complex data pipelines. Its main job is to ensure that every step of a workflow executes in the order defined before execution, and that every task gets the resources it needs to run efficiently.
Apache Airflow is an open-source, free-to-use platform for scheduling and executing complex workflows. As an open-source project, it enjoys broad community support for designing workflow architectures, and it is considered one of the most powerful workflow management tools currently on the market.
Airflow makes use of DAGs, which stands for Directed Acyclic Graphs. DAGs are used to structure workflows, and the main idea behind Airflow is that every workflow can be represented as code.
Features of Apache Airflow
- Ease of use: Deploying Airflow is easy, as it requires only a little knowledge of Python.
- Open Source: It is a free-to-use, open-source platform with a large base of active users.
- Good Integrations: It offers readily available integrations for platforms like Google Cloud, AWS, and many more.
- Standard Python for coding: Relatively little knowledge of Python is enough to create complex workflows.
- User Interface: Airflow's UI helps you monitor and manage workflows and gives you a view of the status of your tasks.
- Dynamic: Since Airflow is based on Python, anything you can do in Python can be done in an Airflow task.
- Highly Scalable: Airflow allows the execution of thousands of different tasks per day.
Key features of Airflow
- Pure Python: Create your workflows using standard Python features, such as date-time formats for scheduling and loops to generate tasks dynamically (see the sketch after this list). This gives you complete flexibility when creating workflows.
- Useful UI: A robust and modern web application that lets you track, schedule, and manage your workflows. There’s no need to get your hands dirty with old cron-style interfaces. You always have complete visibility into the status of completed and ongoing tasks, as well as their logs.
- Robust Integrations: Many plug-and-play operators are available in Airflow, ready to execute your tasks on Google Cloud Platform, Amazon Web Services, Microsoft Azure, and a variety of other third-party services. As a result, Airflow is simple to integrate into existing infrastructure and extend to next-generation technologies.
- Easy to Use: A workflow can be deployed by anyone who knows Python. Apache Airflow has no restrictions on the types of pipelines you can create; you can use it to create machine learning models, transfer data, manage your infrastructure, and more.
- Open Source: Whenever you want to share an improvement, you can simply open a PR. That’s all there is to it; no stumbling blocks, no lengthy procedures. Many people use Airflow and are willing to share their experiences.
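As a quick illustration of the “Pure Python” point above, here is a minimal sketch of generating tasks dynamically with an ordinary loop. The dag_id, table names, and echo commands are made up for illustration, assuming Airflow 2.x.

import datetime as dt

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal sketch: dag_id, table names, and commands are illustrative only.
with DAG(
    dag_id="dynamic_tasks_example",
    start_date=dt.datetime(2022, 2, 22),
    schedule_interval="@daily",
) as dag:
    # A plain Python loop generates one task per table name.
    for table in ["users", "orders", "events"]:
        BashOperator(
            task_id=f"export_{table}",
            bash_command=f"echo exporting {table}",
        )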
Hevo Data, a No-code Data Pipeline, helps you load data from any data source such as databases, SaaS applications, cloud storage, SDKs, and streaming services, and simplifies the ETL process. It supports 150+ data sources (including 40+ free data sources) such as Asana, and setup is a 3-step process: just select the data source, provide valid credentials, and choose the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without your having to write a single line of code.
GET STARTED WITH HEVO FOR FREE
Its completely automated pipeline delivers data in real time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that data is handled securely and consistently with zero data loss, and it supports different forms of data. The solutions provided are consistent and work with different BI tools as well.
How to Schedule Airflow Jobs?
The Airflow Job Scheduler is a tool that monitors the DAGs in Airflow and triggers those whose dependencies have been met. Once started, it runs continuously, monitoring and syncing the DAG folder.
The Airflow Job Scheduler is designed to run in the background as a persistent service in an Airflow production environment. To start it, you execute the airflow scheduler command, and it uses the configuration specified in airflow.cfg.
The Airflow Job Scheduler runs jobs with a schedule_interval AFTER the start date, at the END of the period.
E.g., if you run a DAG on a schedule_interval of one day, the run stamped 2022-02-22 will be triggered soon after 2022-02-22 23:59. In other words, the job instance is started once the period it covers has ended.
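As a rough sketch of this timing rule (the dag_id and command here are hypothetical, assuming Airflow 2.x):

import datetime as dt

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative DAG used only to show when runs are triggered.
dag = DAG(
    dag_id="interval_demo",                 # made-up dag_id
    start_date=dt.datetime(2022, 2, 22),    # first interval starts here
    schedule_interval="@daily",
)

notify = BashOperator(
    task_id="notify",
    bash_command="echo run for {{ ds }} executed",  # {{ ds }} is the run's logical date
    dag=dag,
)

# Timeline (illustrative):
#   interval 2022-02-22 00:00 -> 2022-02-23 00:00: run stamped 2022-02-22,
#       triggered shortly after the interval ends (just past 2022-02-22 23:59)
#   interval 2022-02-23 00:00 -> 2022-02-24 00:00: run stamped 2022-02-23, and so on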
To start the Airflow Job Scheduler, simply run the command:
airflow scheduler
DAG Runs
A DAG run is an instantiation of a DAG, represented as an object that is used for Airflow Job Scheduling.
A DAG may or may not have a schedule, and the schedule informs how DAG runs are created. schedule_interval is a DAG argument that accepts a cron expression as a str, or a datetime.timedelta object.
dag = DAG(
    dag_id="02_daily_schedule",
    schedule_interval="@daily",
    start_date=dt.datetime(2022, 2, 22),
    …
)
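Since schedule_interval also accepts a datetime.timedelta object, the same DAG could instead be scheduled at, say, a three-day interval. This is a sketch with an illustrative dag_id:

import datetime as dt

from airflow import DAG

# Sketch only: scheduling with a timedelta instead of a cron preset/expression.
dag = DAG(
    dag_id="03_timedelta_schedule",             # illustrative dag_id
    schedule_interval=dt.timedelta(days=3),     # run every three days after start_date
    start_date=dt.datetime(2022, 2, 22),
)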
DAG parameters
Alternatively, you can use one of these cron “presets”:
| Preset | Meaning | Cron expression |
| --- | --- | --- |
| None | Don’t schedule; use for exclusively “externally triggered” DAGs | |
| @once | Schedule once and only once | |
| @hourly | Run once an hour at the beginning of the hour | 0 * * * * |
| @daily | Run once a day at midnight | 0 0 * * * |
| @weekly | Run once a week at midnight on Sunday morning | 0 0 * * 0 |
| @monthly | Run once a month at midnight on the first day of the month | 0 0 1 * * |
| @yearly | Run once a year at midnight of January 1 | 0 0 1 1 * |
Note: Use schedule_interval=None and not schedule_interval='None' when you don’t want to schedule your DAG.
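To tie the presets and the note together, here is a small sketch (the dag_ids are illustrative): the @daily preset, its explicit cron equivalent, and an unscheduled DAG that passes None rather than the string 'None'.

import datetime as dt

from airflow import DAG

# "@daily" and the cron expression "0 0 * * *" produce the same schedule.
daily_preset = DAG(
    dag_id="daily_preset",
    schedule_interval="@daily",
    start_date=dt.datetime(2022, 2, 22),
)
daily_cron = DAG(
    dag_id="daily_cron",
    schedule_interval="0 0 * * *",
    start_date=dt.datetime(2022, 2, 22),
)

# Unscheduled: pass None (the Python value), not the string "None".
manual_only = DAG(
    dag_id="manual_only",
    schedule_interval=None,
    start_date=dt.datetime(2022, 2, 22),
)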
Your DAG will be instantiated once for each schedule interval, and a DAG run entry will be created for each of those intervals.
Each DAG run has a state associated with it, which can be running, failed, or success. This state helps the scheduler decide whether a task should be evaluated for submission.
Without this metadata at the DAG run level, the Airflow Job Scheduler would have much more work to do to figure out which tasks should be triggered, and it would slow to a crawl. It would also create more processing overhead when changing the shape of a DAG by adding new tasks.
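If you want to inspect DAG run states programmatically, one option is Airflow’s stable REST API in Airflow 2.x. The sketch below assumes the API is enabled, reachable at http://localhost:8080, and protected by basic auth; the URL, credentials, and dag_id are placeholders, and it uses the third-party requests library.

import requests

# A sketch, not a definitive client: assumes the Airflow 2.x stable REST API
# is enabled and basic auth is configured. URL, credentials, and dag_id are placeholders.
AIRFLOW_URL = "http://localhost:8080/api/v1"
DAG_ID = "01_unscheduled"

response = requests.get(
    f"{AIRFLOW_URL}/dags/{DAG_ID}/dagRuns",
    auth=("admin", "admin"),  # placeholder credentials
)
response.raise_for_status()

# Each DAG run carries an id and a state (e.g. running, success, failed).
for run in response.json()["dag_runs"]:
    print(run["dag_run_id"], run["state"])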
DAG example
Let us create a DAG and define its parameters and tasks.
import datetime as dt
from pathlib import Path

import pandas as pd

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

# Unscheduled DAG: schedule_interval=None means runs are only triggered manually.
dag = DAG(
    dag_id="01_unscheduled",
    start_date=dt.datetime(2022, 2, 22),
    schedule_interval=None,
)

# Fetch the raw events from the local events API and store them as JSON.
fetch_events = BashOperator(
    task_id="fetch_events",
    bash_command=(
        "mkdir -p /data && "
        "curl -o /data/events.json "
        "http://localhost:5000/events"
    ),
    dag=dag,
)


def _calculate_stats(input_path, output_path):
    """Calculates event statistics."""
    events = pd.read_json(input_path)
    stats = events.groupby(["date", "user"]).size().reset_index()
    Path(output_path).parent.mkdir(exist_ok=True)
    stats.to_csv(output_path, index=False)


# Compute per-user, per-day statistics from the fetched events.
calculate_stats = PythonOperator(
    task_id="calculate_stats",
    python_callable=_calculate_stats,
    op_kwargs={
        "input_path": "/data/events.json",
        "output_path": "/data/stats.csv",
    },
    dag=dag,
)

# fetch_events must complete before calculate_stats runs.
fetch_events >> calculate_stats
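Before adding a schedule, you can sanity-check individual tasks from the command line; in Airflow 2.x, for example, airflow tasks test 01_unscheduled fetch_events 2022-02-22 runs a single task instance for an example logical date without recording its state in the metadata database.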
Conclusion
Airflow is a popular open-source workflow management tool that provides many features for managing and visualizing workflows. One of its prominent features is the ability to schedule Airflow Jobs using the Airflow Job Scheduler. This article walked through creating a DAG and scheduling Airflow Jobs, along with the parameters that can be used.
Airflow is a trusted tool that many companies use because it is an open-source platform. However, creating pipelines, installing them on your system, and monitoring them can be difficult in Airflow, since it is a code-first platform that requires considerable expertise to run properly. This is where a platform that creates data pipelines without any code can help: an automated data pipeline can replace fully manual solutions to reduce effort and maximize efficiency, and this is where Hevo comes into the picture. Hevo Data is a No-code Data Pipeline with 150+ pre-built integrations that you can choose from.
Visit our website to explore Hevo.
Hevo can help you integrate data from numerous sources and load it into a destination where you can analyze real-time data with a BI tool such as Tableau. It will make your life easier and data migration hassle-free. It is user-friendly, reliable, and secure.
SIGN UP for a 14-day free trial and see the difference!
Share your experience of learning about the Airflow Job Scheduler in the comments section below.
Arsalan is a research analyst at Hevo and a data science enthusiast with over two years of experience in the field. He completed his B.Tech in Computer Science with a specialization in Artificial Intelligence and finds joy in sharing his knowledge with data practitioners. His interest in data analysis and architecture has driven him to write nearly a hundred articles on various topics related to the data industry.