Apache Airflow is a popular workflow management tool that provides functionality for organizing, scheduling, and monitoring workflows. It is open source and therefore free to use, and its open-source nature also gives it strong community support. The platform is used to orchestrate workflows, and it is one of the most widely used and recommended tools among data engineers.

This tool provides functionality such as visualizing workflows and data pipelines, scheduling Airflow Jobs, checking the current status of workflows, viewing the logs related to a workflow, and inspecting the code behind each workflow in detail. Airflow is a highly scalable, distributed solution, and it can connect to a wide range of sources, which makes it flexible. These features help in solving complex workflow and data pipeline problems efficiently.

One of its main features is the ability to schedule Airflow Jobs. Airflow provides scheduling functions with parameters that can be adjusted to your requirements. This article is a guide to using the scheduler to schedule Airflow Jobs.

What is Airflow?

Apache Airflow is a workflow management engine that specializes in executing complex data pipelines. Its main duty is to ensure that all the steps of a workflow are executed in an order that is defined before execution, and that every task gets the resources it needs to run efficiently.

Apache Airflow is an open-source, free-to-use platform for scheduling and executing complex workflows. Being open source, it offers broad support for designing workflow architectures, and it is considered one of the most powerful workflow management tools currently on the market.

Airflow makes use of DAGs, short for directed acyclic graphs. These are used to structure workflows, and the main idea behind Airflow is that every workflow can be represented as code.
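
For example, a simple three-step workflow can be written in a few lines of Python, with the >> operator defining the acyclic dependencies between tasks. This is only a minimal sketch; the DAG and task names below are illustrative, not part of any standard example.

import datetime as dt
from airflow import DAG
from airflow.operators.bash import BashOperator

# A small, illustrative workflow: three shell tasks chained into a DAG.
with DAG(
    dag_id="example_pipeline",
    start_date=dt.datetime(2022, 2, 22),
    schedule_interval="@daily",
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # The dependencies form a directed acyclic graph: extract -> transform -> load.
    extract >> transform >> load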

Features of Apache Airflow

  • Ease of use: Deploying Airflow is easy, as it requires only a little knowledge of Python.
  • Open Source: It is a free-to-use, open-source platform, which has resulted in a large community of active users.
  • Good Integrations: It has readily available integrations for working with platforms like Google Cloud, Amazon AWS, and many more.
  • Standard Python for coding: Even relatively little knowledge of Python is enough to create complex workflows.
  • User Interface: Airflow’s UI helps in monitoring and managing workflows and provides a view of the status of each task.
  • Dynamic: Since Airflow is based on Python, anything you can do in Python can be done in an Airflow workflow.
  • Highly Scalable: Airflow allows the execution of thousands of different tasks per day.

Key features of Airflow

  • Pure Python: Create your workflows using standard Python features like date-time formats for scheduling and loops to generate tasks dynamically. When creating workflows, this gives you complete flexibility.
  • Useful UI: A robust and modern web application that lets you track, schedule, and manage your workflows. There’s no need to get your hands dirty with old cron-style interfaces. You always have complete visibility into the status of completed and ongoing tasks, as well as their logs.
  • Robust Integrations: Many plug-and-play operators are available in Airflow, ready to execute your tasks on Google Cloud Platform, Amazon Web Services, Microsoft Azure, and a variety of other third-party services. As a result, Airflow is simple to integrate into existing infrastructure and extend to next-generation technologies.
  • Easy to Use: A workflow can be deployed by anyone who knows Python. Apache Airflow has no restrictions on the types of pipelines you can create; you can use it to create machine learning models, transfer data, manage your infrastructure, and more.
  • Open Source: Wherever you want to share an improvement, you can simply open a PR. That’s all there is to it; no barriers, no lengthy procedures. Airflow has many active users who willingly share their experiences.

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 100+ data sources (including 40+ free data sources) like Asana, and loading data is a simple 3-step process: select the data source, provide valid credentials, and choose the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.

GET STARTED WITH HEVO FOR FREE

Its completely automated pipeline delivers data in real time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
SIGN UP HERE FOR A 14-DAY FREE TRIAL

How to Schedule Airflow Jobs?

The Airflow Job Scheduler is a tool that monitors the DAGs in Airflow and triggers those DAGs whose dependencies have been met. Once the scheduler is started, it runs continuously to monitor and sync the DAG folder.

The Airflow job scheduler is designed to run in the background as a persistent service in an Airflow production environment. To start the airflow job scheduler you need to execute the Airflow Scheduler command. It will use the configuration specified in airflow.cfg.

The Airflow Job Scheduler runs jobs with schedule_interval AFTER the start date, at the END of the period.

For example, if you run a DAG with a schedule_interval of one day, the run stamped 2022-02-22 will be triggered soon after 2022-02-22 23:59. In other words, the job instance is started once the period it covers has ended.

To start the airflow job scheduler, simply run the command:

airflow scheduler
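
To make this timing behaviour concrete, here is a small sketch of a daily DAG and when its first run is triggered (the dag_id is illustrative):

import datetime as dt
from airflow import DAG

dag = DAG(
    dag_id="daily_example",            # illustrative name
    schedule_interval="@daily",
    start_date=dt.datetime(2022, 2, 22),
)
# The run stamped 2022-02-22 covers the interval from 2022-02-22 00:00 to
# 2022-02-23 00:00, so the scheduler triggers it shortly after midnight on
# 2022-02-23, i.e. once the period it covers has ended.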

DAG Runs

A DAG Run is an object representing an instantiation of a DAG at a point in time; it is what the Airflow Job Scheduler works with when scheduling jobs.

A DAG may or may not have a schedule, and this schedule informs how DAG Runs are created. schedule_interval is a DAG argument that accepts either a cron expression as a str or a datetime.timedelta object. For example:

dag = DAG(
    dag_id="02_daily_schedule",
    schedule_interval="@daily",
    start_date=dt.datetime(2022, 2, 22),
    …
)

DAG parameters

Alternatively, you can also use one of these cron “presets”:

preset      meaning                                                        cron
None        Don’t schedule; use for exclusively “externally triggered” DAGs
@once       Schedule once and only once
@hourly     Run once an hour at the beginning of the hour                  0 * * * *
@daily      Run once a day at midnight                                     0 0 * * *
@weekly     Run once a week at midnight on Sunday morning                  0 0 * * 0
@monthly    Run once a month at midnight on the first day of the month     0 0 1 * *
@yearly     Run once a year at midnight of January 1                       0 0 1 1 *

Note: Use schedule_interval=None and not schedule_interval=’None’ when you don’t want to schedule your DAG.
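
Besides these presets, schedule_interval also accepts a raw cron expression (as a str) or a datetime.timedelta object, as mentioned above. A minimal sketch (the dag_ids below are illustrative):

import datetime as dt
from airflow import DAG

# Cron expression: run every day at 01:30.
cron_dag = DAG(
    dag_id="cron_schedule_example",
    schedule_interval="30 1 * * *",
    start_date=dt.datetime(2022, 2, 22),
)

# datetime.timedelta: run every three days, counted from the start date.
timedelta_dag = DAG(
    dag_id="timedelta_schedule_example",
    schedule_interval=dt.timedelta(days=3),
    start_date=dt.datetime(2022, 2, 22),
)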

Your DAG will be instantiated once for each schedule interval, and a DAG Run entry will be created for each of these intervals.

Each DAG Run has a state associated with it: running, failed, or success. This state helps the scheduler decide whether a task should be evaluated for task submission.

Without this metadata at the DAG Run level, the Airflow Job Scheduler would have much more work to do to figure out which tasks should be triggered, and it would slow to a crawl. It might also create extra processing overhead when the shape of a DAG changes, for example when new tasks are added.

DAG example

Let us create a DAG and define all of its parameters. The example below fetches events from a local API endpoint and then computes per-user statistics from them.

import datetime as dt
from pathlib import Path

import pandas as pd
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

# Unscheduled DAG: schedule_interval=None means it only runs when triggered manually.
dag = DAG(
    dag_id="01_unscheduled",
    start_date=dt.datetime(2022, 2, 22),
    schedule_interval=None,
)

# Fetch the raw events from the API and store them as JSON.
fetch_events = BashOperator(
    task_id="fetch_events",
    bash_command=(
        "mkdir -p /data && "
        "curl -o /data/events.json "
        "http://localhost:5000/events"
    ),
    dag=dag,
)


def _calculate_stats(input_path, output_path):
    """Calculates event statistics."""
    events = pd.read_json(input_path)
    stats = events.groupby(["date", "user"]).size().reset_index()
    Path(output_path).parent.mkdir(exist_ok=True)
    stats.to_csv(output_path, index=False)


# Compute per-user, per-date statistics from the fetched events.
calculate_stats = PythonOperator(
    task_id="calculate_stats",
    python_callable=_calculate_stats,
    op_kwargs={
        "input_path": "/data/events.json",
        "output_path": "/data/stats.csv",
    },
    dag=dag,
)

# Set the task order: fetch the events before calculating statistics.
fetch_events >> calculate_stats
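
The DAG above is unscheduled (schedule_interval=None), so it only runs when triggered manually. As a rough sketch of how you could schedule it instead (the dag_id and dates below are only illustrative), the same definition can be given a daily interval and, optionally, an end_date after which no further runs are created:

dag = DAG(
    dag_id="02_daily_events",                # illustrative name
    start_date=dt.datetime(2022, 2, 22),
    end_date=dt.datetime(2022, 3, 1),        # optional: stop creating new runs after this date
    schedule_interval="@daily",
)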

Conclusion

Airflow is a popular open-source workflow management tool that provides many features to manage and visualize workflows. One of its prominent features is the ability to schedule Airflow Jobs using the Airflow Job Scheduler. This article walked through creating a DAG and scheduling Airflow Jobs, along with the parameters that can be used.

Airflow is a trusted tool that many companies use, in part because it is an open-source platform. However, creating pipelines, installing them on a system, and monitoring them can be difficult in Airflow, since it is a code-first platform that requires significant expertise to run properly. This issue can be solved by a platform that creates data pipelines without any code. An automated data pipeline can be used in place of fully manual solutions to reduce effort and attain maximum efficiency, and this is where Hevo comes into the picture. Hevo Data is a No-code Data Pipeline with 100+ pre-built integrations that you can choose from.

Visit our website to explore Hevo.

Hevo can help you integrate your data from numerous sources and load it into a destination to analyze real-time data with a BI tool such as Tableau. It will make your life easier and data migration hassle-free. It is user-friendly, reliable, and secure.

SIGN UP for a 14-day free trial and see the difference!

Share your experience of learning about the Airflow Job Scheduler in the comments section below.

Former Research Analyst, Hevo Data

Arsalan is a data science enthusiast with a keen interest in data analysis and architecture, and he focuses on writing highly technical content. He has written around 100 articles on various topics related to the data industry.

No-code Data Pipeline For Your Data Warehouse

Get Started with Hevo