Are you curious about how you can use Airflow to run bash commands? The Airflow BashOperator accomplishes exactly that. Operators and sensors (which are also a type of operator) are used in Airflow to define tasks. An Airflow Operator becomes a task of a DAG (Directed Acyclic Graph) once it has been instantiated within that DAG. Airflow supports various operators such as BashOperator, PythonOperator, EmailOperator, SimpleHttpOperator, and many more.

The Airflow BashOperator allows you to specify any given Shell command or script and add it to an Airflow workflow. This can be a great start to implementing Airflow in your environment. This post highlights the details of the Airflow BashOperator. You will learn its syntax and method parameters, and try an example of creating different tasks using the Airflow BashOperator. Before you get started, let’s get familiar with what Airflow is and why it’s so prominent in the industry.

What is Apache Airflow?


Apache Airflow is an open-source application for writing, scheduling, and monitoring workflows. It’s one of the most trusted solutions for orchestrating operations or Pipelines among Data Engineers. Your Data Pipelines can all be monitored in real-time. Airflow has evolved into one of the most powerful open source Data Pipeline systems currently offered in the market.


Airflow allows users to create workflows as DAGs (Directed Acyclic Graphs) of tasks. Visualizing pipelines in production, monitoring progress, and resolving issues is a snap with Airflow’s robust User Interface. It connects to a variety of data sources and can send notifications to users through email or Slack when a process is completed or fails. Since it is distributed, scalable, and flexible, it is ideal for orchestrating complicated Business Logic.

Key Features of Apache Airflow

Let’s have a look at some of the outstanding features that set Airflow apart from its competitors:

  • Easy to Use: An Airflow Data Pipeline can be readily set up by anybody familiar with the Python programming language. Users can develop Machine Learning models, manage infrastructure, and send data with no restrictions on pipeline scope. It also enables users to pick up where they left off without having to restart the entire operation.
  • Robust Pipelines: Airflow pipelines are simple and robust. It’s built on the advanced Jinja template engine, which allows you to parameterize your scripts. Furthermore, owing to the advanced scheduling semantics, users can run pipelines at regular intervals.
  • Scalable: Airflow is a modular solution that orchestrates an arbitrary number of workers via a message queue. It’s a general-purpose orchestration framework with a user-friendly set of features.
  • High Extensibility with Robust Integrations: Airflow offers many operators to operate on Google Cloud Platform, Amazon Web Services, and a variety of other third-party platforms. As a result, integrating next-generation technologies into existing infrastructure and scaling up is simple.
  • Pure Python: Users can create Data Pipelines with Airflow by leveraging basic Python features like datetime formats for scheduling and loops for dynamically creating tasks (see the short sketch after this list). This gives users as much power as possible when creating Data Pipelines.
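
For instance, dynamically generating tasks can be as simple as looping over a list of names inside the DAG definition. The short sketch below is illustrative only: the DAG id, folder names, and schedule are placeholders rather than anything prescribed by Airflow.

# A minimal sketch of creating BashOperator tasks dynamically in a loop
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

with DAG(
        dag_id="dynamic_bash_tasks",
        start_date=days_ago(1),
        schedule_interval=timedelta(days=1),
        catchup=False) as dag:

    previous = None
    for folder in ["raw", "staging", "final"]:
        # One BashOperator per folder name in the list
        task = BashOperator(
            task_id=f"make_{folder}_directory",
            bash_command=f"mkdir -p {folder}",
        )
        # Chain the tasks so they run one after another
        if previous:
            previous >> task
        previous = task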

Do you want to learn more about Apache Airflow’s other significant features and benefits? Refer to the Airflow Official Page.

What is Airflow BashOperator?


The Airflow BashOperator is used to run a Bash script, command, or group of commands on the host system. You can import the Airflow BashOperator using the following command (in Airflow 2.x, the same operator is also available from the airflow.operators.bash module):

from airflow.operators.bash_operator import BashOperator

Airflow BashOperator Method Syntax:

class airflow.operators.bash.BashOperator(*, bash_command: str, env: Optional[Dict[str, str]] = None, output_encoding: str = 'utf-8', skip_exit_code: int = 99, cwd: str = None, **kwargs)

Airflow BashOperator Method Parameters:

  • bash_command: The command, set of commands, or reference to a bash script to run.
  • env: If env is not None, it must be a dictionary that defines the new process’s environment variables; these are used instead of inheriting the current process’s environment, which is the default behavior.
  • output_encoding: The encoding used to decode the bash command’s output (defaults to ‘utf-8’).
  • skip_exit_code: If the command exits with this exit code (99 by default), the task is left in the skipped state instead of being marked as failed.
  • cwd: The working directory in which the command should be run. If None (the default), the command is executed in a temporary directory.
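
As an illustration, the hedged sketch below shows how several of these parameters fit together. The DAG id, task id, command, environment variable, and directory are hypothetical, and it assumes an Airflow version whose BashOperator supports the cwd parameter (as in the signature above).

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

# Hypothetical DAG that just holds the example task
with DAG(
        dag_id="bash_operator_parameters_demo",
        start_date=days_ago(1),
        schedule_interval=None) as dag:

    print_greeting = BashOperator(
        task_id="print_greeting",
        # bash_command: the shell command (or path to a .sh script) to run
        bash_command="echo $GREETING",
        # env: replaces (rather than extends) the environment of the bash process
        env={"GREETING": "hello from Airflow"},
        # cwd: run the command from this directory instead of a temporary one
        cwd="/tmp",
        # output_encoding: decode the command's output with this encoding
        output_encoding="utf-8",
    )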

Airflow BashOperator Exit Code:

Airflow evaluates the exit code of the bash command. The following list summarizes the possible exit codes and their behavior:

  • 0: Success.
  • skip_exit_code (default: 99): Raises airflow.exceptions.AirflowSkipException and the task is marked as skipped.
  • Any other non-zero exit code: Raises airflow.exceptions.AirflowException and the task is marked as failed.
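
To see this behavior in practice, here is a hedged sketch of a task that is skipped when its input file is missing; the DAG id, task id, and file path are made up for illustration.

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

# Hypothetical DAG and file path, purely for illustration
with DAG(
        dag_id="skip_exit_code_demo",
        start_date=days_ago(1),
        schedule_interval=None) as dag:

    maybe_skip = BashOperator(
        task_id="skip_if_no_input",
        # If /data/input.csv is missing, the command exits with 99 (the default
        # skip_exit_code), so the task ends up in the skipped state; any other
        # non-zero exit code would raise AirflowException and fail the task.
        bash_command="test -f /data/input.csv || exit 99",
    )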
Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services, and simplifies the ETL process. It supports 150+ Data Sources including 40+ Free Sources. Setting it up is a simple 3-step process: select the data source, provide valid credentials, and choose the destination.

Hevo loads the data onto the desired Data Warehouse/destination in real-time, enriches it, and transforms it into an analysis-ready form without you having to write a single line of code. Its completely automated, fault-tolerant, and scalable pipeline architecture ensures that data is handled in a secure, consistent manner with zero data loss, and it supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

GET STARTED WITH HEVO FOR FREE

How to Use the Airflow BashOperator?

Now that you have gained a basic understanding of Airflow BashOperator, its syntax, and parameters, in this section, you will learn how to create tasks and run the workflow.

So, create a DAG folder and upload the Python script below. Airflow will load it and display the DAG in the Airflow user interface, where you can either trigger it manually or set it to trigger automatically.

The following Python script (Source) uses the Airflow BashOperator to create 2 tasks: Task 1 creates a new directory, whereas Task 2 deletes that directory. Copy this script into your DAG folder and it will automatically get loaded into the server.

# Import all important packages
import datetime
from airflow import models
from airflow.operators.bash_operator import BashOperator   # import the BashOperator

# Get yesterday's timestamp so the DAG can start running immediately
yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

# Create a dictionary of default arguments that is passed to each task's constructor
default_dag_args = {
    'start_date': yesterday,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    # Cloud Composer example: reads the project id from an Airflow Variable named 'gcp_project'
    'project_id': models.Variable.get('gcp_project')
}

with models.DAG(
        'Bash_operations',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    # Task 1: Create a directory with the given folder name
    t1 = BashOperator(
        task_id='make_directory', bash_command='mkdir folder_name', dag=dag)

    # Task 2: Delete the directory with the given folder name
    t2 = BashOperator(
        task_id='delete_directory', bash_command='rm -rf folder_name', dag=dag)

    # Set task dependency: Task 1 runs before Task 2
    t1 >> t2

For the above script you can visualize the following Task graph:

Airflow BashOperator Example Task Graph

The arrow in the above graph denotes the task dependency, i.e., the delete_directory task (Task 2) runs only after the make_directory task (Task 1) completes.

Great Work! You have now worked through a very basic example of using the Airflow BashOperator. You can either customize the above script or create a new one from scratch to learn more. You can also refer to the BashOperator — Airflow Documentation for more details.
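
One simple customization, for example, is to template the bash_command with Airflow’s built-in Jinja variables. The hypothetical task below would slot into the with models.DAG(...) block of the script above and creates a date-stamped directory for each run; the task id and path are placeholders.

# Hypothetical extra task for the DAG above: {{ ds }} is Airflow's built-in Jinja
# variable for the run's logical (execution) date, so each run gets its own folder
t3 = BashOperator(
    task_id='make_dated_directory',
    bash_command='mkdir -p /tmp/airflow_runs/{{ ds }}',
    dag=dag)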

Examples of Airflow BashOperator

Apache Airflow is used by many firms, including Slack, Robinhood, Freetrade, 9GAG, Square, Walmart, and others. Let’s take a look at how you can use Airflow BashOperator with leading Data Warehouses like Google BigQuery and with Amazon Managed Workflows for Apache Airflow.

1) Call the BigQuery bq command

In an Apache Airflow DAG, you can use the Airflow BashOperator to invoke the BigQuery bq command as shown below.

from airflow.operators import bash

# Note: this snippet is a fragment of a larger DAG file; it assumes it runs inside
# a DAG context and that bq_dataset_name is defined elsewhere in that file.

# Create BigQuery output dataset.
make_bq_dataset = bash.BashOperator(
    task_id='make_bq_dataset',
    # Executing the 'bq' command requires the Google Cloud SDK, which comes
    # preinstalled in Cloud Composer.
    bash_command=f'bq ls {bq_dataset_name} || bq mk {bq_dataset_name}')

Explore more about this example here.

2) Run Bash commands in Amazon Managed Workflows for Apache Airflow (MWAA)

The Airflow BashOperator can be used to perform bash commands from a DAG in the Amazon Managed Workflows for Apache Airflow (MWAA). Consider the following scenario:

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

with DAG(dag_id="any_bash_command_dag", schedule_interval=None, catchup=False, start_date=days_ago(1)) as dag:
    # The bash command to run is supplied at trigger time via the DAG run
    # configuration (dag_run.conf), so this DAG can run any arbitrary command
    cli_command = BashOperator(
        task_id="bash_command",
        bash_command="{{ dag_run.conf['command'] }}"
    )
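
Note that this DAG does nothing useful on its own: the actual bash command is supplied at trigger time through the DAG run configuration (dag_run.conf). For example, triggering the DAG with a JSON payload such as {"command": "echo hello"}, either from the trigger-with-config option in the Airflow UI or via the Airflow CLI’s trigger command with a --conf argument, renders the bash_command as echo hello.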

Explore more about this example here.


Conclusion

This post provided a comprehensive overview of the Airflow BashOperator. You understood its syntax, parameters, and various exit codes. In addition, you were introduced to the key features of Airflow. Furthermore, you got hands-on experience creating and deleting directories using the Airflow BashOperator. Towards the end of the article, you also explored various real-world use cases of the Airflow BashOperator.

As a Developer, extracting complex data from a diverse set of data sources like CRMs, Project Management Tools, Streaming Services, and Marketing Platforms to your desired data destination can be quite challenging. This is where a simpler alternative like Hevo can save your day! Hevo Data is a No-Code Data Pipeline that offers a faster way to move data from 100+ Data Sources including 40+ Free Sources, into your Data Warehouse to be visualized in a BI tool. It is robust, fully automated, and hence does not require you to code.

VISIT OUR WEBSITE TO EXPLORE HEVO

Want to take Hevo for a spin?

SIGN UP and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Share your experience with the Airflow BashOperator in the comments section below!

Shubhnoor Gill
Research Analyst, Hevo Data

Shubhnoor is a data analyst with a proven track record of translating data insights into actionable marketing strategies. She leverages her expertise in market research and product development, honed through experience across diverse industries and at Hevo Data. Currently pursuing a Master of Management in Artificial Intelligence, Shubhnoor is a dedicated learner who stays at the forefront of data-driven marketing trends. Her data-backed content empowers readers to make informed decisions and achieve real-world results.