Apache Airflow Hive Connection – 3 Easy Ways

Ofem Eteng • Last Modified: December 29th, 2022


Scheduling and running complex data pipelines is a common scenario in big data, and ensuring that each task in a pipeline executes correctly can be strenuous if the right tools are not used.

A platform that can programmatically author, schedule, and monitor these workflows is therefore essential, and applications that can do this have gained prominence among big data users because they remove much of the unnecessary workload that was formerly incurred. One such scheduling application is Apache Airflow.

Connecting Airflow to data warehousing software that enables reading, writing, and managing large datasets further increases productivity, as you can easily query data in the warehouse through Airflow.

This article aims to show how to connect Airflow, a scheduler, to Apache Hive, a data warehousing platform.


Introduction to Apache Airflow


Apache Airflow is a powerful open-source workflow management tool used to programmatically author, schedule, and monitor data pipelines and workflows. Airflow is widely adopted and is one of the most dependable platforms for managing pipelines.

It is written in Python; therefore, anyone with a basic knowledge of Python can develop workflows as Python scripts.

Users create workflows in Apache Airflow as Directed Acyclic Graphs (DAGs), which contain individual pieces of work known as Tasks, arranged with dependencies and data flows. A simple, friendly Graphical User Interface (GUI) makes it easy to visualize running pipelines, monitor their progress, and resolve issues when required.
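To make the idea concrete, below is a minimal sketch of a DAG with two tasks and one dependency. It assumes Airflow 2.x with the BashOperator available; the DAG name, schedule, and commands are purely illustrative placeholders.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_pipeline",            # hypothetical DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extract data'")
    load = BashOperator(task_id="load", bash_command="echo 'load data'")

    # The dependency defines the direction of the graph: extract runs before load.
    extract >> load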

Key Features of Apache Airflow

Apache Airflow has the following features:

  • Open Source: Airflow is an open-source platform with active users sharing their experience without restrictions, so anyone can create unique libraries and customize operators, executors, and hooks to suit their specific needs.
  • User Interface: Apache Airflow has a user-friendly web interface for monitoring, scheduling, and managing workflows, giving valuable insight into the status of tasks, whether ongoing or completed.
  • Ease of Usage: A rudimentary knowledge of Python is enough to deploy a workflow, and it does not limit the scope of pipelines, enabling more advanced users to build Machine Learning models, transfer data, and much more.
  • Integrations: Airflow integrates with third-party platforms such as Google Cloud Platform, Amazon Web Services, and Microsoft Azure, whose operators can execute tasks within Airflow.
  • Scalable: Airflow's modular architecture uses a message queue to orchestrate an arbitrary number of workers, so it can scale indefinitely and you can create as many dependent workflows as you wish.

Apache Airflow Components 

Apache Airflow generally consists of the following components:

  • Scheduler: This handles the triggering of scheduled workflows and submission of Tasks to the executor to run.
  • Executor: The executor handles running tasks within Airflow. 
  • Webserver: This is the user interface where you can monitor, trigger, and debug the behavior of DAGs and Tasks.
  • DAG Files Folder: The DAG folder contains the DAG files that the scheduler and executor read.
  • Metadata Database: This is a database where all DAG and Task metadata are stored. 

For more information on Apache Airflow, see the official Apache Airflow documentation.

Scale your data integration effortlessly with Hevo’s Fault-Tolerant No Code Data Pipeline

As the ability of businesses to collect data explodes, data teams have a crucial role to play in fueling data-driven decisions. Yet, they struggle to consolidate the scattered data in their warehouse to build a single source of truth. Broken pipelines, data quality issues, bugs and errors, and lack of control and visibility over the data flow make data integration a nightmare.

1000+ data teams rely on Hevo’s Data Pipeline Platform to integrate data from 150+ sources in a matter of minutes. Billions of data events from sources as varied as SaaS apps, databases, file storage, and streaming sources can be replicated in near real-time with Hevo’s fault-tolerant architecture. What’s more, Hevo puts complete control in the hands of data teams with intuitive dashboards for pipeline monitoring, auto-schema management, and custom ingestion/loading schedules. 

All of this combined with transparent pricing and 24×7 support makes us the most loved data pipeline software on review sites.

Take our 14-day free trial to experience a better way to manage data pipelines.

Get started for Free with Hevo!

Introduction to Apache Hive


Apache Hive is a data warehousing application that runs on top of the Apache Hadoop framework and enables users to read, write, and manage petabytes of data residing in distributed storage using an SQL-like interface. It provides data querying and analytics capabilities at a massive scale, supporting informed, data-driven decisions.

Apache Hive uses the Hive Query Language (HQL), whose structures are similar to standard SQL statements for data querying. MapReduce transformations can be expressed as simple HQL queries, and Hive then generates the map and reduce functions that run on Apache Hadoop’s distributed framework.

The datasets on Hive are stored directly in either the Hadoop Distributed File System (HDFS) or other data storage systems such as Apache HBase. Similar to other Database Management Systems (DBMS), Hive queries can be run from a command-line interface known as the Hive Shell, from a Java Database Connectivity (JDBC) application, or from an Open Database Connectivity (ODBC) application.

Key Features of Hive

  • Fast: Hive can handle petabytes of data quickly using batch processing, making the overall process of reading, writing, and querying data very fast.
  • Scalability: Hive makes it easy to scale and distribute data to meet specific needs.
  • Ease of Usage: Since Hive provides an SQL-like interface, it is easy to use and readily accessible to non-core programmers.

Setting up Apache Airflow Hive Connection

An Airflow connection is an object that stores the credentials and other information necessary for connecting to an external service. Connections can be defined for numerous external services in environment variables, in an external Secrets Backend, or in the Airflow metadata database using the CLI or web UI.
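As a minimal sketch of what such a connection object looks like in code, the snippet below builds a Connection and prints the URL-encoded URI that an AIRFLOW_CONN_* environment variable expects. It assumes Airflow 2.x; the connection ID, host, credentials, and port are placeholders.

from airflow.models.connection import Connection

conn = Connection(
    conn_id="my_hive_cli",           # hypothetical connection ID
    conn_type="hive_cli",
    host="jdbc-hive-host",
    login="beeline-username",
    password="beeline-password",
    port=10000,
    schema="default",
    extra='{"use_beeline": true}',   # extras are stored as JSON
)

# get_uri() returns the URL-encoded string to place in AIRFLOW_CONN_MY_HIVE_CLI
print(conn.get_uri())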

For the Apache Airflow Hive connection, three connection types can be used, namely:

Hive CLI Connection

The Hive CLI connection type is used by the Apache Airflow Hive CLI integrations. With this method, you can connect to Hive either through Hive Beeline or the Hive CLI, but only one authorization method can be used at a time. If you need to manage multiple credentials or keys, you can do so by creating multiple connections.

  1. Using the Hive Beeline: This involves making a JDBC connection using a host, port, and schema, or connecting to Hive through a proxy user, and then specifying a login and password.
  2. Using the Hive CLI: In this case, you will specify the Hive CLI parameters in the Extra field.

Configuring Connection

To execute the Apache Airflow Hive connection using the Hive CLI Connection with either of the two methods listed above, the first step is to configure the connection using the following optional parameters:

  • Login: This is used to specify the username for a proxy user or the Beeline CLI.
  • Password: This is used to specify the Beeline CLI password.
  • Host: This is used to specify the JDBC Hive host that will be used for Hive Beeline.
  • Port: This is used to specify the JDBC Hive port that will be used for the Hive Beeline.
  • Schema: This is used to specify the JDBC Hive database that will be connected to using the Beeline.
  • Extra: These are optional extra parameters that can be used in the Hive CLI connection; they include the following:
    • hive_cli_params: This is used to specify CLI parameters for use with the Beeline CLI and Hive CLI.
    • use_beeline: This is set to True if using the Beeline CLI; the default setting is False.
    • auth: This is used to specify the auth type for use with Hive Beeline CLI.
    • proxy_user: This is used to specify a proxy user as an owner. It can also be kept blank if using a custom proxy user.
    • principal: This is used to specify the JDBC Hive principal to be used with Hive Beeline.

When using the Hive CLI connection method, all hooks and operators related to the Hive CLI use hive_cli_default as the default connection ID. When specifying the connection in an environment variable, the value should be URL-encoded, as shown below.

AIRFLOW_CONN_HIVE_CLI_DEFAULT='hive-cli://beeline-username:beeline-password@jdbc-hive-host:80/hive-database?hive_cli_params=params&use_beeline=True&auth=noSasl&principal=hive%2F_HOST%40EXAMPLE.COM'
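Once the connection exists, a hook can use it by ID. The snippet below is a minimal sketch using HiveCliHook from the Hive provider package; the HQL statement is illustrative, and it assumes apache-airflow-providers-apache-hive is installed.

from airflow.providers.apache.hive.hooks.hive import HiveCliHook

# hive_cli_conn_id defaults to "hive_cli_default", so the argument is optional here.
hook = HiveCliHook(hive_cli_conn_id="hive_cli_default")

# run_cli() submits an HQL string through the Hive CLI or Beeline,
# depending on the use_beeline setting stored on the connection.
hook.run_cli("SHOW DATABASES;")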

Hive Metastore Connection

This connection type is used for the Apache Airflow Hive Metastore integrations. Authentication with the Hive Metastore is done through the Apache Thrift Hive Server, an optional service that allows a remote client to submit requests to Hive, using hmsclient, a project that provides an up-to-date Python client for interacting with the Hive Metastore over the Thrift protocol. The default connection ID for the Hive Metastore is metastore_default and is used by all hooks and operators related to the Hive Metastore connection.

Configuring Connection

To execute the Apache Airflow Hive connection using the Hive Metastore Connection, the first step is to configure the connection using the following optional parameters:

  • Host: This is used to specify the host of your Metastore node. Multiple hosts can also be specified by using a comma-separated list.
  • Port: This is used to specify the Metastore port number.
  • Extra: This is used to specify extra optional parameters for the Hive Metastore connection; they include:
    • authMechanism: This is used to specify the mechanism for authentication. The default is NOSASL.
    • kerberos_service_name: This is used to specify the Kerberos service name. The default is hive.

When specifying the connection in the environment variable, the Apache Airflow Hive connection should be done using the URI syntax and should be URL-encoded as shown below. 

AIRFLOW_CONN_METASTORE_DEFAULT='hive-metastore://hive-metastore-node:80?authMechanism=NOSASL'
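As a minimal sketch, the snippet below uses HiveMetastoreHook against the metastore_default connection to check whether a partition exists; the database, table, and partition spec are placeholders, and it assumes the Hive provider package is installed.

from airflow.providers.apache.hive.hooks.hive import HiveMetastoreHook

hook = HiveMetastoreHook(metastore_conn_id="metastore_default")

# check_for_partition() returns True if the given partition exists in the table.
exists = hook.check_for_partition("default", "my_table", "dt='2022-12-29'")
print(exists)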

Hive Server2 Connection

This method is used for Hive Server2 integrations, and the connection to Hive Server2 is made using PyHive. For authentication, LDAP, Kerberos, or a custom method can be used.

The default connection ID for the Hive Server2 connection is hiveserver2_default and should be used for all hooks and operators associated with Hive Server2.

Configuring Connection

To carry out the Apache Airflow Hive connection using Hive Server2 Connection, you will have to configure the following parameters:

  • Login: This is used to specify your Hive Server2 username.
  • Password: This is used to specify the Hive password that will be used with LDAP and custom authentication.
  • Host: This is used to specify the host node to be used in the connection.
  • Port: This is used to specify the Hive Server2 port number.
  • Schema: This is used to specify the name of the associated database in which the connection of Hive Server2 will be made.
  • Extra: This is used to specify extra optional parameters for the Hive Server2 connection; they include:
    • authMechanism: This is used to specify the authentication method for PyHive; you can choose PLAIN, LDAP, KERBEROS, or custom. The default is PLAIN.
    • kerberos_service_name: This is used to specify the Kerberos service name if you are authenticating with Kerberos. The default is hive.
    • run_set_variable_statements: This is used to specify if you want to run set variable statements. Its default is True.

When specifying the connection in the environment variable, it should be done using the URI syntax and should be URL-encoded as shown below. 

AIRFLOW_CONN_HIVESERVER2_DEFAULT='hiveserver2://username:password@hiveserver2-node:80/database?authMechanism=LDAP'
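With the connection in place, queries can be run through HiveServer2Hook. The snippet below is a minimal sketch; the query and table name are placeholders, and it assumes the Hive provider package and PyHive are installed.

from airflow.providers.apache.hive.hooks.hive import HiveServer2Hook

hook = HiveServer2Hook(hiveserver2_conn_id="hiveserver2_default")

# get_records() runs the HQL over the Hive Server2 connection and
# returns the result rows as a list of tuples.
rows = hook.get_records("SELECT * FROM my_table LIMIT 10")
print(rows)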

Apache Hive Operators

As stated before, the Apache Hive data warehouse software facilitates reading, writing, and managing enormous datasets residing in distributed storage using SQL.

After creating the necessary connection, the HiveOperator can be used to project structure onto data already in storage by executing HQL code or a Hive script against a specific Hive database, as shown in the code below.

from airflow.providers.apache.hive.operators.hive import HiveOperator

# Load a file from HDFS into a partitioned Hive table; hdfs_dir, channel,
# file_name, and dt are assumed to be defined earlier in the DAG file.
load_to_hive = HiveOperator(
    task_id=f"load_{channel}_to_hive",
    hql=(
        f"LOAD DATA INPATH '{hdfs_dir}{channel}/{file_name}' "
        f"INTO TABLE {channel} "
        f"PARTITION(dt='{dt}')"
    ),
)
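To show how the operator ties back to the connection configured earlier, here is a minimal sketch of a DAG that runs a single HiveOperator task against the hive_cli_default connection; the DAG name, table, and HQL statement are illustrative.

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.hive.operators.hive import HiveOperator

with DAG(
    dag_id="hive_example",                     # hypothetical DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_table = HiveOperator(
        task_id="create_table",
        hive_cli_conn_id="hive_cli_default",   # the connection ID configured above
        hql="CREATE TABLE IF NOT EXISTS demo (id INT, name STRING)",
    )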

Airflow Providers

Airflow providers are packages that contain the core Operators and Hooks for a given service. The following are the sub-packages of the airflow.providers.apache.hive provider (installed as apache-airflow-providers-apache-hive) used for the Airflow Hive connection.

Subpackages:

  • airflow.providers.apache.hive.hooks
    • airflow.providers.apache.hive.hooks.hive
  • airflow.providers.apache.hive.operators
    • airflow.providers.apache.hive.operators.hive
    • airflow.providers.apache.hive.operators.hive_stats
  • airflow.providers.apache.hive.sensors
    • airflow.providers.apache.hive.sensors.hive_partition
    • airflow.providers.apache.hive.sensors.metastore_partition
    • airflow.providers.apache.hive.sensors.named_hive_partition
  • airflow.providers.apache.hive.transfers
    • airflow.providers.apache.hive.transfers.hive_to_mysql
    • airflow.providers.apache.hive.transfers.hive_to_samba
    • airflow.providers.apache.hive.transfers.mssql_to_hive
    • airflow.providers.apache.hive.transfers.mysql_to_hive
    • airflow.providers.apache.hive.transfers.s3_to_hive
    • airflow.providers.apache.hive.transfers.vertica_to_hive
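For reference, these sub-packages are imported in the usual way; the classes below are a few commonly used ones (this sketch assumes the apache-airflow-providers-apache-hive package is installed).

from airflow.providers.apache.hive.hooks.hive import HiveServer2Hook
from airflow.providers.apache.hive.operators.hive import HiveOperator
from airflow.providers.apache.hive.sensors.hive_partition import HivePartitionSensor
from airflow.providers.apache.hive.transfers.mysql_to_hive import MySqlToHiveOperator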

Conclusion

This article has shown easy steps to set up the Apache Airflow Hive connection. Airflow enables the scheduling and execution of tasks, whereas Hive is a data warehousing system.

Combining the two, as described in this article, lets you handle big data with flexible, well-monitored workflows.

If you want your data analysis to be hassle-free, an automated data pipeline such as Hevo comes into the picture.

Visit our Website to Explore Hevo

Hevo Data will automate your data transfer process, allowing you to focus on other aspects of your business, such as Analytics and Customer Management. Hevo provides 150+ Data Sources (including 40+ Free Sources) that connect to 15+ Destinations, loading your data for real-time analysis at a transparent price and making Data Migration hassle-free.

SIGN UP for a 14-day free trial and see the difference!

Share your thoughts on Apache Airflow Hive Connection in the comments section below.
