Organizations today are investing in big data analytics because they produce petabytes of data every day from different sources. The need therefore arises to analyze these large datasets to uncover trends, hidden patterns, previously unknown correlations, and other information that is useful to the growth of the enterprise.

As data volumes grow, applications that offer favorable compute environments and distributed storage, especially those found in the cloud, are in high demand: everyone is racing to uncover analytical findings that provide a competitive edge over rivals and improve marketing, sales, and revenue-generating opportunities.

Deriving these analytical insights is now a top priority for businesses. With ever-higher demands from clients to use such applications to solve a wide range of problems, it is now commonplace to connect two of them, since a single application rarely meets an organization's entire computing needs on its own.

In this article, you will learn how to integrate Hadoop and Snowflake, two applications widely used by Developers, Data Scientists, Analysts, and Engineers, using the Hive Metastore connector to bring their strengths together. The connected platforms allow users to manage their schema in Hive while querying it from Snowflake.

Introduction to Hadoop


Hadoop is an open-source software framework used to process and store data for big data applications on clusters of commodity servers. It provides massive storage for any kind of data, enormous processing power, and the ability to handle a virtually unlimited number of concurrent tasks or jobs through parallel processing.

It is the bedrock of big data technologies that are used to support advanced analytics initiatives, including predictive analytics, data mining, and machine learning. 

The Hadoop platform has tools that can extract data from source systems, be it log files, machine data, or online databases, and load it into Hadoop in record time. It is also possible to perform transformations on the fly. Complex ETL jobs can be deployed and executed in a distributed manner thanks to the programming and scripting frameworks on Hadoop, as illustrated in the sketch below.
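For instance, a SQL-on-Hadoop engine such as Hive lets you express one of these distributed transformations declaratively. The HiveQL below is a purely hypothetical sketch: the table names, columns, and HDFS path are invented for illustration.

-- Expose raw, tab-separated log files already sitting in HDFS as a Hive table.
CREATE EXTERNAL TABLE raw_logs (
  event_time STRING,
  user_id    STRING,
  url        STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw_logs/';

-- Transform on the fly: aggregate page views per day into a new table.
-- Hive compiles this into distributed jobs (MapReduce, Tez, or Spark).
CREATE TABLE daily_page_views AS
SELECT to_date(event_time) AS view_date,
       url,
       COUNT(*) AS views
FROM raw_logs
GROUP BY to_date(event_time), url;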

The core of Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part based on the MapReduce programming model. The Hadoop framework is composed of the following modules:

  • Hadoop Distributed File System (HDFS): A distributed file system that stores data in an easily accessible format across the nodes of a cluster.
  • Hadoop MapReduce: As the name implies, it carries out two basic operations: it maps input data into intermediate key-value pairs suitable for analysis (Map) and then aggregates those results (Reduce), implementing the MapReduce programming model for large-scale data processing.
  • Hadoop Common: Contains libraries and utilities needed by other Hadoop modules.
  • Hadoop YARN (Yet Another Resource Negotiator): Responsible for managing computing resources in clusters and running the analysis.

Hadoop is also used to support advanced analytics initiatives; hence, it is used and supported by big data vendors such as IBM, Microsoft, Cloudera, Pivotal Software, Hadapt, Amazon Web Services, and many more.

Key Features of Hadoop

Some of the main features of Hadoop are listed below:

  • Highly Scalable: Users can add any number of nodes to the Hadoop cluster (horizontal scaling) or increase the capacity of nodes (vertical scaling) to achieve high computation power.
  • Fault-Tolerant: Hadoop uses a data replication mechanism to avoid data loss in the event of a disaster or system failure, and it keeps data available at all times by serving it from replica nodes.
  • Easy to Use: Hadoop automatically distributes data and computation across the cluster, and most of the processing is managed by the framework itself, which makes Hadoop easy to use.

To know more about Hadoop, click here.

Introduction to Snowflake


Snowflake is a Software-as-a-Service (SaaS) offering that provides cloud-based storage and analytics. Its cloud data warehouse is built on Amazon Web Services, Microsoft Azure, and Google Cloud infrastructure, providing a platform for storing and retrieving data. Snowflake has a unique architecture that separates its storage layer from its compute layer, so each scales independently and customers can use and pay for them separately.

With Snowflake, there is no hardware or software to select, install, configure, or manage, making it ideal for organizations that do not want to dedicate resources to setting up, maintaining, and supporting in-house servers.

Because Snowflake decouples the storage and compute functions, you can run an unlimited number of concurrent workloads against the same single copy of data without interfering with the performance of other users. This makes Snowflake highly scalable and flexible with big data, and its data sharing functionality makes it easy for organizations to quickly share and secure data in real time.

Key Features of Snowflake

Some of the main features of Snowflake are listed below:

  • Scalability: Snowflake's multi-cluster architecture separates storage and compute, which allows users to easily scale each up and down independently as per their needs.
  • SQL Support: Snowflake lets users query data using standard and extended SQL commands, so data is easy to access. It supports DDL, DML, transactions, lateral views, and more (a short example follows this list).
  • Near-Zero Administration: Snowflake allows companies to easily set up and manage Data Warehouses without significant involvement from DBA or IT teams.
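As a brief, hypothetical illustration of that SQL support, the snippet below mixes standard DDL and DML with one of Snowflake's extensions, a lateral view (LATERAL FLATTEN) over semi-structured data; the table and column names are invented for the example.

-- Standard DDL: a table with a semi-structured VARIANT column.
CREATE TABLE orders (id INT, items VARIANT);

-- Standard DML combined with Snowflake's PARSE_JSON function.
INSERT INTO orders
  SELECT 1, PARSE_JSON('[{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]');

-- Extended SQL: a lateral view that expands the JSON array into rows.
SELECT o.id,
       f.value:sku::STRING AS sku,
       f.value:qty::INT    AS qty
FROM orders o,
     LATERAL FLATTEN(input => o.items) f;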

To know more about Snowflake, click here.

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services, and simplifies the ETL process. It supports 100+ data sources, and loading data is a 3-step process: select the data source, provide valid credentials, and choose the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without your having to write a single line of code.

Get Started with Hevo for Free

Its completely automated pipeline delivers data in real time from source to destination without any loss. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  1. Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  2. Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  3. Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  4. Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  5. Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  6. Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
  7. Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

Steps to Set Up Hadoop Snowflake Integration


This section explains how to use the Hive Metastore connector for Hadoop Snowflake Integration, which connects an Apache Hive Metastore to Snowflake by means of external tables.

The Hive connector detects subsequent Metastore events and transmits them to Snowflake, keeping the external tables synchronized with the Hive Metastore.

To use this connector, the Apache Hive Metastore must be integrated with cloud storage on Amazon Web Services, Google Cloud Platform, or Microsoft Azure. The steps to integrate Hadoop and Snowflake are listed below:

Step 1: Installing and Configuring the Hive Metastore Connector

Before installing the Hive Metastore connector for Hadoop Snowflake Integration, make sure you meet the following requirements:

  • Have a Snowflake database and schemas to store external tables that map to Hive tables in the Metastore.
  • A designated Snowflake user; the connector will be configured to execute operations on the external tables as this user.
  • A storage integration to secure access to external cloud storage without passing explicit cloud provider credentials such as secret keys or tokens.
  • A role granted to the designated Snowflake user that includes the required object privileges on the database, schema, and storage integration, such as USAGE, CREATE STAGE, and CREATE EXTERNAL TABLE (see the SQL sketch after this list).

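The following Snowflake SQL is a minimal sketch of how these prerequisites might be created, assuming Amazon S3 as the external storage. The object names (mydb, myschema, custom_role1, jsmith, s3_int) mirror the sample configuration file shown later in this article; the IAM role ARN and bucket path are placeholders you would replace with your own.

-- Database and schema that will hold the external tables.
CREATE DATABASE IF NOT EXISTS mydb;
CREATE SCHEMA IF NOT EXISTS mydb.myschema;

-- Role the connector will act through via the designated user.
CREATE ROLE IF NOT EXISTS custom_role1;

-- Storage integration for secure access to the bucket backing the Hive tables
-- (the ARN and bucket path below are placeholders).
CREATE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/hive_connector_role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://mybucket/mypath/');

-- Object privileges the connector needs, granted to the role.
GRANT USAGE ON DATABASE mydb TO ROLE custom_role1;
GRANT USAGE, CREATE STAGE, CREATE EXTERNAL TABLE ON SCHEMA mydb.myschema TO ROLE custom_role1;
GRANT USAGE ON INTEGRATION s3_int TO ROLE custom_role1;

-- Assign the role to the designated Snowflake user.
GRANT ROLE custom_role1 TO USER jsmith;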
Having met the prerequisites stated above for Hadoop Snowflake Integration, do the following to install the connector:

  • Download the connector JAR file and the configuration XML file from the Maven Central Repository: http://search.maven.org/ or https://repo1.maven.org/maven2/net/snowflake/snowflake-hive-metastore-connector/
  • Copy the JAR file to the following directory: for Amazon S3 or Google Cloud Storage, the lib directory in the Hive classpath; for Microsoft Azure HDInsight, the hive directory in the user directory, for example, /usr/hdp/<hdinsight_version>/atlas/hook/hive/
  • Create a file named snowflake-config.xml in the conf directory in the Hive classpath for Amazon S3 or Google Cloud Storage, or in the conf/conf.server directory in the Hive classpath for Microsoft Azure HDInsight.
  • Open the snowflake-config.xml file in a text editor and populate it with the <name> properties and their corresponding <value> entries as follows: 
    1. snowflake.jdbc.username is used to specify the login name of the Snowflake user designated for refresh operations on the external tables.
    2. snowflake.jdbc.password specifies the password for the login name.
    3. snowflake.jdbc.account specifies the name of your account provided by Snowflake. 
    4. snowflake.jdbc.db specifies an existing Snowflake database to use for the Hive Metastore integration. 
    5. snowflake.jdbc.schema specifies an existing Snowflake schema in the specified database.  
    6. snowflake.jdbc.role specifies the access control role that the Hive connector should use. The role should be an existing role that has already been granted to the specified user. If no role is specified here, the Hive connector uses the default role for the specified user.
    7. snowflake.jdbc.connection specifies the connection string for your Snowflake account in the following format: jdbc:snowflake://<account_identifier>.snowflakecomputing.com
    8. snowflake.hive-metastore-connector.integration specifies the name of the storage integration object to use for secure access to the external storage locations referenced in Hive tables in the Metastore.
    9. snowflake.hive-metastore-listener.schemas specifies a comma-separated list of Snowflake schemas that exist in the Snowflake database specified in snowflake.jdbc.db.
  • Below is a sample snowflake-config.xml file showing the properties listed above:
<configuration>
  <property>
    <name>snowflake.jdbc.username</name>
    <value>jsmith</value>
  </property>
  <property>
    <name>snowflake.jdbc.password</name>
    <value>mySecurePassword</value>
  </property>
  <property>
    <name>snowflake.jdbc.role</name>
    <value>custom_role1</value>
  </property>
  <property>
    <name>snowflake.jdbc.account</name>
    <value>myaccount</value>
  </property>
  <property>
    <name>snowflake.jdbc.db</name>
    <value>mydb</value>
  </property>
  <property>
    <name>snowflake.jdbc.schema</name>
    <value>myschema</value>
  </property>
  <property>
    <name>snowflake.jdbc.connection</name>
    <value>jdbc:snowflake://myaccount.snowflakecomputing.com</value>
  </property>
  <property>
    <name>snowflake.hive-metastore-listener.integration</name>
    <value>s3_int</value>
  </property>
  <property>
    <name>snowflake.hive-metastore-listener.schemas</name>
    <value>myschema1,myschema2</value>
  </property>
</configuration>
  • Save the changes to the file. 
  • Edit the existing Hive configuration file, hive-site.xml, in a text editor to register the connector as a Metastore event listener, as shown below.
<configuration>
 ...
 <property>
  <name>hive.metastore.event.listeners</name>
  <value>net.snowflake.hivemetastoreconnector.SnowflakeHiveListener</value>
 </property>
</configuration>
  • Save the changes to the file.
  • Restart the Hive Metastore service.

Step 2: Validating the Hadoop Snowflake Installation

To validate the installation for Hadoop Snowflake Integration follow the steps listed below:

  • Create a new table in Hive.
  • Query the list of external tables in your Snowflake database and schema using SHOW EXTERNAL TABLES as shown below: 
SHOW EXTERNAL TABLES IN <database>.<schema>;

Here, database and schema are the database and schema specified in the snowflake-config.xml file you created when installing the connector. A minimal end-to-end sketch of this validation follows.
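As a hedged, end-to-end sketch, assuming a Hive database named myschema1 that maps to the Snowflake schema of the same name listed in the sample configuration, and with invented table, column, and bucket names, the validation could look like this:

-- In Hive: create a table over data in the cloud storage location covered by
-- the storage integration (use the URI scheme your cluster expects, e.g. s3:// or s3a://).
CREATE EXTERNAL TABLE myschema1.sales (
  id        BIGINT,
  amount    DOUBLE,
  sale_date STRING
)
STORED AS PARQUET
LOCATION 's3://mybucket/mypath/sales/';

-- In Snowflake: the connector should have created a matching external table.
SHOW EXTERNAL TABLES IN mydb.myschema1;

-- Query the synchronized external table directly from Snowflake.
SELECT COUNT(*) FROM mydb.myschema1.sales;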

The results should show an external table with the same name as the new Hive table. Queries executed by the connector can be viewed in the output of the Snowflake QUERY_HISTORY view/table function (see the example below), and the connector's log records are written to the Hive Metastore logs.
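For example, one way to inspect the connector's activity from Snowflake is to filter the query history by the designated user (jsmith here, matching the sample configuration); this is only an illustrative query:

-- Most recent queries run by the connector's designated user.
SELECT query_text, start_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_USER(USER_NAME => 'jsmith'))
ORDER BY start_time DESC
LIMIT 20;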

For more information about installing the Hive connector for Hadoop Snowflake Integration, validating the connection, and further reading about connecting Hadoop and Snowflake, visit the official documentation here.

Conclusion

In this article, you learnt about Hadoop and Snowflake and how to set up Hadoop Snowflake Integration so that you can enjoy the benefits of both. It provided instructions on carrying out the integration using the Hive Metastore connector, which connects an Apache Hive Metastore to Snowflake using external tables. It also showed how to test the connection and noted that, with this connection in place, you can manage your schema in Hive while querying it from Snowflake.

The processes and procedures mentioned above for Hadoop Snowflake Integration can be confusing and time-consuming for a beginner; hence, the need for a simpler way to achieve the integration. Hevo Data is a No-code Data Pipeline that can help you transfer data from any data source to Snowflake. It fully automates the process of loading and transforming data from 100+ sources to a destination of your choice without writing a single line of code.

Want to take Hevo for a spin? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Share your experience of learning about Hadoop Snowflake Integration in the comments section below!
