Hive can organize every task quickly and assign it based on a number of different attributes, such as assignee, date, budget, attached files, and so on. You can also derive insights from Hive, making it easier for you to react to any discrepancies in project management. But to get the full benefit of that data, you need to integrate it into a data warehouse like BigQuery. 

How do you do that? In this blog, I will walk you through the methods for data replication from Hive to BigQuery. You will also understand the limitations and benefits of the methods.

Let’s get started!

Method 1: Using CSV Files to Connect Hive to BigQuery

The method comprises three steps.

Step 1: Export Data into CSV Files

How you export data from Hive to CSV depends on the version of Hive you are running. Take a look at the two approaches below.

For Hive version 11 or higher, use the following command:

INSERT OVERWRITE LOCAL DIRECTORY '/home/hirw/sales' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM sales_table;

Here, ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' dictates that the columns should be delimited by a comma.

For Hive versions older than 11, writing the result of a SELECT on the Hive table to a file produces a tab-separated file by default, which is not what you want, since you need a comma-separated file:

hive -e 'select * from sales_table' > /home/hirw/sales.tsv

With the command below, you can select from the table and pipe the results to sed, passing it a regex expression:

hive -e 'select * from sales_table' | sed 's/[\t]/,/g' > /home/hirw/sales.csv

The regex matches every tab character ([\t]) globally and replaces it with a comma.
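As a quick sanity check, you can inspect the exported output before moving on. The file name 000000_0 below is only the typical name Hive gives the first output file of an INSERT OVERWRITE LOCAL DIRECTORY statement; the exact file names on your cluster may differ:

# Preview the first rows and count the exported records (paths are placeholders).
head -5 /home/hirw/sales/000000_0
wc -l /home/hirw/sales/000000_0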

Step 2: Migrate Data to Google Cloud Storage (GCS)

Make sure the file format and encoding are compatible with your storage system before importing a CSV file into Google Cloud Storage. Incompatible files may not be imported correctly and may result in data loss.

There are several ways to upload a CSV file to Google Cloud Storage. The simplest is the gsutil command-line tool. You must first create a bucket in Google Cloud Storage; you can then copy the CSV file into the bucket with gsutil cp.

For example, the following command uploads myfile.csv into the mybucket bucket:

gsutil cp myfile.csv gs://mybucket/
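If the bucket does not exist yet, a minimal end-to-end sketch of this step looks like the following. The bucket name, region, and local file path are placeholders rather than values from the original setup:

# Create the bucket once, upload the CSV, then verify the upload.
gsutil mb -l us-central1 gs://mybucket/
gsutil cp /home/hirw/sales.csv gs://mybucket/sales/sales.csv
gsutil ls -l gs://mybucket/sales/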

Step 3: Import the CSV Data on GCS into BigQuery

The final step is to run a load job in BigQuery that imports the CSV data from GCS into a BigQuery table.

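You can run the load job from the BigQuery console or with the bq command-line tool. Below is a minimal sketch of the command-line route; the dataset, table, and bucket names are placeholders, not values from the original post:

# Load the exported CSV into BigQuery, letting BigQuery infer the schema.
# Because the Hive export has no header row, you may prefer to pass an explicit
# schema (for example: id:INTEGER,amount:FLOAT,region:STRING) instead of --autodetect.
bq load --source_format=CSV --autodetect your_dataset.sales_table gs://mybucket/sales/sales.csv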

The method is not complicated, right? But there are some limitations to using this method for Hive to BigQuery migration:

  • If your dataset’s location is set to a value other than the US multi-region, the Cloud Storage bucket must be in the same region as the dataset or contained in the same multi-region (see the sketch after this list for a quick way to check both locations).
  • With external data sources, BigQuery does not guarantee data consistency. If the underlying data changes while a query is running, it can cause unexpected results.
  • BigQuery does not support Cloud Storage object versioning. The load job fails if the Cloud Storage URI includes a generation number.
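A quick way to confirm that the bucket and the dataset are colocated is to check both locations from the command line. The project, dataset, and bucket names below are placeholders:

# Shows the dataset's metadata; look for the "location" field.
bq show --format=prettyjson your_project:your_dataset

# Shows the bucket's metadata; look for "Location constraint".
gsutil ls -L -b gs://mybucket/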

That’s about it. Now, let’s get into the next method which is building a data pipeline to replicate data from Hive to BigQuery.

Method 2: Build a Data Pipeline to Connect Hive to BigQuery

This one is for those who would like to build something on their own.

Let’s see an example: 

Suppose Hive is running on an on-premise Hadoop cluster. The following outline describes a data pipeline that moves data from Hive to BigQuery. 

for each source_hive_table {

  • INSERT OVERWRITE TABLE target_avro_hive_table SELECT * FROM source_hive_table;
  • Move the resulting Avro files into GCS using distcp.
  • Create the initial BigQuery table: bq load --source_format=AVRO your_dataset.something something.avro
  • Select from the newly created table and manually resolve any casting issues in BigQuery.

}
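A minimal shell sketch of that loop is shown below. The table list, dataset name, bucket path, and HDFS warehouse path are placeholders for illustration, and the sketch assumes the hive, hadoop (with the GCS connector configured), and bq command-line tools are installed and authenticated:

#!/usr/bin/env bash
set -euo pipefail

TABLES="sales_table orders_table"          # Hive tables to migrate (placeholder list)
DATASET="your_dataset"                     # target BigQuery dataset (placeholder)
BUCKET="gs://mybucket/hive_export"         # GCS staging location (placeholder)

for TABLE in $TABLES; do
  # 1. Re-materialize the table as Avro inside Hive.
  hive -e "DROP TABLE IF EXISTS ${TABLE}_avro; CREATE TABLE ${TABLE}_avro STORED AS AVRO AS SELECT * FROM ${TABLE};"

  # 2. Copy the Avro files from HDFS to GCS with distcp (assumes the default warehouse path).
  hadoop distcp "/user/hive/warehouse/${TABLE}_avro" "${BUCKET}/${TABLE}/"

  # 3. Load the Avro files into a BigQuery table of the same name.
  bq load --source_format=AVRO "${DATASET}.${TABLE}" "${BUCKET}/${TABLE}/*.avro"
done

After the load, you would still review the resulting tables and manually fix any type-casting issues, as the outline above notes.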

But the example shown here is for a one-time data migration. It won’t help you with continuous replication of large volumes of data. Even if you succeed in building a fully-fledged data pipeline, there are a few more limitations to this method. Let’s take a quick look at them:

  • Connecting data and applications across your tech stack requires specific knowledge and skills, including SQL, Python, and Java.
  • Every time you develop a specialized solution that calls for a specific skill set, you risk a long setup time, especially when the Hive pipelines need to keep running after the people who built them leave the team or the business.
  • Conventional data pipelines are expensive to set up and maintain.
  • There is a lack of automation, standardization, and monitorability, which makes consuming data and extracting insights labor-intensive. By building a complex data stack managed by a dedicated team, you rely on people to do all the work necessary to facilitate insights.

After going through these limitations, do you still feel like opting for this method? Or…

Method 3: Use a Fully Automated Data Pipeline


The benefits of using a fully automated data pipeline for Hive to BigQuery migration are:

  • You will get central access to all your data from Hive and other sources: You can access data from Hive with a ready-to-use data pipeline and replicate it to BigQuery. This will help you get central access to all your data.
  • Automate your data workflows from Hive to BigQuery: An automated data pipeline lets you stop extracting data manually and automates your Hive to BigQuery integration without any coding. The pipeline provider maintains all pipelines for you and covers all API changes.
  • Enable data-driven decision-making: You will be able to empower everyone in your company with consistent and standardized data by connecting Hive to BigQuery. This will also automate data delivery and measure KPIs across systems.

The benefits of using an automated data pipeline to connect Hive to BigQuery are appealing, right? Companies like Hevo Data can help you with this. What are the steps involved, you ask?

Step 1: Configure Hive as a source

Configuring Hive Source

Step 2: Configure BigQuery as a destination

Configuring BigQuery as a Destination

Next, let’s get into another important section about how Hive to BigQuery migration can help your business. 

What Can You Achieve by Replicating Data from Hive to BigQuery?

Advantages of replicating data from Hive to BigQuery include:

  • Using information from Hive, you can create a single customer view to assess the efficiency of your teams and projects.
  • You will get deeper customer insights to understand the customer journey, which offers knowledge you can put to use at various points in the sales funnel.
  • You can boost customer satisfaction by analyzing channel interactions with your clients, and combine this data with consumer touchpoints from other channels to arrive at decisions.

It’s time to wrap up!

Conclusion

Data replication from Hive to BigQuery helps businesses in many ways, from analyzing the efficiency of your teams to gaining more in-depth customer insights from your data sources. There are three ways to approach Hive to BigQuery integration. The first method uses CSV files: you export the data from Hive to a CSV file, upload it to GCS, and finally load it into BigQuery. 

The second method is to build a fully-fledged data pipeline yourself, and the third is to use an automated data pipeline. There are pros and cons to each of these methods. You need to understand why you need to replicate data, compare the methods, and arrive at a decision. 

You can enjoy a smooth ride with Hevo Data’s 150+ plug-and-play integrations (including 40+ free sources). Hevo Data helps thousands of customers make data-driven decisions through its no-code data pipeline solution.

Visit our Website to Explore Hevo

Offering 150+ plug-and-play integrations and saving countless hours of manual data cleaning and standardizing, Hevo Data also offers built-in pre-load data transformations that you can set up in minutes via a simple drag-and-drop interface or your custom Python scripts. 

Want to take Hevo for a spin? Sign Up for a 14-day free trial and simplify your data integration process. Check out the pricing details to understand which plan fulfills all your business needs.

