Hive can organize every task quickly and assign it based on different preferences, such as assignee, date, budget, attached files, and more. You can also derive insights from Hive, making it easier for you to react to any discrepancies in project management. However, to benefit from this data, you need to integrate it into a data warehouse like BigQuery.

How do you do that? In this blog, I will walk you through the methods for data replication from Hive to BigQuery. You will also understand the limitations and benefits of the methods.

Let’s get started!

Explore Three Methods To Migrate Data From Hive To BigQuery

Method 1: Using CSV Files to Connect Hive To BigQuery
This method involves exporting data in Hive as CSV Files and then importing those CSV files into BigQuery.

Method 2: Build A Data Pipeline To Connect Hive To BigQuery
This method is better suited for technically skilled people because it involves building a data pipeline manually with scripts and codes. It is time-consuming and more prone to errors.

Method 3: Use A Fully Automated Data Pipeline – Hevo
Using an automated data pipeline solution like Hevo would be efficient and save time for your organization. It has features such as auto schema mapping and historical and incremental data loading.

Get Started with Hevo for Free

Method 1: Using CSV Files to Connect Hive to BigQuery

This method comprises three steps.

Step 1: Export Data into CSV Files

This step is carried out in one of two ways, depending on your Hive version.

For Hive version 11 or higher

Use the following command:

INSERT OVERWRITE LOCAL DIRECTORY '/home/hirw/sales' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM sales_table;

Here, ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' dictates that the columns should be delimited by a comma.

For Hive versions older than 11

By default, writing the result of a SELECT on a Hive table to a file produces a tab-separated file, which is not what you want, since you need a comma-separated file:

hive -e 'select * from sales_table' > /home/hirw/sales.tsv

Instead, you can select the table and pipe the results to sed, passing it a regex expression:

hive -e 'select * from sales_table' | sed 's/[\t]/,/g' > /home/hirw/sales.csv

The regex expression matches every tab character ([\t]) globally and replaces it with a ','.

Step 2: Migrate Data to Google Cloud Storage (GCS)

  • Make sure the file format and encoding are compatible with your storage system before importing a CSV file into Google Cloud Storage. Incompatible files may not be imported correctly and may result in data loss.
  • There are several ways to upload a CSV file to Google Cloud Storage. The simplest is the gsutil command-line tool. Before uploading, you must first create a bucket in Google Cloud Storage.
  • The CSV file can then be copied into the bucket using the gsutil cp command.
  • For example, the following commands create a bucket named mybucket and upload myfile.csv into it:

gsutil mb gs://mybucket
gsutil cp myfile.csv gs://mybucket/

Step 3: Import the CSV Data on GCS into BigQuery

The next step is to import the CSV data on GCS into BigQuery.
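
Assuming the CSV now sits at a path like gs://mybucket/myfile.csv (the bucket, dataset, and table names below are only examples), a minimal bq command-line sketch of this step looks like this:

bq mk mydataset
bq load --source_format=CSV --autodetect mydataset.sales gs://mybucket/myfile.csv

Note that a CSV exported from Hive in Step 1 has no header row, so instead of --autodetect you may prefer to pass an explicit schema such as id:INTEGER,amount:FLOAT,region:STRING (these column names are placeholders).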

The method is not complicated, right? But there are some limitations to using it for a Hive to BigQuery migration:

  • The Cloud Storage bucket must be co-located with the dataset: it must be in the same region as the dataset or contained in the same multi-region. This condition applies when your dataset’s location is set to a value other than the US multi-region.
  • With external data sources, BigQuery does not guarantee data consistency. While a query is running, changes to the underlying data can cause unexpected outcomes.
  • BigQuery does not support Cloud Storage object versioning. If the Cloud Storage URI includes a generation number, the load job fails.

That’s about it. Now, let’s get into the next method which is building a data pipeline to replicate data from Hive to BigQuery.


Method 2: Build a Data Pipeline to Connect Hive to BigQuery

This one is for those who would like to get hands-on and build something on their own.

Let’s see an example: 

Suppose Hive is running on an on-premise Hadoop cluster. The following outline describes a data pipeline that connects Hive to BigQuery; a minimal shell sketch follows it.

for each table source_hive_table {

  • INSERT OVERWRITE TABLE target_avro_hive_table SELECT * FROM source_hive_table;
  • Move the resulting Avro files into GCS using distcp.
  • Create a first BigQuery table: bq load --source_format=AVRO your_dataset.something something.avro
  • Select from the newly created table and manually resolve any casting issues on the BigQuery side.

}
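
To make the outline concrete, here is a minimal shell sketch of such a loop, assuming the hive, hadoop, and bq CLIs are available on the cluster and that the GCS connector is configured so distcp can write to gs:// paths. The bucket, dataset, table names, and the Avro staging tables are hypothetical.

#!/usr/bin/env bash
# Hypothetical one-off migration loop: stage each Hive table as Avro,
# copy the files to GCS, then load them into BigQuery.
set -euo pipefail

BUCKET="gs://my-migration-bucket"   # assumed GCS bucket
DATASET="my_dataset"                # assumed BigQuery dataset

for TABLE in sales_table orders_table; do   # assumed list of Hive tables
  # Stage the table into a pre-created Avro-backed staging table
  hive -e "INSERT OVERWRITE TABLE ${TABLE}_avro SELECT * FROM ${TABLE};"
  # Copy the resulting Avro files from HDFS to Cloud Storage
  hadoop distcp "/user/hive/warehouse/${TABLE}_avro" "${BUCKET}/${TABLE}/"
  # Load the Avro files into BigQuery
  bq load --source_format=AVRO "${DATASET}.${TABLE}" "${BUCKET}/${TABLE}/*.avro"
done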

But the example shown is for a one-time data migration. It won’t help you replicate large volumes of data continuously over time. Even if you succeed in building a fully-fledged data pipeline, this method has a few more limitations.

Limitations of Building a Data Pipeline:

  • The entire process takes specific knowledge and skills, including SQL, Python, and Java.
  • The procedure can take a long time to set up.
  • The setup and upkeep of traditional data pipelines are expensive.
  • There is a lack of standardization, automation, and monitorability.

After going through these limitations, do you still feel like opting for this method? Or…

Method 3: Use a Fully Automated Data Pipeline

The benefits of using a fully automated data pipeline for Hive to BigQuery migration are:

  • You will get central access to all your data from Hive and other sources: You can access data from Hive with a ready-to-use data pipeline and replicate it to BigQuery. This will help you get central access to all your data.
  • Automate your data workflows from Hive to BigQuery: An automated data pipeline will help you stop extracting data manually and automate your Hive to BigQuery integration without any coding. The provider maintains all pipelines for you and covers any API changes.
  • Enable data-driven decision-making: You will be able to empower everyone in your company with consistent and standardized data by connecting Hive to BigQuery. This will also automate data delivery and measure KPIs across systems.

The benefits of using an automated data pipeline to connect Hive to BigQuery are appealing, right? Companies like Hevo Data can help you with this. What are the steps involved, you ask?

Step 1: Configure Hive as a source

Configuring Hive Source
  • Enter the API Key and pipeline name.
  • Enter the user ID and workplace ID.
  • Test connection.

Step 2: Configure BigQuery as a destination

Configuring BigQuery as a Destination
  • Enter your Destination name and account.
  • Choose Advanced settings as needed.
  • Test and Continue.

Next, let’s get into another important section about how Hive BigQuery migration can help your business. 

What Can you Achieve by Replicating Data from Hive to BigQuery?

Advantages of replicating data from Hive to BigQuery include:

  • Using information from Hive, you can create a single customer view and assess the efficiency of your teams and projects.
  • You will get deeper customer insights to understand the customer journey, offering knowledge that can be put to use at various points in the sales funnel.
  • You can boost customer satisfaction by analyzing channel interactions with clients, and combine this data with consumer touchpoints from other channels to make decisions.

It’s time to wrap up!

Conclusion

There are three ways to integrate Hive into BigQuery.

  • The first method is using CSV files. You need to export the data from Hive to a CSV file, then to GCS, and finally to BigQuery.
  • The second method is to build a fully-fledged data pipeline.
  • And the third method is to use an automated data pipeline. Each method has its pros and cons. You need to understand why you need to replicate data, compare the methods, and arrive at a decision.

You can enjoy a smooth ride with Hevo Data’s 150+ plug-and-play integrations (including 60+ free sources). Hevo Data is helping thousands of customers make data-driven decisions through its no-code data pipeline solution.

Check out our 14-day free trial and simplify your data integration process. Check out the pricing details to understand which plan fulfills all your business needs.

FAQs

1) What is the difference between Hive and BigQuery?

Apache Hive and BigQuery have different data type systems. Hive supports more implicit type casting than BigQuery.
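
As a quick illustration of the difference (the literals below are just examples): Hive implicitly casts a string to a numeric type in arithmetic, while BigQuery requires an explicit CAST.

# Hive: the string literal is implicitly converted to DOUBLE, so this returns 5.0
hive -e "SELECT '2' + 3"
# BigQuery: the same expression raises a type error, so an explicit CAST is needed
bq query --use_legacy_sql=false "SELECT CAST('2' AS INT64) + 3"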

2) How do I transfer data to BigQuery?

To transfer data from a CSV to BigQuery:
1. On the Create table page, select a data source and use the upload option.
2. Select the file and file format.
3. Define the destination and specify the name of the project and the dataset.
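
If you prefer the command line to the console, the same upload can be done with bq load. A minimal sketch, assuming a local file myfile.csv with a header row and hypothetical dataset and table names:

bq load --source_format=CSV --skip_leading_rows=1 --autodetect mydataset.mytable ./myfile.csv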

3) What is the Google equivalent of Hive?

The Google equivalent of Hive is Google BigQuery. Other alternatives to Hive can be found among data warehouse solutions.

4) Why use Hive instead of SQL?

Traditional databases are designed for small to medium datasets and perform poorly on big datasets. Hive, on the other hand, uses batch processing, which makes it well suited to running complex queries on large datasets.

Anaswara Ramachandran
Content Marketing Specialist, Hevo Data

Anaswara is an engineer-turned-writer specializing in ML, AI, and data science content creation. As a Content Marketing Specialist at Hevo Data, she strategizes and executes content plans leveraging her expertise in data analysis, SEO, and BI tools. Anaswara adeptly utilizes tools like Google Analytics, SEMrush, and Power BI to deliver data-driven insights that power strategic marketing campaigns.