BigQuery Parquet Integration: 2 Easy Ways to Load Data

Published: December 17, 2021


BigQuery is Google’s Data Warehouse with built-in Geographic Data Intake, Storage, and Analysis tools. It uses standard SQL queries to work with complex data and examine massive datasets. Apache Parquet is a free, open-source, column-oriented data storage format that bears a striking resemblance to other columnar storage file formats available in Hadoop, such as ORC and RCFile. Parquet is compatible with the majority of Data Processing Frameworks in the Hadoop environment.

This blog talks about the steps you can follow for Parquet BigQuery Data Migration. It also gives a brief introduction to Google BigQuery and Parquet, along with the primary reasons why Parquet is considered superior to other data formats, before diving into BigQuery Parquet Integration.


What is Google BigQuery?


Google BigQuery was developed as a flexible, fast, and powerful Data Warehouse that’s tightly integrated with the other services offered by the Google Cloud Platform. It offers user-based pricing, is cost-efficient, and has a Serverless Model. Google BigQuery’s Analytics and Data Warehouse platform leverages a built-in Query Engine on top of the Serverless Model that allows it to process terabytes of data in seconds.

With Google BigQuery, you can run analytics at scale with anywhere between 26% to 34% lower Three-year TCO than other Cloud Data Warehouse alternatives. Since there is no infrastructure to manage or set up, you can focus on finding meaningful insights with the help of Standard SQL and flexible pricing models ranging across Flat-rate and On-demand options.

Google BigQuery’s column-based Storage service is what gives the Data Warehouse its speed and its ability to handle huge volumes of data. Since column-based storage allows you to process only the columns of interest, it enables you to obtain faster answers and use resources more efficiently. This is why, for analytical databases, it is more beneficial to store data by column.
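As a quick illustration (a minimal sketch; the table name below is a placeholder), the Google BigQuery Python client library can run a standard SQL query that reads only the two columns it needs rather than entire rows:

from google.cloud import bigquery

client = bigquery.Client()
# Placeholder table; replace with your own project, dataset, and table.
query = """
    SELECT name, post_abbr
    FROM `your-project.your_dataset.us_states`
    LIMIT 10
"""
# Because BigQuery stores data by column, only the name and post_abbr
# columns are scanned for this query.
for row in client.query(query).result():
    print(row["name"], row["post_abbr"])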


Key Features of Google BigQuery

Here are a few key features of Google BigQuery:

  • Serverless Service: Generally, in a Data Warehouse environment, organizations need to commit to and specify the server hardware on which computations will run. Administrators then have to provision for performance, reliability, elasticity, and security. A Serverless Model helps overcome this constraint: the processing is automatically distributed across a large number of machines working in parallel. By using Google BigQuery’s Serverless Model, Database Administrators and Data Engineers spend less time on infrastructure and provisioning servers, and more time gaining valuable insights from data.
  • SQL and Programming Language Support: Users can access Google BigQuery through Standard SQL. Apart from this, Google BigQuery also has client libraries for writing applications that access data using Python, C#, Java, PHP, Node.js, Ruby, and Go.
  • Tree Architecture: Google BigQuery and Dremel can easily scale to thousands of machines by structuring computations as an Execution Tree. A Root Server receives incoming queries and relays them to branches, called Mixers. These branches can then modify the incoming queries and deliver them to Leaf Nodes, also known as Slots. The Leaf Nodes take care of reading and filtering the data while working in parallel. The results are then passed back up the tree, with the Mixers aggregating them and finally sending them to the Root Server as the answer to the query.
  • Security: Data in Google BigQuery is automatically encrypted both in transit and at rest. Google BigQuery can also isolate jobs and handle security for multi-tenant activity. Since Google BigQuery is integrated with the security features of other GCP products, organizations can take a holistic view of Data Security. It also allows users to share datasets using Google Cloud Identity and Access Management (IAM). Administrators can establish permissions for individuals and groups to access tables, views, and datasets.

What is Parquet?


As rapid advancements characterize modern Data Analytics and its associated architectures, Time to Market has become a key metric for gauging business success. In the domain of Analytics, Data Architecture plays a key role in how datasets are arranged to achieve efficient query results.

Parquet, an open-source file format in the Hadoop ecosystem, gives users the flexibility to arrange datasets in Column-Oriented storage to achieve optimized Data Compression Rates and Encoding Structures. Parquet can enhance platform performance when handling data in bulk, as Column-Based storage lets queries skip non-relevant data very quickly. Technically, Apache Parquet is built on a record shredding and assembly algorithm, which performs far better than the simple flattening of nested namespaces.
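To make the columnar idea concrete, here is a minimal sketch using the pyarrow library (not covered elsewhere in this post; the file name and data are placeholders). It writes a tiny dataset to a Parquet file and then reads back just one column instead of whole rows:

import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table and write it out as Parquet.
table = pa.table({
    "name": ["Washington", "Oregon", "California"],
    "post_abbr": ["WA", "OR", "CA"],
})
pq.write_table(table, "us_states.parquet")

# Because Parquet is column-oriented, a reader can pull only the columns
# it needs and skip the rest of the file.
names_only = pq.read_table("us_states.parquet", columns=["name"])
print(names_only.to_pydict())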

Why is Parquet Considered better than other Data Formats?

Here are a few key functionalities of Parquet that allow it to stand taller than other Data Formats:

  • Compatibility: Apache Parquet is an open-source, free-to-use format that is seamlessly compatible with the majority of Hadoop Data Processing Architectures.
  • Columnar File Format: Unlike CSV files, where data is arranged in rows, a Parquet file consists of a header, row groups, and a footer; each row group stores columns of data along with their metadata. This arrangement makes Parquet a self-describing format that is well-optimized for quick query fetching and high-performance benchmarks.
  • Compression: Parquet provides users the functionality to compress data files to minimize both the processing load and the storage requirements. Because compression and encoding are performed column by column, different sorts of encoding can be used for string and integer data types. Parquet supports encoding schemes such as Dictionary Encoding, Bit Packing, and Run-Length Encoding, as shown in the short sketch after this list.
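As a short sketch of the column-by-column compression and encoding options (again using pyarrow; the file name and data are placeholders), the snippet below writes a file with Snappy compression, one of the codecs Parquet supports, and dictionary encoding for the repetitive string column:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "state": ["Washington", "Washington", "Washington", "Oregon"],
    "population": [7700000, 7700000, 7700000, 4200000],
})

# Compression and encoding are applied per column: Snappy compression for
# the file, with dictionary encoding enabled only for the "state" column.
pq.write_table(
    table,
    "us_states_compressed.parquet",
    compression="snappy",
    use_dictionary=["state"],
)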

What are the types of Data that can be Imported into Google BigQuery?

You can easily load data into Google BigQuery from a local file as a batch operation or via Cloud Storage. These are the different source data formats that are compatible with Google BigQuery:

  • ORC
  • JSON (newline-delimited)
  • Avro
  • Comma-Separated Values
  • Firestore exports stored in Cloud Storage
  • Parquet

Apart from these, you can also leverage BigQuery Data Transfer Service to set up repeating loads from Cloud Storage into Google BigQuery.

You need IAM permissions to load data into Google BigQuery partitions and tables or to run load jobs. Here are the IAM roles that grant the permissions you will need if you wish to load data into a BigQuery partition or table:

  • roles/bigquery.admin (includes the bigquery.jobs.create permission)
  • roles/bigquery.user (includes the bigquery.jobs.create permission)
  • roles/bigquery.dataEditor
  • roles/bigquery.jobUser (includes the bigquery.jobs.create permission)
  • roles/bigquery.dataOwner

Simplify Google BigQuery ETL using Hevo’s No-code Data Pipeline

Hevo Data is a No-code Data Pipeline that offers a fully managed solution to set up data integration from 100+ Data Sources (including 40+ Free Data Sources) and will let you directly load data to a Data Warehouse like Google BigQuery or a destination of your choice. It will automate your data flow in minutes without any need to write a line of code. Its fault-tolerant architecture makes sure that your data is secure and consistent. Hevo provides you with a truly efficient and fully automated solution to manage data in real time and always have analysis-ready data.

Get Started with Hevo for free

Let’s look at some of the salient features of Hevo:

  • Fully Managed: It requires no management and maintenance as Hevo is a fully automated platform.
  • Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to transfer.
  • Real-Time: Hevo offers real-time data migration. So, your data is always ready for analysis.
  • Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
  • Scalable Infrastructure: Hevo has in-built integrations for hundreds of sources that can help you scale your data infrastructure as required.
  • Live Monitoring: Advanced monitoring gives you a one-stop view to watch all the activities that occur within Data Pipelines.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Sign up here for a 14-day Free Trial!

How can you Load Data from Parquet to BigQuery?

Here are the steps you can follow for BigQuery Parquet Integration:

Google BigQuery Parquet Integration: Loading Parquet into a New BigQuery Table

There are a few different ways in which you can load Parquet data into a BigQuery table, as mentioned below:

  • Client Libraries
  • The Cloud Console
  • The bq command-line tool’s bq load command
  • The jobs.insert API method (by configuring a load job)

Using Google Cloud Console 

Here are the steps involved in this method of BigQuery Parquet Integration through the Cloud Console:

  • Step 1: In the Cloud Console, go to the Google BigQuery page. Then, from the Explorer panel, expand your project and choose a dataset for BigQuery Parquet Integration.
  • Step 2: Expand the Actions option (the three dots) and click Open. In the Details panel, click the Create table button (which looks like a plus sign).
  • Step 3: On the Create table page, go to the Source section. For the Create table from field, pick Cloud Storage to continue. Next, in the source field, you can either browse to or enter the Cloud Storage URI. A point to note here is that you can’t include multiple URIs in the Cloud Console, although wildcards are supported. The Cloud Storage Bucket for BigQuery Parquet Integration needs to be in the same location as the dataset that contains the table you’re creating. Choose Parquet for the File format.
  • Step 4: On the Create table page, you now need to set the field values for the Destination section. For the Dataset Name, pick the appropriate dataset. You also need to make sure that the Table type has been set to Native table. Finally, in the Table Name field, give the name of the table you’re creating in Google BigQuery for BigQuery Parquet Integration.
  • Step 5: In the Schema section, you don’t need to take any action since the schema is self-described in Parquet files. Click the Create table button to complete this method of BigQuery Parquet Integration.

Using Python

Here are the steps involved in this method of BigQuery Parquet Integration using Python:

  • Step 1: Follow the Python setup instructions provided in the BigQuery Quickstart Using Client Libraries for BigQuery Parquet Integration. Next, you need to use the Client.load_table_from_uri() method to start a load job from Cloud Storage.
  • Step 2: Next, to work with Parquet, you need to set the LoadJobConfig.source_format property to the string PARQUET. After this, pass the job configuration as the job_config argument to the load_table_from_uri() method for BigQuery Parquet Integration. The code snippet to load data from Parquet into a BigQuery table is as follows:
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
)
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"
load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.
load_job.result()  # Waits for the job to complete.
destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))

Google BigQuery Parquet Integration: Overwriting/Appending Parquet into an Existing Table

You can easily load additional data for BigQuery Parquet Integration either from source files or by appending query results. In the Cloud Console, you can use the Write Preference option to specify what action to take when you load data from either a query result or a source file. Here are the Cloud Console options you can choose from (the sketch after this list shows how they map to the Python client’s write_disposition values):

  • Append to Table
  • Write if Empty
  • Overwrite Table
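For reference, these three Cloud Console write preferences correspond to the write_disposition values in the Python client library. A minimal sketch (the table ID is a placeholder) that appends Parquet data from Cloud Storage to an existing table:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table_name"  # placeholder

# Append to Table  -> WRITE_APPEND
# Write if Empty   -> WRITE_EMPTY
# Overwrite Table  -> WRITE_TRUNCATE
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"
client.load_table_from_uri(uri, table_id, job_config=job_config).result()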

You can overwrite or append Parquet into an existing Google BigQuery table by using one of the following options:

  • Client Libraries
  • The Cloud Console
  • The bq command-line tool’s bq load command
  • The jobs.insert API method (by configuring a load job)

Using Google Cloud Console

Here are the steps involved in this method of BigQuery Parquet Integration through the Cloud Console:

  • Step 1: Open the Google BigQuery page in the Cloud Console. Next, in the Explorer panel, expand your project and choose a dataset for BigQuery Parquet Integration.
  • Step 2: Expand the Actions option (the three dots) and click Open. In the Details panel, click the Create table button (which looks like a plus sign).
  • Step 3: On the Create table page, go to the Source section. For the Create table from field, pick Cloud Storage to continue. Next, in the source field, you can either browse to or enter the Cloud Storage URI. A point to note here is that you can’t include multiple URIs in the Cloud Console, although wildcards are supported. The Cloud Storage bucket needs to be in the same location as the dataset that contains the table you’re creating. Choose Parquet for the File format.
  • Step 4: On the Create table page, you now need to set the field values for the Destination section. For the Dataset Name, pick the appropriate dataset. You also need to make sure that the Table type has been set to Native table. Finally, in the Table Name field, give the name of the table you’re creating in Google BigQuery.
  • Step 5: In the Schema section, you don’t need to take any action since the schema is self-described in Parquet files.
  • Step 6: For the Partition and cluster settings, you can work with the default values. However, a table cannot be converted into a clustered or partitioned table by overwriting or appending to it, and the Cloud Console doesn’t support overwriting clustered or partitioned tables in a load job.
  • Step 7: Head over to the Advanced options and choose Overwrite table or Append to table for the Write preference. As far as Encryption goes, click the Customer-managed key button to use a Cloud Key Management Service key, or leave the Google-managed key setting to let BigQuery encrypt the data at rest (see the sketch after these steps). For Unknown values, you can leave Ignore unknown values unchecked; this option applies only to CSV and JSON files.
  • Step 8: Click the Create table button to complete this step of BigQuery Parquet Integration.
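If you do pick the Customer-managed key option in Step 7, the equivalent setting in the Python client library is the load job’s destination encryption configuration. A minimal sketch, assuming a placeholder Cloud KMS key name and table ID:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table_name"  # placeholder

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    # Placeholder Cloud KMS key; omit this setting to keep the default
    # Google-managed encryption at rest.
    destination_encryption_configuration=bigquery.EncryptionConfiguration(
        kms_key_name="projects/your-project/locations/us/keyRings/your-ring/cryptoKeys/your-key"
    ),
)
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"
client.load_table_from_uri(uri, table_id, job_config=job_config).result()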

Using Python

Here are the steps involved in this method of BigQuery Parquet Integration using Python:

  • Step 1: Having set up Python, if you wish to replace the rows in an existing table, you need to set the LoadJobConfig.write_disposition property to WRITE_TRUNCATE. The code snippet to overwrite/append Parquet to an existing Google BigQuery table is as follows:
import io
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("post_abbr", "STRING"),
    ],
)
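# The load below seeds the table with a single CSV row (CSV is the default
# source format for load jobs), so there is existing data for the Parquet
# load to overwrite.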
body = io.BytesIO(b"Washington,WA")
client.load_table_from_file(body, table_id, job_config=job_config).result()
previous_rows = client.get_table(table_id).num_rows
assert previous_rows > 0
job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    source_format=bigquery.SourceFormat.PARQUET,
)
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"
load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.
load_job.result()  # Waits for the job to complete.
destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))

Conclusion

This blog talks about the different steps you can follow to set up BigQuery Parquet Integration in a seamless fashion. It also gives a brief introduction to Parquet and BigQuery before diving into the BigQuery Parquet Integration steps.

Visit our Website to Explore Hevo

Hevo Data provides its users with a simpler platform for integrating data from 100+ sources (including 40+ Free Sources) for Analysis. It is a No-code Data Pipeline that can help you combine data from multiple sources. You can use it to transfer data from multiple data sources into your Data Warehouse, Database, or a destination of your choice like Google BigQuery. It provides you with a consistent and reliable solution to managing data in real-time, ensuring that you always have Analysis-ready data in your desired destination.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at our unbeatable pricing that will help you choose the right plan for your business needs!

Content Marketing Manager, Hevo Data

Amit is a Content Marketing Manager at Hevo Data. He enjoys writing about SaaS products and modern data platforms, having authored over 200 articles on these subjects.
