BigQuery is a Google Data Warehouse with built-in Geographic Data Intake, Storage, and Analysis tools.
To resolve complex data and examine massive datasets, it uses ordinary SQL queries.
Apache Parquet is an open-source column-oriented free data storage format that bears striking resemblance to other columnar-storage file formats available in Hadoop like ORC and RCFile. Parquet is compatible with the majority of Data Processing Frameworks in the Hadoop environment.
A brief introduction to Google BigQuery and Parquet, along with the primary reasons why Parquet is considered superior to other data formats before diving into BigQuery Parquet Integration.
How can you Load Data from Parquet to BigQuery?
Here are the steps you can follow for BigQuery Parquet Integration:
Loading Parquet into a New BigQuery Table
There are a couple of ways in which you can load Parquet data into a BigQuery table as mentioned below:
- Client Libraries
- The Cloud Console
- The bq command-line tool’s bq load command
- Configuring a load job and jobs.insert API method
Using Google Cloud Console
Here are the steps involved in this step of Google BigQuery Integration through the Cloud Console:
- Step 1: In the Console Page, go to the Google BigQuery Page. Then, from the Explorer panel, expand your project and choose a dataset for BigQuery Parquet Integration.
- Step 2: Expand the Actions option under the three dots, and click Open. Hover over to the Details panel, click on the Create table button (looks sort of a plus).
- Step 3: On the Create table page, hover to the Source section. For the Create table from field, pick Cloud Storage to continue. Next, in the source field, you can either browse to or enter the Cloud Storage URI. A point to note here is that you can’t include multiple URIs in the Cloud Console, although wildcards are supported. The Cloud Storage Bucket for BigQuery Parquet Integration needs to be in the same location as the dataset that contains the table you’re creating. Choose Parquet for the File format.
- Step 4: On the Create table page, you need to set the field values for the Destination section now. For the Dataset Name, pick the appropriate dataset. You also got to make sure that the Table type has been set to Native table. Finally, in the Table Name field, give the name of the table you’re creating in Google BigQuery for BigQuery Parquet Integration.
- Step 5: In the schema section, you don’t need to take any action. The schema is self-described in Parquet files. Click on Create table button to complete this step of BigQuery Parquet Integration.
Using Python
Here are the steps involved in this step of Google BigQuery Integration by leveraging Python:
- Step 1: Follow the Python setup instructions as provided in the BigQuery Quickstart Leveraging Client Libraries for BigQuery Parquet Integration. Next, you need to use the Client.load_table_from_uri() method to start a load job from Cloud Storage.
- Step 2: Next, to utilize Parquet, you need to set the LoadJobConfig.source_format property to the string PARQUET. After this, you would like to pass the work configuration as the ‘job_config’ argument to the load_table_from_uri() method for BigQuery Parquet Integration. The code snippet to load data from Parquet to a BigQuery Table is as follows:
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET,)
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"
load_job = client.load_table_from_uri(
uri, table_id, job_config=job_config
) # Make an API request.
load_job.result() # Waits for the job to complete.
destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
Connect Google Cloud Storage to BigQuery
Connect Google Ads to BigQuery
Connect Google Analytics to BigQuery
Overwriting/ Appending Parquet into an Existing Table
You can easily load additional data for BigQuery Parquet Integration either by appending query results or from source files. In the Cloud Console, you can leverage the Write Preference option to mention what action to take when you load data from either a query result or from a source file. Here are the Cloud Console options you can choose from:
- Append to Table
- Write if Empty
- Overwrite Table
You can overwrite or append Parquet into an existing Google BigQuery table by using one of the following options:
- Client Libraries
- The Cloud Console
- The bq command-line tool’s bq load command
- Configuring a load job and jobs.insert API method
Using Google Cloud Console
Here are the steps involved in this step of Google BigQuery Integration through the Cloud Console:
- Step 1: Open the Google BigQuery page located in the Console Page. Next, in the Explorer Panel, you can expand your project and choose a dataset for BigQuery Parquet Integration.
- Step 2: Expand the Actions option under the three dots, and click Open. Hover to the Details panel, click on the Create table button (looks sort of a plus).
- Step 3: On the Create table page, hover to the Source section. For the Create table from field, pick Cloud Storage to continue. Next, in the source field, you can either browse to or enter the Cloud Storage URI. A point to note here is that you can’t include multiple URIs in the Cloud Console, although wildcards are supported. The Cloud Storage bucket needs to be in the same location as the dataset that contains the table you’re creating. Choose Parquet for the File format.
- Step 4: On the Create table page, you need to set the field values for the Destination section now. For the Dataset Name, pick the appropriate dataset. You also got to make sure that the Table type has been set to Native table. Finally, in the Table Name field, give the name of the table you’re creating in Google BigQuery.
- Step 5: In the schema section, you don’t need to take any action since the schema has already been extensively described in Parquet files.
- Step 6: For the Partition and cluster settings, you can work with the default values. However, a table cannot be refashioned to a clustered or partitioned table by overwriting or conjoining it. The Cloud Console doesn’t provide support for overwriting clustered or partitioned tables in a load job.
- Step 7: Head over to the Advanced options, and choose Overwrite table or Append to table for the Write preference. As far as Encryption goes, you’ll click on the Customer-managed key button to use a Cloud Key Management Service Key. You can even leave the Google-managed key setting, to let BigQuery encrypt the data at rest. For Unknown values, you can leave Ignore unknown values unchecked. This option, however, applies only to JSON and CSV files.
- Step 8: You’ll click the Create Table button to end this step of BigQuery Parquet Integration.
Using Python
Here are the steps involved in this step of Google BigQuery Integration by leveraging Python:
- Step 1: Having setup Python, if you wish to replace the rows in an existing table, you need to set the LoadJobConfigure.write_disposition property to WRITE_TRUNCATE. The code snippet to overwrite/append Parquet to an existing Google BigQuery Table is as follows:
import io
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name
job_config = bigquery.LoadJobConfig(
schema=[
bigquery.SchemaField("name", "STRING"),
bigquery.SchemaField("post_abbr", "STRING"),
],
)
body = io.BytesIO(b"Washington,WA")
client.load_table_from_file(body, table_id, job_config=job_config).result()
previous_rows = client.get_table(table_id).num_rows
assert previous_rows > 0
job_config = bigquery.LoadJobConfig(
write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
source_format=bigquery.SourceFormat.PARQUET,
)
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"
load_job = client.load_table_from_uri(
uri, table_id, job_config=job_config
) # Make an API request.
load_job.result() # Waits for the job to complete.
destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
Why is Parquet Considered better than other Data Formats?
Here are a few key functionalities of Parquet that allow it to stand taller than other Data Formats:
- Compatibility: Apache Parquet is an open-source platform with free-of-cost access and is seamlessly compatible with the majority of the Hadoop Data Processing Architectures.
- Columnar File Format: Unlike the CSV files where data is arranged in rows format, Parquet files consist of row groups, headers, footers while row groups contain respective columns of data and metadata; this arrangement makes Parquet a self-describing format that is well-optimized for quick query fetching and high-performance benchmarks.
- Compression: Parquet provides users the functionality to compress the data files to minimize the processing load as well as storage requirements. As the compression function is performed column by column, different sorts of encoding can be used for string and integer data types. Data can be compressed by three frameworks in Parquet: Dictionary Encoding, Bit Packing, and Run-length Encoding.
What are the types of Data that can be Imported into Google BigQuery?
You can easily load data into Google BigQuery from a local file as a batch operation or via Cloud Storage. These are the different source data formats that are compatible with Google BigQuery:
- ORC
- JSON (newline-delimited)
- Avro
- Comma-Separated Values
- Firestore exports kept in Google BigQuery
- Parquet
Apart from these, you can also leverage BigQuery Data Transfer Service to set up repeating loads from Cloud Storage into Google BigQuery.
You need IAM permissions to load data into Google Biguery partitions and tables or run load jobs. Here are the IAM permissions you will need if you wish to load data into a BigQuery partition or table:
- roles/bigquery.admin (includes the bigquery.jobs.create permission)
- bigquery.user (includes the bigquery.jobs.create permission)
- roles/bigquery.dataEditor
- bigquery.jobUser (includes the bigquery.jobs.create permission)
- roles/bigquery.dataOwner
Conclusion
This blog talks about the different steps you can follow to set up BigQuery Parquet Integration in a seamless fashion. It also gives a brief introduction to Parquet and BigQuery before diving into the BigQuery Parquet Integration steps.
FAQ on BigQuery Parquet data from Cloud Storage
How to import a parquet file in BigQuery?
To import a parquet file in BigQuery:
– Go to the BigQuery console.
– Click on the dataset where you want to load the file.
– Click “Create Table”.
– In the “Source” section, select “Upload” and choose your Parquet file.
– In the “File format” section, select “Parquet.”
– Configure the destination table and click “Create Table.”
How do you load data into BigQuery?
You can load data into BigQuery using two methods:
– Using the Console
– Using bq command-line tool
Can we query Parquet files?
You can directly query Parquet files stored in Google Cloud Storage using an external table or a federated query.
Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at our unbeatable pricing that will help you choose the right plan for your business needs!
Amit is a Content Marketing Manager at Hevo Data. He is passionate about writing for SaaS products and modern data platforms. His portfolio of more than 200 articles shows his extraordinary talent for crafting engaging content that clearly conveys the advantages and complexity of cutting-edge data technologies. Amit’s extensive knowledge of the SaaS market and modern data solutions enables him to write insightful and informative pieces that engage and educate audiences, making him a thought leader in the sector.