Choosing a data science tool can be challenging due to the wide array of options available. Managing data transformations in the cloud is essential for cost-efficiency and ease of management.

Cloud-based Data Warehousing has become popular due to lower upfront costs, scalability, and performance enhancements compared to traditional on-premise systems. Google BigQuery is a widely accepted cloud-based Data Warehouse.

Big Data frameworks like Hadoop and Spark, developed by the Apache Software Foundation, are widely used open-source technologies for preparing, processing, managing, and analyzing large data sets. The BigQuery Connector allows Spark and Hadoop applications to interact with BigQuery.

This article introduces the Spark BigQuery Connector. It covers Google BigQuery, Apache Spark, and the BigQuery Storage Read API, along with their key features, and walks through the steps to set up the connector.

What is Apache Spark?

Apache Spark, created by a group of Ph.D. students at UC Berkeley in 2009, is a unified analytics engine for Big Data processing. It ships with libraries for streaming, Structured Query Language (SQL), machine learning, and graph processing.

Apache Spark's simple APIs can process large volumes of data, while end users rarely need to think about task and resource management across machines, which the Spark engine handles entirely.

Key Features of Apache Spark

Some of the key features of Apache Spark are as follows:

  • Performance – Apache Spark is well-known for its speed since it processes data in-memory (RAM). Apache Spark’s processing speed delivers near Real-Time Analytics.
  • Ease of Use – Apache Spark comes with built-in APIs for Scala, Java, and Python, and it also includes Spark SQL (the successor to Shark) for SQL users. Its simple building blocks make it easy to write user-defined functions.
  • Data Processing Capabilities – Apache Spark can process graphs and also comes with its own Machine Learning Library called MLlib. Due to its high-performance capabilities, you can use it for Batch Processing as well as near Real-Time Processing.

What is Google BigQuery?

Google BigQuery is a Cloud-based Data Warehouse that provides a Big Data Analytic Web Service for processing petabytes of data. It is intended for analyzing data on a large scale. It consists of two distinct components: Storage and Query Processing.

It employs the Dremel Query Engine to process queries and is built on the Colossus File System for storage. These two components are decoupled and can be scaled independently and on-demand.

Google BigQuery is fully managed by Google. You don’t need to deploy any resources, such as disks or virtual machines. It is designed primarily for analytical, read-heavy workloads.

Key Features of Google BigQuery

Some of the key features of Google BigQuery are as follows:

  • Performance – BigQuery supports partitioning, which improves query performance. The data can be readily queried using SQL or through Open Database Connectivity (ODBC).
  • Scalability – BigQuery is highly elastic: it separates compute and storage, allowing customers to scale processing and storage resources independently according to their needs. It scales well both vertically and horizontally.
  • Security – When third-party authorization is in place, users can use OAuth as a standard way to access the service. By default, all data is encrypted at rest and in transit. Cloud Identity and Access Management (IAM) allows fine-grained access control.
  • Usability – Google BigQuery is a highly user-friendly platform that requires only a basic understanding of SQL commands, ETL tools, and similar concepts.

Understanding Apache Spark BigQuery Connector

The Spark BigQuery Connector is used with Apache Spark to read and write data from and to BigQuery. The connector can read Google BigQuery tables into Spark DataFrames and write DataFrames back to BigQuery. It does this by implementing the Spark SQL Data Source API and communicating with BigQuery on Spark's behalf.

The BigQuery Storage Read API streams data from BigQuery in parallel over gRPC without the need for Google Cloud Storage as an intermediary.

Key Features of BigQuery Storage Read API

Some of the key features of the BigQuery Storage Read API are as follows:

1) Multiple Streams

Users can use the Storage Read API to read disjoint sets of rows from a table using multiple streams within a session. This facilitates consumption from distributed processing frameworks or from independent consumer threads within a single client.

2) Column Projection 

Users can choose an optional subset of columns to read while creating a session. When tables have a large number of columns, this allows for more efficient reads.

3) Column Filtering

Users can specify basic filter predicates to enable data filtration on the server side before transmitting it to a client.

4) Snapshot Consistency

Storage sessions are read using a snapshot isolation model. Every consumer reads based on a specific point in time. The session creation time is used as the default snapshot time, although consumers can access data from an earlier snapshot.

For further information on Google BigQuery Storage Read API, follow the Official Documentation.
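
Column projection and column filtering can be sketched from the Spark side as follows. This is a minimal illustration rather than a definitive recipe: it assumes an existing SparkSession is passed in as `spark`, that the connector is on the classpath, and that the public Shakespeare sample table has `word` and `word_count` columns. The helper name `read_frequent_words` is hypothetical.

```python
# Table name taken from this article; the column names are assumed from the
# public Shakespeare sample table's schema.
TABLE = "bigquery-public-data.samples.shakespeare"


def read_frequent_words(spark, min_count=100):
    """Read two columns of the table, with a row filter applied on read."""
    return (
        spark.read.format("bigquery")
        # Column filtering: the predicate is handed to the connector so that
        # rows can be filtered on the server side before transfer.
        .option("filter", f"word_count > {min_count}")
        .load(TABLE)
        # Column projection: selecting a subset of columns lets the connector
        # request only those columns from the read session.
        .select("word", "word_count")
    )
```

With an active session, `read_frequent_words(spark).show(5)` would display the first few matching rows.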

Requirements to Set up Spark BigQuery Connector

The requirements to be taken care of before moving forward with setting up Spark BigQuery Connector are as follows:

1) Enable the BigQuery Storage API

The Storage Read API is distinct from the BigQuery API and appears separately as the BigQuery Storage API in the Google Cloud Console. However, the Storage Read API is enabled in all projects where the BigQuery API is enabled, so no further activation steps are required.

2) Create a Google Cloud Dataproc Cluster (Optional)

If you don’t have an Apache Spark environment, you can set up a Cloud Dataproc cluster with pre-configured authentication. Alternatively, spark-submit can be used on any cluster instead of Cloud Dataproc.

The ‘bigquery’ or ‘cloud-platform’ scope is required for every Dataproc cluster that uses the API. Dataproc clusters have the ‘bigquery’ scope by default, so most clusters in enabled projects should work without extra configuration.

Steps to Set Up Spark BigQuery Connector

The Spark BigQuery Connector uses the cross-language Spark SQL Data Source API.

The steps followed to set up Spark BigQuery Connector are as follows:

Step 1: Providing the Spark BigQuery Connector to your Application

The Spark BigQuery Connector must be available to your application at runtime. This can be achieved in one of the following ways:

  • Whenever you create your Cluster, install the Spark BigQuery Connector in the Spark jars directory of every node by using the Dataproc connectors initialization action.
  • You can add the Spark BigQuery Connector at runtime using the --jars parameter, which can be used with the Dataproc API or spark-submit.
    • If you are using Dataproc image 1.5 and above, you can add the following parameter:
      --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
    • If you are using Dataproc image 1.4 or below, you can add the following parameter:
      --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar
  • Include the jar in your Scala or Java Spark application as a dependency; refer to Compiling against the Spark BigQuery Connector for details.

If the Spark BigQuery Connector is not available at runtime, a ClassNotFoundException is thrown.
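
The choice between the two jars listed above can be expressed as a small helper. This is only a sketch: the function names `connector_jar_for_image` and `spark_submit_args` are hypothetical, the version rule simply mirrors the "1.5 and above" versus "1.4 or below" split described in this step, and the image version is assumed to start with a "major.minor" number.

```python
# Jar locations as listed in this article.
JAR_SCALA_212 = "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"
JAR_DEFAULT = "gs://spark-lib/bigquery/spark-bigquery-latest.jar"


def connector_jar_for_image(image_version):
    """Return the connector jar for a Dataproc image version such as "1.5"."""
    # Keep only the leading "major.minor" part, e.g. "2.0.27-debian10" -> (2, 0).
    major, minor = (int(p) for p in image_version.split("-")[0].split(".")[:2])
    return JAR_SCALA_212 if (major, minor) >= (1, 5) else JAR_DEFAULT


def spark_submit_args(image_version, script):
    """Build a spark-submit argument list; `script` is the application file."""
    jar = connector_jar_for_image(image_version)
    return ["spark-submit", f"--jars={jar}", script]
```

For example, `spark_submit_args("1.4", "job.py")` produces an argument list that uses the jar without the `_2.12` suffix.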

For further information on Spark BigQuery Connector availability, visit here.

Step 2: Reading Data from a BigQuery Table

For reading data from a BigQuery table, you can refer to the following code blocks.

df = spark.read \
  .format("bigquery") \
  .load("bigquery-public-data.samples.shakespeare")

or, in Scala only, the implicit API:

import com.google.cloud.spark.bigquery._
val df = spark.read.bigquery("bigquery-public-data.samples.shakespeare")

For more information on reading data from BigQuery tables, you can visit here.

Step 3: Reading Data from a BigQuery Query

The Spark BigQuery Connector lets you execute any Standard SQL SELECT query on BigQuery and have the results loaded directly into a Spark DataFrame, as demonstrated by the following code sample:

spark.conf.set("viewsEnabled","true")
spark.conf.set("materializationDataset","<dataset>")

sql = """
  SELECT tag, COUNT(*) c
  FROM (
    SELECT SPLIT(tags, '|') tags
    FROM `bigquery-public-data.stackoverflow.posts_questions` a
    WHERE EXTRACT(YEAR FROM creation_date)>=2014
  ), UNNEST(tags) tag
  GROUP BY 1
  ORDER BY 2 DESC
  LIMIT 10
  """
df = spark.read.format("bigquery").load(sql)
df.show()

The above code displays the ten most frequent question tags since 2014, together with their counts.

A second option is to use the query option in the following way:

df = spark.read.format("bigquery").option("query", sql).load()

Execution is faster because only the query result is transmitted over the wire. Similarly, queries can perform JOINs more efficiently than running the joins in Spark, and can use other BigQuery features such as subqueries, user-defined functions, wildcard tables, and BigQuery ML.

In order to use this feature the following configurations MUST be set:

  • “viewsEnabled” must be set to true.
  • “materializationDataset” must be set to a dataset where the GCP user has table creation permission. “materializationProject” is optional.
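
As a variation on the snippets above, the two required settings can be bundled with the query in one helper. This is a hedged sketch: it assumes the connector accepts `viewsEnabled` and `materializationDataset` as per-read options as well as through `spark.conf.set`, and the helper name `read_query` and its parameters are hypothetical.

```python
def read_query(spark, sql, materialization_dataset):
    """Run a SELECT query on BigQuery and return the result as a DataFrame."""
    return (
        spark.read.format("bigquery")
        # Both settings are required for query reads, as listed above.
        # Assumption: they may be passed per read instead of globally.
        .option("viewsEnabled", "true")
        .option("materializationDataset", materialization_dataset)
        # The SQL text is passed through the "query" option.
        .option("query", sql)
        .load()
    )
```

If the per-read form does not apply to your connector version, the `spark.conf.set` calls shown earlier in this step remain the documented route.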

For further information on reading data from a BigQuery query, visit here.

Step 4: Writing Data to BigQuery

Writing a DataFrame to BigQuery works in a similar way. The connector first uploads the data to Google Cloud Storage (GCS) and then loads it into BigQuery, so a GCS bucket must be specified to hold the temporary data.

The data is stored temporarily in the Apache Parquet format. Apache ORC is an alternative format.
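
A batch write along these lines might look like the sketch below. The helper name `write_to_bigquery` and the bucket and table values a caller passes in are hypothetical; the option names `temporaryGcsBucket` and `intermediateFormat` are the connector's, and the format check is restricted to the two formats this article mentions.

```python
def write_to_bigquery(df, table, temp_bucket, intermediate_format="parquet"):
    """Write a DataFrame to a BigQuery table via a temporary GCS bucket."""
    if intermediate_format not in ("parquet", "orc"):
        raise ValueError("intermediate_format must be 'parquet' or 'orc'")
    (
        df.write.format("bigquery")
        # Bucket that holds the data before it is loaded into BigQuery.
        .option("temporaryGcsBucket", temp_bucket)
        # File format used for the temporary data in the bucket.
        .option("intermediateFormat", intermediate_format)
        # Target table, e.g. "dataset.table".
        .save(table)
    )
```

A call such as `write_to_bigquery(df, "dataset.table", "some-bucket", "orc")` would switch the temporary files to ORC.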

The GCS bucket and the format can also be set globally using Spark's RuntimeConfig.
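
A global configuration of this kind could be sketched as follows, assuming an existing SparkSession is passed in as `spark`. The helper name `set_write_defaults` is hypothetical, and the keys are assumed to be the same ones the connector accepts as per-write options.

```python
def set_write_defaults(spark, temp_bucket, intermediate_format="parquet"):
    """Set the temporary bucket and intermediate format for the whole session."""
    # spark.conf is the session's RuntimeConfig; values set here apply to
    # every later BigQuery write that does not override them.
    spark.conf.set("temporaryGcsBucket", temp_bucket)
    spark.conf.set("intermediateFormat", intermediate_format)
```

After calling `set_write_defaults(spark, "some-bucket")`, individual writes would no longer need to repeat the bucket option.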

While streaming a DataFrame to BigQuery, each batch is written in the same way as a non-streaming DataFrame.

Note that an HDFS-compatible checkpoint location (e.g., path/to/HDFS/dir or gs://checkpoint-bucket/checkpointDir) must be specified.

df.writeStream \
  .format("bigquery") \
  .option("temporaryGcsBucket","some-bucket") \
  .option("checkpointLocation", "some-location") \
  .option("table", "dataset.table") \
  .start()  # start() launches the streaming query

With Hevo Data you can seamlessly write all your data from a variety of sources to BigQuery without having to write a single line of code.

Conclusion

In this article, you have learned about Google BigQuery, Apache Spark, and their key features.

This article also provided information on Spark BigQuery Connector, BigQuery Storage Read API and the steps followed to set up Spark BigQuery Connector. Companies store valuable data from multiple data sources into Google BigQuery.

Manisha Jena
Research Analyst, Hevo Data

Manisha Jena is a data analyst with over three years of experience in the data industry and is well-versed with advanced data tools such as Snowflake, Looker Studio, and Google BigQuery. She is an alumna of NIT Rourkela and excels in extracting critical insights from complex databases and enhancing data visualization through comprehensive dashboards. Manisha has authored over a hundred articles on diverse topics related to data engineering, and loves breaking down complex topics to help data practitioners solve their doubts related to data engineering.