Spark BigQuery Connector: Easy Steps to Integrate, Optimize & Analyze Data 101

Apache Spark, Big Data, Data Warehouse, Google BigQuery • September 9th, 2021

Venturing into Data Science and choosing a tool to solve a given problem can be challenging, especially when you have a wide array of choices. Organizations are constantly seeking ways to improve the day-to-day handling of the data they produce and to minimize the cost of these operations. As a result, it has become imperative to handle such data transformations in the Cloud, which is easier to manage and more cost-efficient.

Data Warehousing architectures have changed rapidly over the years, and most of the notable service providers are now Cloud-based. Companies are therefore increasingly moving to such Cloud offerings, which provide lower upfront costs and better scalability and performance than traditional On-premise Data Warehousing systems. Google BigQuery is one of the well-known and widely accepted Cloud-based Data Warehouse Applications.

The advent of Big Data brought frameworks like Hadoop and Spark. Both are widely used Open-source frameworks for Big Data architectures, developed under the Apache Software Foundation. Each framework contains an extensive ecosystem of open-source technologies that prepare, process, manage, and analyze big data sets. The BigQuery Connector is a library that allows Spark and Hadoop applications to analyze BigQuery data and write data to BigQuery using BigQuery’s native terminology.

In this article, you will learn about the Spark BigQuery Connector. You will also gain a holistic understanding of Google BigQuery, Apache Spark, the BigQuery Storage Read API, their key features, and the steps to set up the Spark BigQuery Connector. Read along to find in-depth information about the Spark BigQuery Connector.

Introduction to Apache Spark

Apache Spark, created by a group of Ph.D. students at UC Berkeley in 2009, is a unified analytics engine containing multiple libraries for Big Data processing, including modules for Streaming, Structured Query Language (SQL), Machine Learning, and Graph Processing. Its simple APIs can process large volumes of data, while end-users scarcely need to think about task and resource management across machines, which the Apache Spark engine handles entirely.

Apache Spark is designed to work at a fast processing speed and perform general-purpose tasks. One of its main highlights is its capacity to run computations on large Datasets in memory. The framework is also more efficient than MapReduce for complex applications running on disk.

As a general-purpose engine, Apache Spark covers a broad range of workloads that would otherwise require separate distributed systems. By covering these workloads in a single engine, Spark makes it economical and straightforward to combine different processing types, which is essential for producing Data Analysis Pipelines.

To have further information about Apache Spark, follow the Official Documentation.

Key Features of Apache Spark

Some of the key features of Apache Spark are as follows:

1) Performance

Apache Spark is well-known for its speed since it processes data in-memory (RAM). Apache Spark’s processing speed delivers near Real-Time Analytics, making it a suitable tool for IoT sensors, Credit Card Processing Systems, Marketing Campaigns, Security Analytics, Machine Learning, Social Media Sites, and Log Monitoring.

2) Ease of Use

Apache Spark comes with in-built APIs for Scala, Java, and Python, and it also includes Spark SQL (formerly called Shark) for SQL users. Apache Spark also has simple building blocks, which makes it easy for users to write user-defined functions. You can use Apache Spark in interactive mode to get immediate feedback when running commands. 

3) Data Processing Capabilities

With Apache Spark, you can do more than just plain data processing. Apache Spark can process graphs and also comes with its own Machine Learning Library called MLlib. Due to its high-performance capabilities, you can use Apache Spark for Batch Processing as well as near Real-Time Processing. Apache Spark is a “one size fits all” platform that can be used to perform all tasks instead of splitting tasks across different platforms. 

4) Fault Tolerance

Apache Spark provides fault tolerance through task retries and speculative execution. Because it keeps working data in RAM, lost in-memory partitions are recomputed from their lineage.

5) Security

Apache Spark has security set to “OFF” by default, which can make you vulnerable to attacks. Apache Spark supports authentication for RPC channels via a shared secret. It also supports event logging as a feature, and you can secure Web User Interfaces via Javax Servlet Filters. Additionally, since Apache Spark can run on Yarn and use HDFS features, it can use HDFS File Permissions, Kerberos Authentication, and encryption between nodes.

6) Scalability

Since Big Data keeps on growing, Cluster sizes should increase in order to maintain throughput expectations. Apache Spark offers scalability through HDFS. Apache Spark uses Random Access Memory (RAM) for optimal performance setup.

7) Cost

Apache Spark is an open-source platform, and it comes for free. However, you have to invest in hardware and personnel or outsource the development. This means you will incur the cost of hiring a team that is familiar with the Cluster administration, software and hardware purchases, and maintenance. 

Introduction to Google BigQuery

Google BigQuery is a Cloud-based Data Warehouse that provides a Big Data Analytic Web Service for processing petabytes of data. It is intended for analyzing data on a large scale. It consists of two distinct components: Storage and Query Processing. It employs the Dremel Query Engine to process queries and is built on the Colossus File System for storage. These two components are decoupled and can be scaled independently and on-demand.

Google BigQuery is fully managed by Google, so there is no need to provision resources such as disks or virtual machines. It is designed to process read-only data. Dremel and Google BigQuery use Columnar Storage for quick data scanning, as well as a tree architecture for executing queries using ANSI SQL and aggregating results across massive computer clusters. Furthermore, Google BigQuery is serverless and designed to be extremely scalable, with a short deployment cycle and on-demand pricing.

For further information about Google BigQuery, follow the Official Documentation.

Key Features of Google BigQuery

Some of the key features of Google BigQuery are as follows:

1) Performance

BigQuery supports partitioning, which improves Query performance. The data can be readily queried using SQL or Open Database Connectivity (ODBC).

2) Scalability

Being quite elastic, BigQuery separates computation and storage, allowing customers to scale processing and memory resources according to their needs. The tool has significant vertical and horizontal scalability and runs real-time queries on petabytes of data in a very short period.

3) Security

Where third-party authorization is needed, users can rely on OAuth as a standard way to gain access. By default, all data is encrypted at rest and in transit. Cloud Identity and Access Management (IAM) allows fine-grained access administration.

4) Usability

Google BigQuery is a highly user-friendly platform that requires a basic understanding of SQL commands, ETL tools, etc.

5) Data Types

BigQuery can load data from file formats such as CSV, newline-delimited JSON, Avro, Parquet, and ORC.

6) Data Loading

It employs the conventional ELT/ETL Batch Data Loading techniques by employing standard SQL dialect, as well as Data Streaming to load data row by row using Streaming APIs.

7) Integrations

In addition to operational databases, the system supports integration with a wide range of data integration tools, business intelligence (BI), and artificial intelligence (AI) solutions. It also works with Google Workspace and Cloud Platform.

8) Data Recovery

Data backup and disaster recovery are among the services provided by Google BigQuery. Users can query point-in-time snapshots of data changes from the last seven days.

9) Pricing Models

The Google BigQuery platform is available in both on-demand and flat-rate subscription models. Data storage and querying are charged, whereas exporting, loading, and copying data is free. Because computational resources are separated from storage resources, compute is billed only when you run queries, based on the quantity of data processed.
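As a rough illustration of billing by data processed, the arithmetic can be sketched as below. The function name, the per-TiB billing unit, and the rate passed in are all assumptions made for this example; the sketch ignores free tiers and minimum charges, so check the current price list before relying on it.

```python
TIB = 2 ** 40  # number of bytes in one tebibyte


def estimate_on_demand_cost(bytes_processed: int, price_per_tib: float) -> float:
    """Estimate the charge for a query that scans `bytes_processed` bytes.

    `price_per_tib` is supplied by the caller; no real price is built in.
    """
    if bytes_processed < 0 or price_per_tib < 0:
        raise ValueError("bytes_processed and price_per_tib must be non-negative")
    return bytes_processed / TIB * price_per_tib


# Hypothetical numbers: a query scanning 250 GiB at a made-up rate of 5.0 per TiB.
cost = estimate_on_demand_cost(250 * 2 ** 30, price_per_tib=5.0)
print(f"estimated cost: {cost:.4f}")
```

Since the charge scales with bytes scanned, reading fewer columns or fewer partitions directly lowers the estimate.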

Simplify BigQuery ETL and Analysis with Hevo’s No-code Data Pipeline

A fully managed No-code Data Pipeline platform like Hevo Data helps you integrate and load data from 100+ different sources (including 30+ free sources) to a Data Warehouse such as Google BigQuery or a Destination of your choice in real-time in an effortless manner. With its minimal learning curve, Hevo can be set up in just a few minutes, allowing users to load data without compromising performance. Its strong integration with numerous sources allows users to bring in data of different kinds smoothly without writing a single line of code.

Check out some of the cool features of Hevo:

  • Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
  • Transformations: Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the Data Pipelines you set up. You need to edit the event object’s properties received in the transform method as a parameter to carry out the transformation. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use.
  • Connectors: Hevo supports 100+ integrations to SaaS platforms, files, Databases, analytics, and BI tools. It supports various destinations including Google BigQuery, Amazon Redshift, Snowflake Data Warehouses; Amazon S3 Data Lakes; and MySQL, SQL Server, TokuDB, DynamoDB, PostgreSQL Databases to name a few.  
  • Real-Time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
  • 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
  • Scalable Infrastructure: Hevo has in-built integrations for 100+ sources (including 30+ free sources) that can help you scale your data infrastructure as required.
  • 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.

Understanding Apache Spark BigQuery Connector

The Spark BigQuery Connector is used with Apache Spark to read and write data from and to BigQuery. The connector can read Google BigQuery tables into Spark DataFrames and write DataFrames back to BigQuery. This is accomplished by communicating with BigQuery using the Spark SQL Data Source API.

The BigQuery Storage Read API streams data from BigQuery in parallel over gRPC without the need for Google Cloud Storage as an intermediary.

Key Features of BigQuery Storage Read API

Some of the key features of the BigQuery Storage Read API are as follows:

1) Multiple Streams

Users can use the Storage Read API to read disjoint sets of rows from a table using multiple streams during a session. This facilitates consumption from distributed processing frameworks or from independent consumer threads within a single client.

2) Column Projection 

Users can choose an optional subset of columns to read while creating a session. When tables have a large number of columns, this allows for more efficient reads.

3) Column Filtering

Users can specify basic filter predicates to enable data filtration on the server side before transmitting it to a client.

4) Snapshot Consistency

Storage sessions are read using a snapshot isolation model. Every consumer reads based on a specific point in time. The session creation time is used as the default snapshot time, although consumers can access data from an earlier snapshot.
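From Spark, the column projection and filtering described above might be used as in the following minimal sketch. The helper name read_projected is hypothetical, and the sketch assumes the connector is on the classpath and accepts a "filter" read option for the row restriction; the spark argument is an existing SparkSession.

```python
def read_projected(spark, table, columns, row_filter=None):
    """Read only `columns` from a BigQuery table, optionally filtering rows.

    spark:      an existing SparkSession with the connector available
    table:      table id such as "project.dataset.table"
    columns:    list of column names to keep (column projection)
    row_filter: optional SQL-style predicate applied on the BigQuery side
    """
    reader = spark.read.format("bigquery")
    if row_filter:
        # Assumed connector option: passes the predicate to BigQuery so that
        # non-matching rows are dropped before they reach Spark.
        reader = reader.option("filter", row_filter)
    # Selecting columns lets the connector request only those columns.
    return reader.load(table).select(*columns)
```

For example, read_projected(spark, "bigquery-public-data.samples.shakespeare", ["word", "word_count"], "word_count > 100") would request two columns and only the rows matching the predicate.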

For further information on Google BigQuery Storage Read API, follow the Official Documentation.

Requirements to Set up Spark BigQuery Connector

The requirements to be taken care of before moving forward with setting up Spark BigQuery Connector are as follows:

1) Enable the BigQuery Storage API

The Storage Read API is distinct from the BigQuery API and appears separately in the Google Cloud Console as the BigQuery Storage API. However, the Storage Read API is enabled in all projects where the BigQuery API is enabled, so no further activation steps are required.

2) Create a Google Cloud Dataproc Cluster (Optional)

If you don’t have an Apache Spark environment, you can set up a Cloud Dataproc cluster with pre-configured authentication. Cloud Dataproc is not required, however; spark-submit can be used on any cluster.

Every Dataproc cluster that uses the API needs the ‘bigquery’ or ‘cloud-platform’ scope. Dataproc clusters have the ‘bigquery’ scope by default, so most clusters in enabled projects should work without further configuration.

Steps to Set Up Spark BigQuery Connector

The Spark BigQuery Connector uses the cross-language Spark SQL Data Source API.

The steps followed to set up Spark BigQuery Connector are as follows:

Step 1: Providing the Spark BigQuery Connector to your Application

The Spark BigQuery Connector must be available to your application at runtime. This can be achieved in one of the following ways:

  • Whenever you create your Cluster, install the Spark BigQuery Connector in the Spark jars directory of every node by using the Dataproc connectors initialization action.
  • You can add the Spark BigQuery Connector at runtime using the --jars parameter, which can be used with the Dataproc API or spark-submit.
    • If you are using Dataproc image 1.5 and above, you can add the following parameter:
      --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
    • If you are using Dataproc image 1.4 or below, you can add the following parameter:
      --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar
  • Include the jar in your Scala or Java Spark application as a dependency (refer to Compiling against the Spark BigQuery Connector).

If the Spark BigQuery Connector is not available at runtime, a ClassNotFoundException is thrown.
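The --jars choice above can be sketched as a small helper that picks the jar from the Dataproc image version and builds a spark-submit argument list. The function names and the my_job.py application file are hypothetical; the two jar locations are the ones quoted in the list.

```python
import re

# Jar locations quoted in the list above.
JAR_SCALA_212 = "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"
JAR_DEFAULT = "gs://spark-lib/bigquery/spark-bigquery-latest.jar"


def connector_jar(image_version: str) -> str:
    """Pick the connector jar for an image version such as '1.5' or '1.4.80-debian10'."""
    match = re.match(r"(\d+)\.(\d+)", image_version)
    if match is None:
        raise ValueError(f"unrecognized image version: {image_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Image 1.5 and above uses the _2.12 jar; older images use the plain jar.
    return JAR_SCALA_212 if (major, minor) >= (1, 5) else JAR_DEFAULT


def spark_submit_command(image_version: str, app_path: str) -> list:
    """Build the spark-submit argument list with the connector passed via --jars."""
    return ["spark-submit", "--jars", connector_jar(image_version), app_path]


# my_job.py is a hypothetical application file.
print(" ".join(spark_submit_command("1.5", "my_job.py")))
```

The resulting list can be handed to subprocess.run, or joined into a string as shown for use in a shell.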

For further information on Spark BigQuery Connector availability, visit here.

Step 2: Reading Data from a BigQuery Table

For reading data from a BigQuery table, you can refer to the following code blocks.

# Read a public BigQuery table into a Spark DataFrame (PySpark)
df = spark.read \
  .format("bigquery") \
  .load("bigquery-public-data.samples.shakespeare")

or, in Scala only, the implicit API:

import com.google.cloud.spark.bigquery._
val df = spark.read.bigquery("bigquery-public-data.samples.shakespeare")

For more information on reading data from BigQuery tables, you can visit here.
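Once the table is loaded, it behaves like any other DataFrame. The following is a minimal sketch that aggregates the public Shakespeare table by word; the helper name top_words is hypothetical, and the sketch assumes an existing SparkSession with the connector available and that the table exposes the columns word and word_count.

```python
def top_words(spark, limit=10):
    """Return the `limit` most frequent words across the Shakespeare sample table."""
    df = spark.read.format("bigquery").load(
        "bigquery-public-data.samples.shakespeare"
    )
    return (
        df.groupBy("word")
        # Dictionary form of agg(); Spark names the result column "sum(word_count)".
        .agg({"word_count": "sum"})
        .withColumnRenamed("sum(word_count)", "total")
        .orderBy("total", ascending=False)
        .limit(limit)
    )
```

Calling top_words(spark).show() in a PySpark session would print the result; the aggregation itself runs in Spark after the rows have been read from BigQuery.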

Step 3: Reading Data from a BigQuery Query

The Spark BigQuery Connector lets you execute any Standard SQL SELECT query on BigQuery and have the results sent directly to a Spark DataFrame. This is simple to accomplish, as demonstrated by the following code sample:

spark.conf.set("viewsEnabled","true")
spark.conf.set("materializationDataset","<dataset>")

sql = """
  SELECT tag, COUNT(*) c
  FROM (
    SELECT SPLIT(tags, '|') tags
    FROM `bigquery-public-data.stackoverflow.posts_questions` a
    WHERE EXTRACT(YEAR FROM creation_date)>=2014
  ), UNNEST(tags) tag
  GROUP BY 1
  ORDER BY 2 DESC
  LIMIT 10
  """
df = spark.read.format("bigquery").load(sql)
df.show()

The above code displays the ten most frequent tags along with their counts.

A second option is to use the query option in the following way:

df = spark.read.format("bigquery").option("query", sql).load()

Execution is faster because only the result is transmitted over the wire. Likewise, queries can perform JOINs more efficiently than running the joins in Spark, and can use other BigQuery features such as Subqueries, BigQuery User-defined Functions, Wildcard Tables, BigQuery ML, etc.

In order to use this feature the following configurations MUST be set:

  • “viewsEnabled” must be set to true.
  • “materializationDataset” must be set to a dataset where the GCP user has table creation permission. “materializationProject” is optional.
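The two required settings can be bundled with the query in a small helper, as in the sketch below. The helper name read_query is hypothetical, and the sketch assumes the connector accepts viewsEnabled and materializationDataset as per-read options as well as global configuration.

```python
def read_query(spark, sql, materialization_dataset):
    """Run a Standard SQL SELECT on BigQuery and return the result as a DataFrame.

    spark:                   an existing SparkSession with the connector available
    sql:                     the SELECT statement to execute on BigQuery
    materialization_dataset: dataset in which the user may create tables
    """
    return (
        spark.read.format("bigquery")
        # Both settings are required for query reads.
        .option("viewsEnabled", "true")
        .option("materializationDataset", materialization_dataset)
        .option("query", sql)
        .load()
    )
```

Keeping the settings next to the query makes the requirement explicit at each call site, at the cost of repeating them for every read.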

For further information on reading data from a BigQuery query, visit here.

Step 4: Writing Data to BigQuery

Writing a DataFrame to BigQuery follows a similar pattern. The connector first writes the data to GCS and then loads it into BigQuery, so a GCS bucket must be configured to hold the temporary data.

The data is stored temporarily in the Apache Parquet format. Apache ORC is an alternative format.
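A batch write along these lines might look like the following minimal sketch. The helper name write_to_bigquery is hypothetical, and the option names temporaryGcsBucket and intermediateFormat are assumed connector options; the bucket and table names come from the caller.

```python
# Intermediate formats named in the article.
SUPPORTED_FORMATS = ("parquet", "orc")


def write_to_bigquery(df, table, temp_bucket, intermediate_format="parquet"):
    """Write a DataFrame to BigQuery, staging the data in a GCS bucket first.

    df:                  the Spark DataFrame to write
    table:               destination such as "dataset.table"
    temp_bucket:         name of an existing GCS bucket for temporary files
    intermediate_format: format of the staged files
    """
    if intermediate_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported intermediate format: {intermediate_format!r}")
    (
        df.write.format("bigquery")
        .option("temporaryGcsBucket", temp_bucket)
        .option("intermediateFormat", intermediate_format)
        .save(table)
    )
```

For example, write_to_bigquery(df, "dataset.table", "some-bucket", "orc") would stage the data as ORC files instead of Parquet.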

The GCS bucket and the format can also be set globally using Spark’s RuntimeConfig in the following manner:
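A minimal sketch of the global approach is shown here. The helper names are hypothetical, and it assumes the connector reads temporaryGcsBucket and intermediateFormat from the session configuration when they are not given as write options.

```python
def configure_bigquery_writes(spark, temp_bucket, intermediate_format="parquet"):
    """Set the staging bucket and intermediate format once for the whole session."""
    spark.conf.set("temporaryGcsBucket", temp_bucket)
    spark.conf.set("intermediateFormat", intermediate_format)


def write_with_defaults(df, table):
    """Write a DataFrame to BigQuery, relying on the session-level settings."""
    df.write.format("bigquery").save(table)
```

After calling configure_bigquery_writes(spark, "some-bucket") once, each write reduces to write_with_defaults(df, "dataset.table").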

While streaming a DataFrame to BigQuery, each batch is written in the same way as a non-streaming DataFrame.

Note that an HDFS-compatible checkpoint location (e.g., path/to/HDFS/dir or gs://checkpoint-bucket/checkpointDir) must be specified.

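# Stream each micro-batch to BigQuery through the temporary GCS bucket (PySpark)
query = df.writeStream \
  .format("bigquery") \
  .option("temporaryGcsBucket", "some-bucket") \
  .option("checkpointLocation", "some-location") \
  .option("table", "dataset.table") \
  .start()  # start() launches the streaming query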

To have further information on writing data to Google BigQuery, visit here.

Conclusion

In this article, you have learned about Google BigQuery, Apache Spark, and their key features. This article also provided information on Spark BigQuery Connector, BigQuery Storage Read API and the steps followed to set up Spark BigQuery Connector.

Hevo Data, a No-code Data Pipeline, provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of destinations with a few clicks.

Hevo Data with its strong integration with 100+ data sources (including 30+ free sources) allows you to not only export data from your desired data sources & load it to the destination of your choice such as Google BigQuery, but also transform & enrich your data to make it analysis-ready so that you can focus on your key business needs and perform insightful analysis using BI tools. 

Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You may also have a look at the amazing price, which will assist you in selecting the best plan for your requirements.

Share your experience of understanding the Spark BigQuery Connector in the comment section below! We would love to hear your thoughts.
