Venturing into Data Science and deciding on a tool to use to solve a given problem can be challenging at times especially when you have a wide array of choices. In this age of data transformation where organizations are constantly seeking out ways to improve the day to day handling of data being produced and looking for methods to minimize the cost of having these operations, it has become imperative to handle such data transformations in the Cloud as it is a lot easier to manage and is also cost-efficient.
Data Warehousing architectures have rapidly changed over the years and most of the notable service providers are now Cloud-based. Therefore, companies are increasingly on the move to align with such offerings on the Cloud as it provides them with a lower upfront cost, and enhances scalability, and performance as opposed to traditional On-premise Data Warehousing systems. Google BigQuery is one of the well-known and widely accepted Cloud-based Data Warehouse Applications.
With the advent of Big Data, came up with Cloud applications like Hadoop and Spark. Both are developed by Apache Software Foundation, which are widely used Open-source frameworks for Big Data architectures. Each framework contains an extensive ecosystem of open-source technologies that prepare, process, manage and analyze big data sets. The BigQuery Connector is a library that allows Spark and Hadoop applications to analyze BigQuery data and write data to BigQuery using BigQuery’s native terminology.
In this article, you will gain information about Spark BigQuery Connector. You will also gain a holistic understanding of Google BigQuery, Apache Spark, Google Storage API, their key features and the steps to be followed to set up Spark BigQuery Connector. Read along to find out in-depth information about Spark BigQuery Connector.
What is Apache Spark?
Apache Spark, created by a set of Ph.D. understudies at UC Berkeley in 2009, is a unified analytic tool containing multiple libraries for Big Data processing designed with distinctive Streaming Modules, Structured Query Language, Machine Learning, and Graph Handling. Simple APIs in Apache Spark can process significant information, while the end-users scarcely need to think about the task and resource management over machines, which is entirely done by Apache Spark in its engine.
Apache Spark is designed to work at a fast processing speed and perform general-purpose tasks. One of the main highlights of Apache Spark is its capacity to run computations of large Datasets in memory. Yet, the framework is likewise more proficient than MapReduce for complex apps running in memory.
Apache Spark covers a broad scope of workloads as a general-purpose tool that usually requires separate distributed systems. Spark makes it economical and straightforward to consolidate distinctive processing types by covering these workloads in a similar engine, which is essential for producing Data Analysis Pipelines.
To have further information about Apache Spark, follow the Official Documentation.
Key Features of Apache Spark
Some of the key features of Apache Spark are as follow:
- Performance – Apache Spark is well-known for its speed since it processes data in-memory (RAM). Apache Spark’s processing speed delivers near Real-Time Analytics.
- Ease of Use – Apache Spark comes with in-built APIs for Scala, Java, and Python, and it also includes Spark SQL (formerly called Shark) for SQL users. It has simple building blocks, which makes it easy for users to write user-defined functions.
- Data Processing Capabilities – Apache Spark can process graphs and also comes with its own Machine Learning Library called MLlib. Due to its high-performance capabilities, you can use it for Batch Processing as well as near Real-Time Processing.
- Fault Tolerance – Apache Spark relies on speculative execution and retries for every task which relies on RAM.
- Security – Apache Spark supports authentication for RPC channels via a shared secret. It also supports event logging as a feature, and you can secure Web User Interfaces via Javax Servlet Filters. Additionally, it can use HDFS File Permissions, Kerberos Authentication, and encryption between nodes.
- Scalability – Apache Spark offers scalability through HDFS. Apache Spark uses Random Access Memory (RAM) for optimal performance setup.
Hevo Data, an Automated No Code Data Pipeline, helps you transfer data from 100+ sources (40+ free sources) to Google BigQuery in a completely hassle-free manner. Hevo is fully managed and completely automates the process of not only loading your data into BigQuery but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code.
Get Started With Hevo For Free
With Hevo, you can load data from a multitude of sources into BigQuery seamlessly and make it your single source of truth. You can also leverage Hevo’s Data Mapping feature to ensure that your Google BigQuery data is up-to-date.
What is Google BigQuery?
Google BigQuery is a Cloud-based Data Warehouse that provides a Big Data Analytic Web Service for processing petabytes of data. It is intended for analyzing data on a large scale. It consists of two distinct components: Storage and Query Processing. It employs the Dremel Query Engine to process queries and is built on the Colossus File System for storage. These two components are decoupled and can be scaled independently and on-demand.
Google BigQuery is fully managed by Cloud service providers. We don’t need to deploy any resources, such as discs or virtual machines. It is designed to process read-only data. Dremel and Google BigQuery use Columnar Storage for quick data scanning, as well as a tree architecture for executing queries using ANSI SQL and aggregating results across massive computer clusters. Furthermore, owing to its short deployment cycle and on-demand pricing, Google BigQuery is serverless and designed to be extremely scalable.
For further information about Google Bigquery, follow the Official Documentation.
Key Features of Google BigQuery
Some of the key features of Google BigQuery are as follows:
- Performance – Partitioning is supported by BigQuery, which improves query performance. The data may be readily queried using SQL or Open Database Connectivity (ODBC).
- Scalability – Being quite elastic, BigQuery separates computation and storage, allowing customers to scale processing and memory resources according to their needs. The tool has significant vertical and horizontal scalability.
- Security – When a third-party authorization exists, users can utilize OAuth as a standard approach to get the cluster. By default, all data is encrypted and in transit. Cloud Identity and Access Management (IAM) allows for fine-tuning administration.
- Usability – Google BigQuery is a highly user-friendly platform that requires a basic understanding of SQL commands, ETL tools, etc.
- Data Types – It supports JSON and XML file formats.
- Data Loading – It employs the conventional ELT/ETL Batch Data Loading techniques by employing standard SQL dialect, as well as Data Streaming to load data row by row using Streaming APIs.
- Integrations – In addition to operational databases, the system supports integration with a wide range of data integration tools, business intelligence (BI), and artificial intelligence (AI) solutions. It also works with Google Workspace and Cloud Platform.
- Data Recovery – Data backup and disaster recovery are among the services provided by Google BigQuery. Users can query point-in-time snapshots of data changes from the last seven days.
Understanding Apache Spark BigQuery Connector
The Spark BigQuery Connector is used with Apache Spark to read and write data from and to BigQuery. The connector can read Google BigQuery tables into Spark DataFrames and write DataFrames back to BigQuery. This is accomplished by communicating with BigQuery using the Spark SQL Data Source API.
The BigQuery Storage Read API streams data from BigQuery in parallel over gRPC without the need for Google Cloud Storage as an intermediary.
Key Features of BigQuery Storage Read API
Some of the key features of BigQuery Storage API are as follows:
1) Multiple Streams
Users can use the Storage Read API to read disjoint sets of rows from a table using multiple streams during a session. Consumption from distributed processing frameworks or independent consumer threads within a single client is facilitated by this.
2) Column Projection
Users can choose an optional subset of columns to read while creating a session. When tables have a large number of columns, this allows for more efficient reads.
3) Column Filtering
Users can specify basic filter predicates to enable data filtration on the server side before transmitting it to a client.
4) Snapshot Consistency
Storage sessions are read using a snapshot isolation model. Every customer reads based on a specific point in time. The session creation time is used as the default snapshot time, although consumers can access data from an earlier snapshot.
For further information on Google BigQuery Storage Read API, follow the Official Documentation.
Requirements to Set up Spark BigQuery Connector
The requirements to be taken care of before moving forward with setting up Spark BigQuery Connector are as follows:
1) Enable the BigQuery Storage API
The Storage Read API is distinct from the BigQuery API and appears separately as the BigQuery Storage API in the Google Cloud Console. The Storage Read API, on the other hand, is enabled in all projects where the BigQuery API is enabled so, no further activation steps are required.
2) Create a Google Cloud Dataproc Cluster (Optional)
If you don’t have an Apache Spark environment, you can set up a Cloud Dataproc cluster with pre-configured authentication. Instead of Cloud Dataproc,
spark-submit may be used on any cluster.
The ‘BigQuery’ or ‘Cloud-platform’ scopes are required for every Dataproc cluster that uses the API. Dataproc clusters by default have the ‘BigQuery’ scope, therefore most clusters in enabled projects should work by default, for example.
Steps to Set Up Spark BigQuery Connector
The Spark BigQuery Connector uses the cross-language Spark SQL Data Source API.
The steps followed to set up Spark BigQuery Connector are as follows:
Step 1: Providing the Spark BigQuery Connector to your Application
The Spark BigQuery Connector must be available to your application at runtime. This can be achieved in one of the following ways:
- Whenever you create your Cluster, install the Spark BigQuery Connector in the Spark jars directory of every node by using the Dataproc connectors initialization action.
- You can add the Spark BigQuery Connector at runtime using the
--jars parameter, which can be used with the Dataproc API or
- If you are using Dataproc image 1.5 and above, you can add the following parameter:
- If you are using Dataproc image 1.4 or below, you can add the following parameter:
- Include the jar in your Scala or Java Spark application as a dependency and can refer to Compiling against the Spark BigQuery Connector.
If the Spark BigQuery Connector is not available at runtime, a
ClassNotFoundException is thrown.
For further information on Spark BigQuery Connector availability, visit here.
Step 2: Reading Data from a BigQuery Table
For reading data from a BigQuery table, you can refer to the following code blocks.
df = spark.read
or the Scala only implicit API:
val df = spark.read.bigquery("bigquery-public-data.samples.shakespeare")
For more information on reading data from BigQuery tables, you can visit here.
Step 3: Reading Data from a BigQuery Query
The Spark BigQuery Connector lets you execute any Standard SQL SELECT query on BigQuery and have the results sent directly to a Spark Dataframe. This is simple to accomplish, as demonstrated by the following code sample:
sql = """
SELECT tag, COUNT(*) c
SELECT SPLIT(tags, '|') tags
FROM `bigquery-public-data.stackoverflow.posts_questions` a
WHERE EXTRACT(YEAR FROM creation_date)>=2014
), UNNEST(tags) tag
GROUP BY 1
ORDER BY 2 DESC
df = spark.read.format("bigquery").load(sql)
And the above code yields the following result:
A second option is to use the Query option in the following way:
df = spark.read.format("bigquery").option("query", sql).load()
The execution is faster as only the result is transmitted over the wire. In a similar way, the queries can include JOINs more efficiently than running joins on Spark or use other BigQuery features such as Subqueries, BigQuery User-defined Functions, Wildcard Tables, BigQuery ML, etc.
In order to use this feature the following configurations MUST be set:
- “viewsEnabled” must be set to true.
- “materializationDataset” must be set to a dataset where the GCP user has table creation permission. “materializationProject” is optional.
Fur further information on reading data from BigQuery query, visit here.
Step 4: Writing Data to BigQuery
Writing a DataFrame to BigQuery is done in similar ways as above. You can observe that the process first uploads the data to GCS before loading it into BigQuery; a GCS bucket must be created to specify the temporary data placement.
The data is stored temporarily in the Apache parquet format. Apache ORC is an alternative format.
The GCS bucket and the format can also be set globally using Spark”s RuntimeConfig in the following manner:
While streaming a DataFrame to BigQuery, each batch is written in the same way as a non-streaming DataFrame.
Note that an HDFS compatible checkpoint location (eg:
gs://checkpoint-bucket/checkpointDir) must be specified.
With Hevo Data you can seamlessly write all your data from a variety of sources to BigQuery without having to write a single line of code.
In this article, you have learned about Google BigQuery, Apache Spark, and their key features. This article also provided information on Spark BigQuery Connector, BigQuery Storage Read API and the steps followed to set up Spark BigQuery Connector. Companies store valuable data from multiple data sources into Google BigQuery. However, when it comes to loading with BigQuery, you need to be an expert to set up ETL pipelines from scratch and manually configure several details. Moreover, most of the time, the data is not available in the right format and you will need data engineering and BigQuery administration skills to transform the data.
Hevo Data provides an Automated No-code Data Pipeline that empowers you to overcome the above-mentioned limitations. Hevo caters to 100+ data sources (including 40+ free sources) and can seamlessly load data to BigQuery in real-time. Furthermore, Hevo’s fault-tolerant architecture ensures a consistent and secure transfer of your data to BigQuery. Using Hevo will make your life easier and make Data Transfer hassle-free.
Get Started with Hevo for Free
Share your experience of understanding the Spark BigQuery Connector in the comment section below! We would love to hear your thoughts.