Analyzing vast volumes of data can be challenging. Google BigQuery is a powerful tool that enables you to store, process, and analyze large datasets with ease. However, it may not provide all of the functionality and tooling needed for complex analysis. This is where Databricks steps in. It offers interactive notebooks, ML tools, optimized clusters, and more to help you simplify complex tasks. By integrating BigQuery with Databricks, you can combine data warehousing capabilities with advanced analytics tools and get more out of your data.

In this article, you will discover how to sync BigQuery to Databricks using different methods.

Why Integrate BigQuery to Databricks? 

BigQuery Databricks integration significantly benefits an organization’s data analytics capabilities.

  • Databricks is a unified platform for data analytics and engineering, enabling a centralized workflow for data processing.
  • The optimized Apache Spark clusters in Databricks can process large datasets quickly and allow you to allocate resources dynamically based on workload needs. This results in fast query execution, reduced time to insight, and efficient resource utilization.
  • Databricks offers advanced analytics tools and libraries such as MLflow, Spark SQL, and more.

Overview of BigQuery 

BigQuery is a fully managed, serverless data warehouse developed by Google and offered as a Platform as a Service (PaaS). It supports standard SQL, making it easy to query and retrieve data. It integrates seamlessly with other Google Cloud services, such as Google Cloud Storage (GCS), Google Sheets, and more, simplifying data visualization and workflows. It is a powerful warehouse solution with high scalability and strong querying capabilities.
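
For example, here is a minimal sketch of running a standard SQL query against BigQuery with the google-cloud-bigquery Python client. The project ID is a placeholder, and the query uses one of Google's public sample datasets.

from google.cloud import bigquery

# Authenticate with application default credentials and point the client
# at your project; "your-project-id" is a placeholder.
client = bigquery.Client(project="your-project-id")

query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 5
"""

# client.query() submits the job; result() waits for completion and returns rows
for row in client.query(query).result():
    print(row.word, row.total)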

Overview of Databricks

Databricks is a unified data analytics platform. It offers optimized Apache Spark clusters, interactive notebooks, and a collaborative workspace where you can work efficiently with your team on data projects. Databricks integrates seamlessly with varied data sources and services such as AWS, Azure, SQL Server, and third-party tools, allowing it to fit within your existing infrastructure. It also has built-in ML and AI capabilities that help you develop, manage, and deploy end-to-end ML workflows and models. Overall, Databricks enhances your ability to make informed business decisions and innovate within your workflows.
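
As a small illustration of these built-in ML capabilities, the sketch below logs a parameter and a metric with MLflow, which Databricks bundles into its workspace. The run name, parameter, and metric values are purely illustrative.

import mlflow

# Track an experiment run; Databricks notebooks ship with MLflow preinstalled.
with mlflow.start_run(run_name="churn-baseline"):
    mlflow.log_param("model_type", "logistic_regression")  # illustrative parameter
    mlflow.log_metric("auc", 0.87)                          # illustrative metric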

How to Migrate BigQuery to Databricks

You can load data from BigQuery to Databricks using two methods.

Method 1: Load BigQuery to Databricks Using Hevo

Hevo is a real-time ELT platform that helps you smoothly integrate data from BigQuery to Databricks through its no-code, automated, and flexible data pipelines. It supports 150+ data sources from which you can ingest data into your desired destination. You can also transform and enrich your data, making it analysis-ready in no time.

Benefits of Using Hevo for BigQuery Databricks Integration

  • Automated Schema Mapping: Hevo automatically reads and replicates the schema of your source data and applies it to your destination.
  • Data Transformation: Hevo’s data transformation tools enable you to clean and enrich your source data before loading it into your destination. 
  • Incremental Loading: Incremental data loading allows you to load the updated and modified data to your targeted system instead of replicating the whole dataset again.

Let’s look at the steps to load data from BigQuery to Databricks by building and deploying a data pipeline in Hevo.

Step 1: Configure BigQuery as Your Source

Prerequisites:

  • Hevo needs permission to access the data in your Google Cloud console, so make sure you have granted it the required access.
  • You need access to a BigQuery project containing one or more datasets, with at least one table in a dataset.
  • You must have an active billing account linked to your GCP project.
  • You must be granted roles such as Data Editor, Data Viewer, and Job User.
  • You must have access to an existing GCS bucket in the same location as your BigQuery datasets.
  • To create a data pipeline in Hevo, you must be assigned the role of Team Collaborator, Team Administrator, or Pipeline Administrator.

Follow the steps below to configure the source:

  • Click on PIPELINES in the Navigation Bar.
  • Click on + CREATE in the Pipelines List View.
  • On the source selection page, search for and select Google BigQuery as your source.
  • On the Configure your BigQuery Account page, connect your BigQuery warehouse in one of the following ways:
  1. Connect using a User Account.
  2. Connect using a Service Account.
  • On the Configure your BigQuery as Source page, specify the mandatory details to set up the connection.
BigQuery to Databricks: Configure Source Settings

To learn more about configuring BigQuery as your source, read the Hevo Documentation. 

Step 2: Set up Databricks as Your Destination

Prerequisites:

  • An active AWS, GCP, or Azure account should be available.
  • Create a Databricks workspace on your cloud service account.
  • Enable the IP access list feature to connect to your Databricks workspace.
  • The URL of the Databricks workspace must be available, in the format https://<deployment name>.cloud.databricks.com.
  • You must meet the workspace connection requirements described in the Hevo documentation.

You can connect Databricks as your destination using the recommended method: the Databricks Partner Connect option.

Follow the steps to configure your destination:

  • Select DESTINATIONS from the Navigation Bar. 
  • Click on + CREATE in the Destinations List View.
  • Select Databricks as your destination on the Add Destination page.
  • On the Configure Databricks as your Destination page, specify the following details.
BigQuery to Databricks: Configure your Destination Settings

For more information, refer to the Hevo documentation on configuring Databricks as the destination.


Method 2: Read and Query BigQuery Data in Databricks Using Google Cloud Console

In this method, you will learn: 

  • How to read BigQuery in Databricks. 
  • How to convert a BigQuery table into a Databricks table.


Step 1: Set up Your Google Cloud 

You have to set up your Google Cloud project by following the steps given below:

Enable Your BigQuery Storage API

The BigQuery API is enabled by default for all Google Cloud projects in which BigQuery is used. However, you need to enable the BigQuery Storage API for the Google Cloud project you are using to connect with Databricks.

You can enable the BigQuery Storage API either from the Google Cloud console or with the gcloud command-line tool to set up the connection between BigQuery and Databricks.

Create a Google Service Account

You need to create a Google service account that your Databricks cluster will use to perform read, write, and query tasks.

You can create the service account either from the Google Cloud console or with the gcloud command-line tool. In either case, you need to generate or add a key for secure authentication.

Create a Google Storage Bucket for Temporary Storage

You must create a Google Cloud Storage bucket that is used as temporary storage when writing to a BigQuery table.

  • Click on Storage in the navigation bar on your Google Cloud Platform. 
  • Click on CREATE BUCKET. 
BigQuery to Databricks: Creating Google Storage Bucket
  • Configure the Bucket Settings and click on CREATE. 
BigQuery to Databricks: Configure Google Bucket Settings
  • Click on the Permissions tab, add members, and provide permissions to the service account on the bucket. Then click on SAVE. 
BigQuery to Databricks: Adding Permissions to the Google Bucket

Step 2: Set up Databricks

You need to configure your Databricks cluster to access the data in your BigQuery table. To do this, provide your service account's JSON key in the cluster's Spark configuration.

Add the following configuration in the Spark Config tab, replacing <base64-keys> with the Base64-encoded string of your JSON key file and the other placeholders with the corresponding values from the key file.

credentials <base64-keys>
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.auth.service.account.email <client-email>
spark.hadoop.fs.gs.project.id <project-id>
spark.hadoop.fs.gs.auth.service.account.private.key <private-key>
spark.hadoop.fs.gs.auth.service.account.private.key.id <private-key-id>
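
If you are unsure how to produce the <base64-keys> value, the snippet below shows one way to Base64-encode the downloaded JSON key file in Python; the file path is a placeholder.

import base64

# Read the service account key file and Base64-encode it; paste the printed
# value in place of <base64-keys> in the Spark configuration above.
with open("/path/to/service-account-key.json", "rb") as key_file:
    encoded_key = base64.b64encode(key_file.read()).decode("utf-8")

print(encoded_key)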

Step 3: Read a BigQuery Table in Databricks

Open your Databricks workspace, create or open a notebook, and execute the following code to read a BigQuery table into a DataFrame.

# Read the BigQuery table into a Spark DataFrame
df = spark.read.format("bigquery") \
  .option("table", "<table-name>") \
  .option("project", "<project-id>") \
  .option("parentProject", "<parent-project-id>") \
  .load()
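
If you also want to convert the data you just read into a Databricks table (the second goal of this method), one option is to persist the DataFrame as a managed Delta table. The table name below is a placeholder, so adjust it to your own catalog and schema.

# Persist the BigQuery data as a Databricks Delta table for SQL access
df.write.format("delta").mode("overwrite").saveAsTable("default.bigquery_copy")

# Query the new table from the same notebook
spark.sql("SELECT COUNT(*) FROM default.bigquery_copy").show()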

Step 4: Write to a BigQuery Table

To write data to a BigQuery table, execute the following command.

df.write.format("bigquery") \

  .mode("<mode>") \

  .option("temporaryGcsBucket", "<bucket-name>") \

  .option("table", <table-name>) \

  .option("project", <project-id>) \

  .option("parentProject", <parent-project-id>) \

  .save()

In the above code, <bucket-name> is the name of the bucket you created for temporary storage, and <mode> is the Spark save mode, such as append or overwrite.
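
Putting the two steps together, the following hedged sketch reads a BigQuery table, applies a simple filter, and writes the result to another BigQuery table. Every <...> value is a placeholder, and the event_date column is an assumption about the source schema, so adjust the filter to your own data.

from pyspark.sql import functions as F

# Read the source table from BigQuery
source_df = spark.read.format("bigquery") \
  .option("table", "<source-table-name>") \
  .option("project", "<project-id>") \
  .option("parentProject", "<parent-project-id>") \
  .load()

# Keep only the last 30 days of data (assumes an event_date column exists)
recent_df = source_df.filter(F.col("event_date") >= F.date_sub(F.current_date(), 30))

# Append the filtered data to a target BigQuery table, staging via GCS
recent_df.write.format("bigquery") \
  .mode("append") \
  .option("temporaryGcsBucket", "<bucket-name>") \
  .option("table", "<target-table-name>") \
  .option("project", "<project-id>") \
  .option("parentProject", "<parent-project-id>") \
  .save()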

Limitations of Method 2: Read and Query BigQuery Data in Databricks Using Google Cloud Console

  • You need to understand the functionalities and terminology of both BigQuery and Databricks. If you are not yet familiar with these ecosystems, setting up the connection between them can be difficult.
  • You can’t transfer data directly from BigQuery to Databricks; a Google Cloud Storage bucket must be used for temporary storage, which makes the process lengthy and time-consuming.
  • Managing the authorizations and permissions between BigQuery, Google Cloud Console, and Databricks requires careful configuration. Data security issues might arise if you don’t provide the correct permissions or credentials.

Use Cases

  • Using machine learning models in Databricks, you can perform real-time risk assessments and detect fraudulent activities.
  • You can analyze your customer data in Databricks and build predictive models to identify customer churn rates and generate retention strategies. 
  • Databricks machine learning tools will help you build forecasting models that can be used to optimize inventory levels and reduce lead times in the supply chain. 

Conclusion

BigQuery and Databricks are robust, highly scalable platforms that allow you to store and analyze your datasets effortlessly. Integrating data from BigQuery into Databricks facilitates seamless data management and enables advanced analysis.

You can integrate data between these platforms using the Google Cloud Console or Hevo. While the Google Cloud Console approach can be a little complex, Hevo’s flexible no-code data pipeline lets you smoothly ingest data from source to destination in a minimal amount of time.

Overall, BigQuery Databricks integration is a transformative step towards unlocking the value of data analytics and driving business value.

FAQs (Frequently Asked Questions)

Q. How can you load Google BigQuery data into a DataFrame on Databricks?

You can load your Google BigQuery data into a DataFrame on Databricks using a Python or Scala notebook, as shown in Method 2 above.

Q. Can the BigQuery Connector connect data from Google BigQuery to Spark on Databricks?

Yes. The BigQuery Connector is a client-side library that uses the BigQuery API, and you can use it to read data from Google BigQuery into Apache Spark running on Databricks.

Saloni Agarwal
Technical Content Writer, Hevo Data

With a strong background in market research for data science and cybersecurity products, Saloni is an expert at crafting informative articles on key topics within the data science domain, such as data transformation, processes, and analysis. Saloni's passion for the field drives her to continually learn and stay abreast of emerging technologies and trends, ensuring her contributions are impactful. Her work aims to enrich the discourse in data science, providing valuable insights and fostering a deeper understanding of complex subjects.
