Organizations today rely on Confluent Cloud to stream real-time data using Apache Kafka. They can load this real-time data into a popular data warehouse like Google BigQuery for further analysis. Google BigQuery, in turn, can be combined with powerful BI tools such as Power BI, Tableau, Google Data Studio, and more to gain meaningful insights from Confluent Cloud data, helping organizations make effective data-driven decisions. Organizations can connect Confluent Cloud data with Google BigQuery using the Confluent BigQuery Sink Connector or third-party ETL (extract, transform, and load) tools.

In this article, you will learn to connect Confluent Cloud to BigQuery using the Confluent BigQuery Sink connector. 

Prerequisites

  • Basics of Apache Kafka

What is Confluent Cloud?


Confluent Cloud is a full-scale data streaming platform based on Apache Kafka, where businesses can access, store, and manage data in real time. It consists of a web interface, a local command-line interface, Schema Registry, REST Proxy, more than 100 pre-built Kafka connectors, and ksqlDB for administering streaming services.

With Confluent Cloud, organizations can drive business value from their data without worrying about manual processes such as data transportation and data integration between different systems. Confluent Cloud simplifies tasks such as connecting data sources to Kafka, building streaming applications, and securing, monitoring, and managing Kafka infrastructure.

Key Features of Confluent Cloud

  • Confluent Control Center: Confluent Control Center is a GUI-based application for managing and monitoring Kafka. It allows organizations to manage Kafka Connect and easily create, edit, and manage connections to other systems. Organizations can monitor data streams from producers to consumers, ensure every message is delivered, and measure how long delivery takes. With Control Center, organizations can build a production data pipeline on Kafka without writing a single line of code.
  • Self-Balancing Clusters: The Confluent Platform can deploy hundreds of brokers, manage thousands of Kafka topics, and handle millions of messages per hour. Brokers are sometimes lost due to network issues, and as topics are created or brokers come and go, partitions must be reassigned to balance the workload, which generates significant resource overhead when handled manually. Self-Balancing Clusters automate these partition rebalances so the cluster stays balanced without manual intervention.
  • Confluent Cluster Linking: Cluster Linking in Confluent Cloud is a fully managed service for moving data from one Confluent cluster to another. It creates perfect copies of your Kafka topics and programmatically keeps data in sync across clusters, making it quick to build multi-datacenter, multi-region, and hybrid cloud deployments.

What is Google BigQuery?


Developed in 2010, Google BigQuery is a highly scalable, fully managed data warehouse with many cloud functionalities that helps organizations store terabytes of data seamlessly. It leverages standard SQL for analyzing colossal amounts of data and obtaining answers faster. Google BigQuery is built on columnar storage, which delivers fast query performance and high data compression.

BigQuery is also called a SQL-based Data Warehouse as a Service (DWaaS) with zero infrastructure management. It is a serverless warehouse that does not require any upfront hardware provisioning or management. BigQuery runs SQL queries, and all requests must be authenticated. Google provides a complete package to its users, with big data loading features on Google Cloud Storage and connectivity with various Google apps like Apps Script. Google BigQuery also has many built-in features, such as machine learning and AI capabilities, geospatial analysis, and business intelligence.

BigQuery stores data in a columnar format that is optimized for analytical queries. It presents data in tables, rows, and columns, with full support for database transaction semantics (ACID).

Google BigQuery allows developers and data scientists to work with programming languages such as Java, Python, Go, and JavaScript. They can also use the BigQuery API to transform and manage data effectively.

Key Features of Google BigQuery

  • Machine Learning: Google BigQuery allows businesses to create and execute machine learning models using standard SQL queries, as sketched in the example after this list. Organizations can build models such as Linear Regression, Multiclass Logistic Regression, Binary Logistic Regression, K-means Clustering, and more with BigQuery.
  • BigQuery BI Engine: BigQuery includes a BI Engine that helps businesses process large volumes of data with sub-second query response times and high concurrency. The BI Engine integrates with powerful tools such as Google Data Studio, Tableau, Power BI, and more to gain meaningful insights.
  • Real-time Analytics: BigQuery keeps businesses up to date with real-time data transfer and analytics. It allocates resources dynamically to deliver the best performance, allowing organizations to generate business reports as required.
  • User Friendly: With BigQuery, organizations can store and analyze their data in a few clicks. The user-friendly web interface of BigQuery allows organizations to set up a cloud data warehouse easily, without installing clusters or choosing storage sizes, encryption settings, and more.
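For example, creating a model in BigQuery ML is a single SQL statement. The snippet below is a minimal sketch; the dataset, table, and column names are hypothetical placeholders, not part of this tutorial's setup.

CREATE OR REPLACE MODEL `mydataset.sample_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['label']) AS
SELECT feature_1, feature_2, label
FROM `mydataset.training_data`;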

Connecting Confluent Cloud to BigQuery

Method 1: Using Hevo to Set Up Confluent Cloud to BigQuery

Hevo provides an Automated No-code Data Pipeline that helps you move your Confluent Cloud data to BigQuery. Hevo is fully managed and completely automates the process of not only loading data from your 150+ data sources (including 40+ free sources) but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

Using Hevo, you can connect Confluent Cloud to BigQuery in the following 2 steps:

  • Step 1: Configure Confluent Cloud as the Source in your Pipeline by following these steps:
    • Step 1.1: In the Asset Palette, select PIPELINES.
    • Step 1.2: In the Pipelines List View, click + CREATE.
    • Step 1.3: Select “Confluent Cloud” on the Select Source Type page.
    • Step 1.4: Enter the following information on the Configure your Confluent Cloud Source page:
      • Pipeline Name: A distinct, 255-character-maximum name for the Pipeline.
      • Bootstrap Server(s): The server(s) used for bootstrapping, taken from Kafka Confluent Cloud. 
      • API Key: The API key that was retrieved from your Confluent Cloud account.
      • API Secret: Your Kafka Confluent Cloud account’s API secret.
configure confluent cloud as source
    • Step 1.5: Click TEST & CONTINUE.
    • Step 1.6: Then, configure the data ingestion and set up the Destination.
  • Step 2: To set up Google BigQuery as a destination in Hevo, follow these steps:
    • Step 2.1: In the Asset Palette, select DESTINATIONS.
    • Step 2.2: In the Destinations List View, click + CREATE.
    • Step 2.3: Select Google BigQuery from the Add Destination page.
    • Step 2.4: Choose the BigQuery connection authentication method on the Configure your Google BigQuery Account page.
Configure your Google BigQuery Destination
    • Step 2.5: Choose one of these options:
      • To connect using a Service Account:
        • Attach the Service Account Key file.
        • Note that Hevo only accepts key files in JSON format.
        • Click CONFIGURE GOOGLE BIGQUERY ACCOUNT.
      • To connect using a user account:
        • Click + to add a Google BigQuery account.
        • Log in as a user with BigQuery Admin and Storage Admin permissions.
        • Click Allow to grant Hevo access to your data.
Grant Access to Hevo
    • Step 2.6: Set the following parameters on the Configure your Google BigQuery page:
      • Destination Name: A unique name for your Destination.
      • Project ID: The ID of the BigQuery project to which you granted permissions in the earlier steps.
      • Dataset ID: The name of the dataset to which you want to sync your data.
      • GCS Bucket: The Google Cloud Storage bucket in which files are staged before being uploaded to BigQuery.
      • Sanitize Table/Column Names: Enable this option to replace spaces and non-alphanumeric characters in table and column names with underscores (_).
    • Step 2.7: Click TEST CONNECTION to test connectivity with the Google BigQuery warehouse.
    • Step 2.8: Once the test is successful, click SAVE DESTINATION.

Method 2: Using Custom Code to Move Data from Confluent Cloud to BigQuery

This method of connecting Confluent Cloud to BigQuery uses a custom code approach. It is time-consuming and somewhat tedious to implement, because you must write custom code to enable the Confluent Cloud to BigQuery migration, which makes it best suited to users with a technical background. You can use the Confluent BigQuery Sink connector to export data from Confluent Cloud to BigQuery.

Confluent Cloud allows businesses to store their real-time data by automating processes like creating clusters and Kafka topics. The Sink connector can create BigQuery tables with the appropriate BigQuery table schema when streaming data from Confluent Cloud with registered schemas. Follow the below steps to connect Confluent Cloud to BigQuery.

  • Install the BigQuery Connector

This is the first step in connecting Confluent Cloud to BigQuery. Install the BigQuery connector using the Confluent Hub client with the command below, or download the connector ZIP file manually from Confluent Hub.

confluent-hub install wepay/kafka-connect-bigquery:latest

You can install a specific version of the connector by replacing “latest” with a version number. For example:

confluent-hub install wepay/kafka-connect-bigquery:1.1.2

The Confluent Cloud to BigQuery Sink connector can stream table records from Kafka topics to Google BigQuery. The table records are streamed at high throughput rates to facilitate analytical queries in real time. In this tutorial, you need to run the Confluent Platform locally, so go to your Confluent Platform installation directory and run the installation command shown above.

Adding a new connector plugin requires restarting Kafka Connect. You can use the Confluent CLI to restart Kafka Connect with the command below.

confluent local stop connect && confluent local start connect
Using CONFLUENT_CURRENT: /Users/username/Sandbox/confluent-snapshots/var/confluent.NuZHxXfq
Starting zookeeper
zookeeper is [UP]
Starting kafka
kafka is [UP]
Starting schema-registry
schema-registry is [UP]
Starting kafka-rest
kafka-rest is [UP]
Starting connect
connect is [UP]

Verify that the BigQuery Sink connector plugin has been installed correctly, using the command below.

curl -sS localhost:8083/connector-plugins | jq .[].class | grep BigQuerySinkConnector
"com.wepay.kafka.connect.bigqueryl.BigQuerySinkConnector"
  • Setting up the Google Cloud Platform BigQuery Connector
    • Ensure that you have the below prerequisites for setting up the connector.
      • An active Google Cloud Platform account.
      • A BigQuery project. You can create the project using the Google Cloud Console.
      • A BigQuery dataset.
      • A service account to access the BigQuery project containing the dataset. While creating the service account, you can create and download a key as a JSON file, as shown in the example below.
{
  "type": "service_account",
  "project_id": "confluent-842583",
  "private_key_id": "...omitted...",
  "private_key": "-----BEGIN PRIVATE ...omitted... =\n-----END PRIVATE KEY-----\n",
  "client_email": "confluent2@confluent-842583.iam.gserviceaccount.com",
  "client_id": "...omitted...",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/confluent2%40confluent-842583.iam.gserviceaccount.com"
}

According to the GCP access control documentation (https://cloud.google.com/bigquery/docs/access-control), the service account will need either the BigQuery Editor primitive IAM role or the bigquery.dataEditor predefined IAM role. The minimum permissions are as follows:

bigquery.datasets.get
bigquery.tables.create
bigquery.tables.get
bigquery.tables.getData
bigquery.tables.list
bigquery.tables.update
bigquery.tables.updateData
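
If you manage Google Cloud from the command line, the gcloud sketch below shows one way to create the service account, grant it the bigquery.dataEditor role described above, and download a key file. The account name and project ID mirror the example key file above, and the key file path is an illustrative placeholder; substitute your own values.

gcloud iam service-accounts create confluent2 --project=confluent-842583
gcloud projects add-iam-policy-binding confluent-842583 --member="serviceAccount:confluent2@confluent-842583.iam.gserviceaccount.com" --role="roles/bigquery.dataEditor"
gcloud iam service-accounts keys create /Users/username/credentials.json --iam-account=confluent2@confluent-842583.iam.gserviceaccount.com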
  • Starting the BigQuery Sink Connector
    • Follow the below steps to start the BigQuery Sink Connector.
      • Create a file named register-kcbd-connect-bigquery.json for storing the connector configuration (a sample configuration is sketched after these steps).
      • Start the connector with the below command.
curl -i -X POST -H "Accept:application/json" -H  "Content-Type:application/json" http://localhost:8083/connectors/ -d @register-kcbd-connect-bigquery.json
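
As a rough sketch, and assuming the topic, dataset, and project used elsewhere in this walkthrough, register-kcbd-connect-bigquery.json might look like the following. The key file path is a placeholder, and property names can vary between connector versions, so check the configuration reference for the version you installed.

{
  "name": "kcbq-connect1",
  "config": {
    "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
    "tasks.max": "1",
    "topics": "kcbq-quickstart1",
    "sanitizeTopics": "true",
    "autoCreateTables": "true",
    "autoUpdateSchemas": "true",
    "schemaRetriever": "com.wepay.kafka.connect.bigquery.schemaregistry.schemaretriever.SchemaRegistrySchemaRetriever",
    "schemaRegistryLocation": "http://localhost:8081",
    "project": "confluent-842583",
    "datasets": ".*=ConfluentDataSet",
    "keyfile": "/Users/username/credentials.json"
  }
}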
  • Start your Kafka Producer
    • Follow the below steps to start your Kafka producer.
      • Go to the Kafka bin folder and start the Kafka producer in a new terminal.
      • Type the below commands on the terminal.
./kafka-avro-console-producer --broker-list localhost:9092 --topic kcbq-quickstart1 --property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"f1","type":"string"}]}'
      • Enter the text of two records, pressing Enter after each line.
{"f1":"Testing the Kafka-BigQuery Connector!"}
{"f1":"Testing the Kafka-BigQuery Connector for a second time!"}
  • Check Results in BigQuery
    • Follow the below steps to check results in BigQuery.
      • Navigate to the BigQuery editor in Google Cloud Platform.
      • Enter the below SQL select statement.
SELECT * FROM ConfluentDataSet.quickstart1;
  • Clean Up Resources
    • Follow the below steps to clean up the resources.
      • Delete the connector.
curl -X DELETE localhost:8083/connectors/kcbq-connect1
      • Stop Confluent services.
confluent local stop

Conclusion

This article talks about connecting Confluent Cloud to BigQuery. Confluent Cloud helps organizations use their real-time data streams and leverage the functionalities of Apache Kafka. It enables organizations to view and create environments, clusters, and Kafka topics quickly, without worrying about the underlying mechanics. Organizations can store their real-time data streams in a data warehouse like Google BigQuery to analyze the data and gain meaningful insights for better decisions.

Sign Up for a 14-day free trial to reliably integrate data with Hevo’s Fully Automated No Code Data Pipeline. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Manjiri Gaikwad
Technical Content Writer, Hevo Data

Manjiri is a proficient technical writer and a data science enthusiast. She holds an M.Tech degree and leverages the knowledge acquired through that to write insightful content on AI, ML, and data engineering concepts. She enjoys breaking down the complex topics of data integration and other challenges in data engineering to help data professionals solve their everyday problems.
