Confluent Cloud to BigQuery Integration: 2 Easy Methods

on Confluent, Data Integration, Data Migration, Data Streaming, Data Warehouse, Google BigQuery • July 14th, 2022


Organizations today rely on the Confluent Cloud to stream real-time data using Apache Kafka. They can use this real-time data with any popular data warehouse like Google BigQuery for further analysis. Google BigQuery can further combine with powerful BI tools such as PowerBI, Tableau, Google Data Studio, and more to gain meaningful insights into Confluent cloud data, which helps organizations make effective data-driven decisions. Organizations can connect Confluent Cloud data with Google BigQuery using Confluent BigQuery Sink Connector or third-party ETL (extract, transform and load) tools.

In this article, you will learn to connect Confluent Cloud to BigQuery using the Confluent BigQuery Sink connector. 


What is Confluent Cloud?


Confluent Cloud is a full-scale data streaming platform based on Apache Kafka, where businesses can access, store, and manage data in real time. It consists of a web interface, a local command-line interface, Schema Registry, REST Proxy, more than 100 pre-built Kafka connectors, and ksqlDB for administering streaming services.

With Confluent Cloud, organizations can drive business value from their data without worrying about manual processes like data transportation and data integration between different devices. Confluent Cloud simplifies processes such as connecting data sources to Kafka, building streaming applications, and securing, monitoring, and managing Kafka infrastructure.

Key Features of Confluent Cloud

  • Confluent Control Center: Confluent Control Center is a GUI-based application to manage and monitor Kafka. It allows organizations to manage Kafka Connect and easily create, edit, and manage connections with other systems. Organizations can monitor data streams from producer and consumer, ensure every message is delivered, and measure how long it takes to deliver these messages. Organizations can use Control Center to build a production data pipeline based on Kafka without writing a single line of code.
  • Self-Balancing: The Confluent Platform can deploy hundreds of brokers, manage thousands of Kafka topics, and handle millions of messages per hour. When brokers are lost due to network issues, or new topics are created, partitions must be reassigned to balance the workload, which generates extra resource overhead. Self-Balancing automates this partition reassignment, continuously rebalancing the cluster without manual intervention.
  • Confluent Cluster Linking: Cluster linking in Confluent Cloud is a fully managed service to move data from one Confluent cluster to another. Cluster linking creates perfect copies of your Kafka topics and programmatically keeps data in-sync across clusters. It can quickly build multi-data centers, multi-region, and hybrid cloud deployments.

What is Google BigQuery?


Developed in 2010, Google BigQuery is a highly scalable, fully managed data warehouse with many cloud functionalities, which helps organizations store terabytes of data seamlessly. It leverages standard SQL queries to analyze and obtain answers from colossal amounts of data faster. Google BigQuery uses columnar storage, which provides fast query performance and high data compression.

BigQuery is also called a SQL-based Data Warehouse as a Service (DWaaS) with zero infrastructure management. It is a serverless warehouse that does not require any upfront hardware provisioning or management. BigQuery runs standard SQL queries, and all requests must be authenticated. Google provides a complete package to its users, with Big Data loading features on Google Cloud Storage and connectivity with various Google apps like Apps Script. Google BigQuery has many built-in features, such as Machine Learning and AI capabilities, Geospatial Analysis, and Business Intelligence.

BigQuery stores data in a columnar format optimized for analytical queries and displays it as tables with rows and columns, with full support for database transaction semantics (ACID).

Google BigQuery allows developers and data scientists to work with programming languages such as Java, Python, Go, and JavaScript. They can also use BigQuery API to transform and manage data effectively.

Key Features of Google BigQuery

  • Machine Learning: Google BigQuery allows businesses to create and execute Machine Learning Models using standard SQL queries. Organizations can execute machine learning models like Linear Regression, Multi-class Regression, Binary Logistic Regression, K-means clustering, and more with BigQuery.
  • BigQuery BI Engine: BigQuery consists of a BI engine that helps businesses process large volumes of data with sub-second query response time and high concurrency. This BI engine can integrate with powerful tools such as Google Data Studio, Tableau, Power BI, and more to gain meaningful insights.
  • Real-time Analytics: BigQuery keeps businesses up to date with real-time transfer and analytics. It assists businesses in allocating resources independently to deliver the best performance and outcomes that allow organizations to generate business reports as required.
  • User Friendly: With BigQuery, organizations can store and analyze their data with simple clicks. The user-friendly web interface of BigQuery allows organizations to set up a cloud data warehouse easily without installing clusters, choosing storage sizes, encryption settings, and more.

Reliably integrate data with Hevo’s Fully Automated No Code Data Pipeline

If yours is anything like the 1000+ data-driven companies that use Hevo, more than 70% of the business apps you use are SaaS applications. Integrating the data from these sources in a timely way is crucial to fuel analytics and the decisions that are taken from it. But given how fast API endpoints can change, creating and managing these pipelines can be a soul-sucking exercise.

Hevo’s no-code data pipeline platform lets you connect 150+ sources in a matter of minutes to deliver data in near real-time to your warehouse. What’s more, the in-built transformation capabilities and the intuitive UI mean even non-engineers can set up pipelines and achieve analytics-ready data in minutes.

All of this combined with transparent pricing and 24×7 support makes us the most loved data pipeline software in terms of user reviews.

Take our 14-day free trial to experience a better way to manage data pipelines.

Get started for Free with Hevo!

Connecting Confluent Cloud to BigQuery

Method 1: Using Hevo to Set Up Confluent Cloud to BigQuery


Hevo provides an Automated No-code Data Pipeline that helps you move your Confluent Cloud data to BigQuery. Hevo is fully managed and completely automates the process of not only loading data from your 150+ data sources (including 40+ free sources) but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

Using Hevo, you can connect Confluent Cloud to BigQuery in the following 2 steps:

  • Step 1: Configure Confluent Cloud as the Source in your Pipeline by following these steps:
    • Step 1.1: In the Asset Palette, select PIPELINES.
    • Step 1.2: In the Pipelines List View, click + CREATE.
    • Step 1.3: Select “Confluent Cloud” on the Select Source Type page.
    • Step 1.4: Enter the following information on the Configure your Confluent Cloud Source page:
      • Pipeline Name: A unique name for the Pipeline, not exceeding 255 characters.
      • Bootstrap Server(s): The server(s) used for bootstrapping, taken from Kafka Confluent Cloud. 
      • API Key: The API key that was retrieved from your Confluent Cloud account.
      • API Secret: Your Kafka Confluent Cloud account’s API secret.
    • Step 1.5: Click TEST & CONTINUE.
    • Step 1.6: After that, configure the data ingestion and set up the Destination.
  • Step 2: To set up Google BigQuery as a destination in Hevo, follow these steps:
    • Step 2.1: In the Asset Palette, select DESTINATIONS.
    • Step 2.2: In the Destinations List View, click + CREATE.
    • Step 2.3: Select Google BigQuery from the Add Destination page.
    • Step 2.4: Choose the BigQuery connection authentication method on the Configure your Google BigQuery Account page.
  • Step 2.5: Choose one of the following:
    • Connecting using a Service Account:
      • Attach the Service Account Key file.
      • Note that Hevo accepts key files in JSON format only.
      • Click CONFIGURE GOOGLE BIGQUERY ACCOUNT.
    • Connecting using a User Account:
      • Click + to add a Google BigQuery account.
      • Log in as a user with BigQuery Admin and Storage Admin permissions.
      • Click Allow to grant Hevo access to your data.
  • Step 2.6: Set the following parameters on the Configure your Google BigQuery page:
    • Destination Name: A unique name for your Destination.
    • Project ID: The ID of the BigQuery project to which the data is loaded, as retrieved earlier.
    • Dataset ID: The name of the dataset to which you want to sync your data, as retrieved earlier.
    • GCS Bucket: The cloud storage bucket, retrieved earlier, in which files must be staged before being uploaded to BigQuery.
    • Enable Streaming Inserts: Enable this option to stream data to your BigQuery Destination as it arrives from the Source, instead of loading it in batches via a job on the defined Pipeline schedule. To learn more, see Near Real-time Data Loading Using Streaming. This setting cannot be changed later.
    • Sanitize Table/Column Names: Enable this option to replace spaces and non-alphanumeric characters in table and column names with underscores (_). See Name Sanitization for more details.
  • Step 2.7: Click Test Connection to test connectivity with the BigQuery warehouse.
  • Step 2.8: Once the test is successful, click SAVE DESTINATION.

Deliver Smarter, Faster Insights with your Unified Data

Using manual scripts and custom code to move data into the warehouse is cumbersome. Changing API endpoints and limits, ad-hoc data preparation, and inconsistent schema makes maintaining such a system a nightmare. Hevo’s reliable no-code data pipeline platform enables you to set up zero-maintenance data pipelines that just work seamlessly.

  • Wide Range of Connectors: Instantly connect and read data from 150+ sources including SaaS apps and databases, and precisely control pipeline schedules down to the minute.
  • In-built Transformations: Format your data on the fly with Hevo’s preload transformations using either the drag-and-drop interface or our nifty python interface. Generate analysis-ready data in your warehouse using Hevo’s Postload Transformation.
  • Near Real-Time Replication: Get access to near real-time replication for all database sources with log-based replication. For SaaS applications, near real-time replication is subject to API limits.   
  • Auto-Schema Management: Correcting improper schema after the data is loaded into your warehouse is challenging. Hevo automatically maps source schema with destination warehouse so that you don’t face the pain of schema errors.
  • Transparent Pricing: Say goodbye to complex and hidden pricing models. Hevo’s Transparent Pricing brings complete visibility to your ELT spend. Choose a plan based on your business needs. Stay in control with spend alerts and configurable credit limits for unforeseen spikes in the data flow.
  • 24×7 Customer Support: With Hevo you get more than just a platform, you get a partner for your pipelines. Discover peace with round-the-clock “Live Chat” within the platform. What’s more, you get 24×7 support even during the 14-day free trial.
  • Security: Discover peace with end-to-end encryption and compliance with all major security certifications including HIPAA, GDPR, and SOC-2.

Get Started for Free with Hevo’s 14-day Free Trial.

Method 2: Using Custom Code to Move Data from Confluent Cloud to BigQuery

This method of connecting Confluent Cloud to BigQuery uses a custom-code approach, which is time-consuming and somewhat tedious to implement, so it is best suited for users with a technical background. You can use the Confluent BigQuery Sink connector to export data from Confluent Cloud to BigQuery.

Confluent Cloud allows businesses to store their real-time data by automating processes like creating clusters and Kafka topics. The Sink connector can create BigQuery tables with the appropriate BigQuery table schema when streaming data from Confluent Cloud with registered schemas. Follow the below steps to connect Confluent Cloud to BigQuery.

  • Install the BigQuery Connector

This is the first step in connecting Confluent Cloud to BigQuery. Install the BigQuery Connector using the Confluent Hub client with the below command, or download the ZIP file manually.

confluent-hub install wepay/kafka-connect-bigquery:latest

You can install a specific version of the connector by replacing “latest” with a version number. For example:

confluent-hub install wepay/kafka-connect-bigquery:1.1.2

The Confluent Cloud to BigQuery Sink connector can stream table records from Kafka topics to Google BigQuery. The table records are streamed at high throughput rates to facilitate analytical queries in real-time. In this tutorial, you need to run the Confluent platform locally. 


Adding a new connector plugin for the Confluent Cloud to BigQuery Sink connector requires restarting Kafka Connect. Using the below command, you can use the Confluent CLI to restart Kafka Connect.

confluent local stop connect && confluent local start connect
Using CONFLUENT_CURRENT: /Users/username/Sandbox/confluent-snapshots/var/confluent.NuZHxXfq
Starting zookeeper
zookeeper is [UP]
Starting kafka
kafka is [UP]
Starting schema-registry
schema-registry is [UP]
Starting kafka-rest
kafka-rest is [UP]
Starting connect
connect is [UP]

Verify that the BigQuery Sink connector plugin is installed correctly, using the below command.

curl -sS localhost:8083/connector-plugins | jq .[].class | grep BigQuerySinkConnector
"com.wepay.kafka.connect.bigquery.BigQuerySinkConnector"
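If you prefer to script this check rather than chain curl and jq, the same verification can be done with the Python standard library. This is a minimal sketch; it assumes a local Kafka Connect worker listening on port 8083, as in the tutorial above.

```python
import json
import urllib.request

def bigquery_plugins(plugins):
    """Filter a Kafka Connect /connector-plugins response (a list of dicts
    with a "class" key) down to the BigQuery sink connector classes."""
    return [p["class"] for p in plugins if "BigQuerySinkConnector" in p["class"]]

def check_local_connect(url="http://localhost:8083/connector-plugins"):
    """Fetch the plugin list from the Connect REST API and filter it.
    Assumes the worker from this tutorial is running locally."""
    with urllib.request.urlopen(url) as resp:
        return bigquery_plugins(json.load(resp))
```

Calling `check_local_connect()` should return a non-empty list once the connector is installed and Kafka Connect has been restarted.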
  • Setting up the Google Cloud Platform BigQuery Connector
    • Ensure that you have the below prerequisites for setting up the connector.
      • An active Google Cloud Platform account.
      • A BigQuery project. You can build the project using the Google Cloud Console.
      • A BigQuery dataset.
      • A service account to access the BigQuery project containing the dataset. While creating the service account, you can create and download a key as a JSON file, as shown in the example below.
   {
  	"type": "service_account",
  	"project_id": "confluent-842583",
  	"private_key_id": "...omitted...",
  	"private_key": "-----BEGIN PRIVATE ...omitted... =\n-----END PRIVATE KEY-----\n",
  	"client_email": "confluent2@confluent-842583.iam.gserviceaccount.com",
  	"client_id": "...omitted...",
  	"auth_uri": "https://accounts.google.com/o/oauth2/auth",
  	"token_uri": "https://oauth2.googleapis.com/token",
  	"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  	"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/confluent2%40confluent-842583.iam.gserviceaccount.com"
   }

According to the GCP specifications (https://cloud.google.com/bigquery/docs/access-control), the service account will either need the BigQueryEditor primitive IAM role or the bigquery.dataEditor predefined IAM role. The minimum permissions are as follows:

bigquery.datasets.get
bigquery.tables.create
bigquery.tables.get
bigquery.tables.getData
bigquery.tables.list
bigquery.tables.update
bigquery.tables.updateData
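Before wiring the key file into the connector configuration, it can save debugging time to sanity-check it. The sketch below validates the field names from the standard service-account key format shown above; it is an illustrative helper, not part of the connector.

```python
import json

# Fields every Google service-account key file is expected to contain.
REQUIRED_FIELDS = {
    "type", "project_id", "private_key_id", "private_key",
    "client_email", "client_id", "auth_uri", "token_uri",
}

def validate_key(key: dict) -> list:
    """Return a list of problems found in a service-account key dict
    (empty list means the key looks structurally valid)."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - key.keys())]
    if key.get("type") != "service_account":
        problems.append("'type' must be 'service_account'")
    return problems

def validate_key_file(path: str) -> list:
    """Load a key file from disk and validate it."""
    with open(path) as fh:
        return validate_key(json.load(fh))
```

Running `validate_key_file("/path/to/key.json")` on the downloaded key before starting the connector catches truncated or mis-copied files early.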
  • Starting the BigQuery Sink Connector
    • Follow the below steps to start the BigQuery Sink Connector.
      • Create the file named register-kcbd-connect-bigquery.json for storing the connector configuration.
      • Start the connector with the below command.
curl -i -X POST -H "Accept:application/json" -H  "Content-Type:application/json" http://localhost:8083/connectors/ -d @register-kcbd-connect-bigquery.json
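The contents of register-kcbd-connect-bigquery.json are not shown above. The sketch below generates a minimal configuration; the property names follow the WePay kafka-connect-bigquery connector's 1.x documentation, while the connector name, topic, project, dataset, and key file path are placeholders you would replace with your own values.

```python
import json

def build_config(topic: str, project: str, dataset: str, keyfile: str) -> dict:
    """Build a minimal BigQuery Sink connector registration payload.
    Property names follow the WePay kafka-connect-bigquery connector;
    the values passed in are illustrative placeholders."""
    return {
        "name": "kcbq-connect1",
        "config": {
            "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
            "tasks.max": "1",
            "topics": topic,
            "sanitizeTopics": "true",
            "autoCreateTables": "true",
            "project": project,
            # Map all topics to one dataset, e.g. ".*=ConfluentDataSet".
            "datasets": f".*={dataset}",
            "keyfile": keyfile,
        },
    }

# Write the file used by the curl registration command above.
config = build_config("kcbq-quickstart1", "confluent-842583",
                      "ConfluentDataSet", "/tmp/key.json")
with open("register-kcbd-connect-bigquery.json", "w") as fh:
    json.dump(config, fh, indent=2)
```

Regenerating the file this way keeps the topic, dataset, and key file path consistent across the registration and verification steps.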
  • Start your Kafka Producer
    • Follow the below steps to start your Kafka producer.
      • Go to the Kafka bin folder and start the Kafka producer in a new terminal.
      • Type the below commands on the terminal.
./kafka-avro-console-producer --broker-list localhost:9092 --topic kcbq-quickstart1 --property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"f1","type":"string"}]}'
  • Enter the text of two records and press enter after each line.
{"f1":"Testing the Kafka-BigQuery Connector!"}
{"f1":"Testing the Kafka-BigQuery Connector for a second time!"}
  • Check Results in BigQuery
    • Follow the below steps to check results in BigQuery.
      • Navigate to the BigQuery editor in Google Cloud Platform.
      • Enter the below SQL select statement.
SELECT * FROM ConfluentDataSet.quickstart1;
  • Clean Up Resources
    • Follow the below steps to clean up the resources.
      • Delete the connector.
curl -X DELETE localhost:8083/connectors/kcbq-connect1
  • Stop Confluent services.
confluent local stop

Conclusion

This article talks about connecting Confluent Cloud to BigQuery. Confluent Cloud helps organizations use their real-time data streams and leverage the functionalities of Apache Kafka. It enables organizations to view and create environments, clusters, and Kafka topics quickly, without worrying about the underlying mechanics. Organizations can store their real-time data streams in a data warehouse like Google BigQuery to analyze the data and gain meaningful insights for better decisions.

Visit our Website to Explore Hevo

Hevo offers a No-code Data Pipeline that can automate your data transfer process, hence allowing you to focus on other aspects of your business like Analytics, Marketing, Customer Management, etc.

This platform allows you to transfer data from 150+ sources (including 40+ Free Sources) such as Confluent Cloud and Cloud-based Data Warehouses like Snowflake, Google BigQuery, etc. It will provide you with a hassle-free experience and make your work life much easier.

Want to take Hevo for a spin? 

Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.
