Various organizations rely on the open-source streaming platform Kafka to build real-time data applications and pipelines. These organizations are also looking to modernize their IT landscape and adopt BigQuery to meet their growing analytics needs.

By establishing a connection from Kafka to BigQuery, these organizations can quickly activate and analyze data-derived insights as they happen, as opposed to waiting for a batch process to be completed.

Methods to Set up Kafka to BigQuery Connection

You can easily set up your Kafka to BigQuery connection using the following 3 methods.

Method 1: Using Hevo Data to Move Data from Kafka to BigQuery

Hevo is the only real-time ELT no-code data pipeline platform that cost-effectively automates pipelines that are flexible to your needs. With integrations for 150+ data sources (40+ free sources), we help you not only export data from sources and load it into destinations, but also transform and enrich your data and make it analysis-ready, with zero data loss.

Sign up here for a 14-day free trial

Hevo takes care of all the data preprocessing needed to set up Kafka to BigQuery integration and lets you focus on key business activities. Hevo provides a one-stop solution for all Kafka use cases and collects the data stored in your Kafka topics and clusters. Moreover, since Google BigQuery has built-in support for nested and repeated columns, Hevo neither splits nor compresses the JSON data.

Here are the steps to move data from Kafka to BigQuery using Hevo:

  • Authenticate Kafka Source: Configure Kafka as the source for your Hevo Pipeline by specifying Broker and Topic Names.

Check out our documentation to learn more about the connector.

  • Configure BigQuery Destination: Configure the Google BigQuery Data Warehouse account, where the data needs to be streamed, as your destination for the Hevo Pipeline.

Read more on our BigQuery connector here.

With continuous real-time data movement, Hevo allows you to combine Kafka data with your other data sources and seamlessly load it into BigQuery through a no-code, easy-to-set-up interface. Hevo Data also offers live support and easy transformations, and has been built to keep up with your needs as your operation scales up. Try our 14-day full-feature access free trial!

Key features of Hevo are:

  • Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to transfer.
  • Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.

Method 2: Using Custom Code to Move Data from Kafka to BigQuery

The steps to build a custom-coded data pipeline between Apache Kafka and BigQuery fall into two parts:

Step 1: Streaming Data from Kafka

There are various methods and open-source tools that can be employed to stream data from Kafka. This blog covers the following methods:

Streaming with Kafka Connect

Kafka Connect is an open-source component of Apache Kafka. It is designed to connect Kafka with external systems such as databases, key-value stores, and file systems.

It allows users to stream data from Kafka straight into BigQuery with sub-minute latency through its underlying framework. Kafka Connect lets you make use of existing connector implementations, so you don't need to build new integrations every time you move new data. It provides a sink connector that continuously consumes data from Kafka topics and streams it to an external storage location within seconds, as well as a source connector that ingests entire databases and streams table updates to Kafka topics.

There is no inbuilt connector for Google BigQuery in Kafka Connect. Hence, you will need a third-party connector such as the one developed by WePay. With this connector, Google BigQuery tables can be auto-generated from the Avro schema, and it also helps in dealing with schema updates. Since Google BigQuery streaming is backward compatible, users can easily add new fields with default values, and streaming will continue uninterrupted.

Using Kafka Connect, data can be streamed and ingested into Google BigQuery in real time. This, in turn, lets users carry out analytics on the fly.
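
Below is a minimal, illustrative sketch of how such a pipeline might be wired up, assuming a Kafka Connect worker exposing its REST API on localhost:8083 with the WePay/Confluent BigQuery sink connector installed on it. The topic, project, dataset, and keyfile values are placeholders, and exact property names can vary between connector versions, so treat this as a starting point rather than a definitive configuration.

```python
import json
import requests  # assumes the `requests` package is installed

# Hypothetical configuration for the WePay/Confluent BigQuery sink connector;
# topic, project, dataset, and keyfile values are placeholders.
connector_config = {
    "name": "kafka-to-bigquery-sink",
    "config": {
        "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
        "tasks.max": "1",
        "topics": "my-kafka-topic",
        "project": "my-gcp-project",
        "defaultDataset": "my_dataset",
        "keyfile": "/etc/secrets/bigquery-service-account.json",
        "autoCreateTables": "true",
    },
}

# Register the connector with a Kafka Connect worker whose REST API is
# assumed to be listening on localhost:8083.
response = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector_config),
)
response.raise_for_status()
print("Connector created:", response.json()["name"])
```

Once registered, the connector runs inside the Connect worker and continuously pushes records from the configured topic into BigQuery, so no separate consumer application is needed.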

Limitations of Streaming with Kafka Connect
  • In this method, data is partitioned only by processing time.

Streaming Data with Apache Beam

Apache Beam is an open-source, unified programming model for building batch and stream data processing jobs that run from a single codebase on a supported execution engine. The Apache Beam model helps abstract away the complexity of parallel data processing, allowing you to focus on what your job needs to do rather than on how the job gets executed.

One of the major downsides of streaming with Kafka Connect is that it can only partition data by processing time, which can lead to data arriving in the wrong partition. Apache Beam resolves this issue because, in addition to supporting both batch and stream processing, it can partition data by event time.

Apache Beam has a supported distributed processing backend called Cloud Dataflow, which executes your code as a cloud job, making it fully managed and auto-scaled. The number of workers is fully elastic, changing according to your current workload, and the cost of execution changes accordingly.
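
As an illustration, here is a minimal sketch of a Beam streaming pipeline in Python that reads JSON messages from Kafka and writes them to BigQuery. It assumes the apache-beam[gcp] package; the broker address, topic, table, and schema are placeholders, and the Python Kafka IO is a cross-language transform that needs a Java runtime available. To run it on Cloud Dataflow you would add the usual Dataflow pipeline options (runner, project, region).

```python
import json

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder broker, topic, and table names.
BROKERS = "kafka-broker:9092"
TOPIC = "events"
TABLE = "my-project:my_dataset.events"

# Streaming mode; add runner/project/region options to run on Cloud Dataflow.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": BROKERS},
            topics=[TOPIC],
        )
        # ReadFromKafka yields (key, value) byte pairs; decode the JSON value.
        | "DecodeJson" >> beam.Map(lambda kv: json.loads(kv[1].decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            TABLE,
            schema="user_id:STRING,event_type:STRING,event_time:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```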

Limitations of Streaming Data with Apache Beam

  • Apache Beam incurs an extra cost for running managed workers.
  • Apache Beam is not a part of the Kafka ecosystem.

Hevo supports both Batch Load and Streaming Load for the Kafka to BigQuery use case and provides a no-code, fully managed, minimal-maintenance solution for it.

Step 2: Ingesting Data to BigQuery

Before you start streaming in from Kafka to BigQuery, you need to check the following boxes:

  • Make sure you have Write access to the dataset that contains your destination table, to prevent subsequent errors when streaming (a quick access check is sketched after this list).
  • Check the quota policy for streaming data on BigQuery to ensure you are not in violation of any of the policies.
  • Ensure that billing is enabled for your GCP (Google Cloud Platform) account. Streaming is not available on the free tier of GCP, so to stream data into Google BigQuery you have to use the paid tier.
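
As a quick sanity check for the first item, the sketch below uses the google-cloud-bigquery client to confirm that your credentials can at least see the destination table; the table ID is a placeholder, and this does not verify streaming quotas or billing, which you still need to check in the console.

```python
from google.cloud import bigquery  # assumes google-cloud-bigquery is installed
from google.api_core.exceptions import Forbidden, NotFound

# Placeholder project, dataset, and table names.
TABLE_ID = "my-project.my_dataset.events"

client = bigquery.Client()

try:
    table = client.get_table(TABLE_ID)
    print(f"Destination table found with {len(table.schema)} columns.")
except NotFound:
    print("Destination table does not exist yet; create it before streaming.")
except Forbidden:
    print("Credentials lack access to the dataset; grant Write access first.")
```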

Now, let us discuss the methods to ingest our streamed data from Kafka to BigQuery. The following approaches are covered in this post:

Streaming with BigQuery API

The Google BigQuery API is a data platform API that lets users manage, create, share, and query data. It supports streaming data directly into Google BigQuery, with a quota of up to 100K rows per second per project.

Real-time data streaming with the Google BigQuery API costs $0.05 per GB. To make use of the Google BigQuery API, it has to be enabled for your project. To enable the API:

  • Ensure that you have a project created.
  • In the GCP Console, click on the hamburger menu, select APIs & Services, and click on Dashboard.
  • In the APIs & Services window, select Enable APIs and Services.
  • A search box will appear. Enter Google BigQuery; two results, BigQuery Data Transfer API and BigQuery API, will show up. Select and enable both of them.

With the Google BigQuery API enabled, the next step is to move the data from Apache Kafka into Google BigQuery through a stream processing framework like Kafka Streams. Kafka Streams is an open-source library for building scalable streaming applications on top of Apache Kafka, and it lets users run their code as a regular Java application. The pipeline reads from a Kafka topic, optionally filters or transforms the records, and streams the results from Kafka to BigQuery. This approach supports both processing-time and event-time partitioning models.
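
For illustration, here is a minimal Python sketch of the same idea, using kafka-python as a stand-in for Kafka Streams (which is a Java library) together with the google-cloud-bigquery client's streaming inserts. The broker, topic, and table names are placeholders; a production consumer would batch rows and handle retries.

```python
import json

from google.cloud import bigquery   # assumes google-cloud-bigquery is installed
from kafka import KafkaConsumer     # assumes kafka-python is installed

# Placeholder destination table.
TABLE_ID = "my-project.my_dataset.events"

# Placeholder broker and topic; each record's value is expected to be JSON.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="kafka-broker:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="latest",
)

client = bigquery.Client()

for message in consumer:
    # Each Kafka record becomes one row in the streaming insert request.
    errors = client.insert_rows_json(TABLE_ID, [message.value])
    if errors:
        # BigQuery reports per-row errors rather than raising an exception,
        # so failed rows must be retried or dead-lettered by your own code.
        print("Streaming insert errors:", errors)
```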

Limitations of Streaming with BigQuery API

  • Though streaming with the Google BigQuery API gives you complete control over your records, you have to design a robust system yourself for it to scale successfully.
  • You also have to handle all streaming errors and edge cases independently.

Batch Loading Into Google Cloud Storage (GCS)

To use this technique, you could make use of Secor. Secor is a tool designed to deliver data from Apache Kafka into object storage systems such as GCS and Amazon S3. From GCS, you then load the data into Google BigQuery using a load job, which can be triggered programmatically, manually via the BigQuery UI, or through the bq command-line tool.
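
The sketch below shows what the second half of this approach might look like: a BigQuery load job that picks up newline-delimited JSON files that Secor has delivered to a GCS bucket. The bucket path and table ID are placeholders, and schema auto-detection is used only for brevity.

```python
from google.cloud import bigquery  # assumes google-cloud-bigquery is installed

# Placeholder GCS path (where Secor delivered the files) and destination table.
GCS_URI = "gs://my-bucket/secor/events/*.json"
TABLE_ID = "my-project.my_dataset.events"

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # or supply an explicit schema instead
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(GCS_URI, TABLE_ID, job_config=job_config)
load_job.result()  # wait for the batch load to finish

table = client.get_table(TABLE_ID)
print(f"Loaded data; {TABLE_ID} now has {table.num_rows} rows.")
```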

Limitations of Batch Loading in GCS

  • Secor lacks support for the Avro input format, which forces you to always use a JSON-based input format.
  • This is a two-step process that can lead to latency issues. 
  • This technique does not stream data in real-time. This becomes a blocker in real-time analysis for your business. 
  • This technique requires a lot of maintenance to keep up with new Kafka topics and fields. To reflect these changes, you would need to manually update the schema of the Google BigQuery table.

Method 3: Using the Kafka to BigQuery Connector to Move Data from Apache Kafka to BigQuery

The Kafka Connect BigQuery Sink connector is a handy way to stream data into BigQuery tables. When streaming data from Apache Kafka topics with registered schemas, the sink connector creates BigQuery tables with an appropriate table schema, based on the Kafka schema information for the topic.
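
The schema-driven behaviour described above typically relies on the Avro converter and a Schema Registry. As a rough illustration, the settings below (placeholder Schema Registry URL; property names may differ slightly between versions) could be merged into the connector configuration from the earlier Kafka Connect sketch so that the connector can derive BigQuery table schemas from the registered Avro schemas.

```python
# Hypothetical extra settings to merge into connector_config["config"] from the
# earlier Kafka Connect sketch. They tell Kafka Connect to deserialize records
# with the Confluent Avro converter, so the sink connector can derive each
# BigQuery table schema from the Avro schema registered for the topic.
schema_registry_settings = {
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
}
```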

Here are some limitations associated with the Kafka Connect BigQuery Sink Connector:

  • No support for schemas with floating-point fields containing NaN or +Infinity values.
  • No support for schemas with recursion.
  • If you configure the connector with upsertEnabled or deleteEnabled, it doesn't support Single Message Transformations (SMTs) that modify the topic name.

Need for Kafka to BigQuery Migration

While you can use the Kafka platform to build real-time data pipelines and applications, BigQuery lets you modernize your IT landscape and meet your growing analytics needs.

Connecting Kafka to BigQuery allows real-time data processing, so you can analyze and act on data as it is generated. This enables valuable insights and faster decision-making. A common use case is in the finance industry, where real-time data processing makes it possible to identify fraudulent activity.

Yet another reason for connecting Kafka to BigQuery is scalability. As both platforms are highly scalable, you can handle large data volumes without performance issues. Scaling your data processing systems for growing data volumes can be done with ease, since Kafka can handle millions of messages per second while BigQuery can handle petabytes of data.

Another reason for connecting Kafka to BigQuery is cost-effectiveness. Kafka, being an open-source platform, involves no licensing costs, and BigQuery's pay-as-you-go pricing model means you only pay for the data you process. Integrating both platforms therefore requires you to pay only for the data that is processed and analyzed, helping reduce overall costs.

Conclusion

This article provided you with a step-by-step guide on how you can set up a Kafka to BigQuery connection using custom code or using Hevo. However, there are certain limitations associated with the custom-code method: you have to implement it manually, which consumes time and resources and is error-prone, and you need working knowledge of the backend tools to successfully implement an in-house data transfer mechanism.

Hevo Data provides an Automated No-code Data Pipeline that empowers you to overcome the above-mentioned limitations. Hevo caters to 150+ data sources (including 40+ free sources) and can seamlessly transfer your data from Kafka to BigQuery within minutes. Hevo’s Data Pipeline enriches your data and manages the transfer process in a fully automated and secure manner without having to write any code. It will make your life easier and make data migration hassle-free.

Learn more about Hevo

Want to take Hevo for a spin? Sign up for a 14-day free trial and experience the feature-rich Hevo suite firsthand.

Share your understanding of the Kafka to BigQuery Connection in the comments below!

Freelance Technical Content Writer, Hevo Data

Bukunmi is curious about complex concepts and the latest trends in data science, and combines this with his flair for writing to curate content that helps data teams solve business challenges.