Apache Kafka to BigQuery – Steps to Stream Data in Real-Time

Data Integration • January 30th, 2020

Organizations today have access to a wide range of streaming data. Data is generated by recommendation engines, page clicks, internet searches, product orders, and more. It is necessary to have an infrastructure that lets you stream your data as it gets generated and carry out analytics on the go. To aid this objective, incorporating a data pipeline from Apache Kafka to BigQuery is a step in the right direction.

In this blog post, we will highlight how you can stream data from Apache Kafka to BigQuery and, in turn, power scalable and efficient delivery of data in real-time.

Introduction to Apache Kafka

Apache Kafka is an open-source distributed event streaming platform. It provides a reliable pipeline to process data generated from various sources, sequentially and incrementally. Kafka handles both online and offline data consumption, as the ingested data is persisted on disk and replicated within the cluster to prevent data loss. Kafka runs as a distributed system, split across multiple machines that work together in a single cluster. Apache Kafka supports use cases such as:

  • Publish and subscribe to streams of records
  • Store streams of records in a fault-tolerant way
  • Process streams of records as they occur
  • Provide a framework, Kafka Streams, for developing logic that performs analytics across streams of data.

Kafka is typically used to build real-time streaming data pipelines and streaming applications that react to streams of data as they arrive.
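To make the publish side of this concrete, here is a minimal sketch of a Java producer that publishes a JSON record to a Kafka topic. The broker address and the page-clicks topic name are placeholder assumptions for this example, not part of any particular setup.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ClickProducer {
      public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; point this at your own cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          // Publish one page-click event to the hypothetical "page-clicks" topic.
          producer.send(new ProducerRecord<>("page-clicks", "user-42", "{\"page\": \"/pricing\"}"));
        }
      }
    }

A consumer or a Kafka Streams application can then subscribe to the same topic and process these records as they occur.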

Introduction to Google BigQuery

BigQuery is a scalable and fully managed data warehouse built by Google that runs super-fast SQL queries. It analyzes data on a massive scale and is fully serverless, abstracting you from managing any form of infrastructure so you are free to focus mainly on analytics. BigQuery provides a partitioning model that lets you choose how your ingested data is queried. The partitioning model is based on the concepts below:

  • Processing Time: This partitioning model is based on the time an event was observed, usually the ingestion date.
  • Event Time: In this case, the table is partitioned based on one of the TIMESTAMP/DATE fields in the incoming record.

These partitions help you avoid expensive and time-consuming full table scans, as you only pay for the period you query. BigQuery supports both batch loading and streaming data ingestion methods.
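To illustrate the event-time model, the sketch below uses the google-cloud-bigquery Java client to create a table partitioned by day on an event_time TIMESTAMP column. The dataset, table, and field names are placeholders chosen for this example.

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.Field;
    import com.google.cloud.bigquery.Schema;
    import com.google.cloud.bigquery.StandardSQLTypeName;
    import com.google.cloud.bigquery.StandardTableDefinition;
    import com.google.cloud.bigquery.TableId;
    import com.google.cloud.bigquery.TableInfo;
    import com.google.cloud.bigquery.TimePartitioning;

    public class CreatePartitionedTable {
      public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        Schema schema = Schema.of(
            Field.of("event_time", StandardSQLTypeName.TIMESTAMP),
            Field.of("payload", StandardSQLTypeName.STRING));

        // Partition by day on the event_time column (event-time model).
        // Omitting setField(...) falls back to ingestion-time (processing-time) partitioning.
        TimePartitioning partitioning = TimePartitioning.newBuilder(TimePartitioning.Type.DAY)
            .setField("event_time")
            .build();

        StandardTableDefinition definition = StandardTableDefinition.newBuilder()
            .setSchema(schema)
            .setTimePartitioning(partitioning)
            .build();

        bigquery.create(TableInfo.of(TableId.of("my_dataset", "kafka_events"), definition));
      }
    }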

Now that we have covered some background on both Apache Kafka and Google BigQuery, let us look at the options available for streaming data from Kafka to BigQuery.

Methods to load data from Kafka to BigQuery

Broadly, there are two ways to stream data from Kafka to BigQuery:

  • Approach 1: Implement a no-code Data Pipeline tool like Hevo Data. This will help you seamlessly move data in real-time.
  • Approach 2: Write custom ETL code to move data from Kafka to BigQuery.

This blog talks about approach 2 in detail. The blog also highlights some of the key challenges faced while using this method. Let’s dive in. 

Using custom code to move data from Kafka to BigQuery

The process of building a data pipeline from Apache Kafka to BigQuery is divided into two steps, namely:

  1. Streaming Data from Kafka
  2. Ingesting Data into BigQuery

Step 1: Streaming Data from Kafka

There are various methods and open-source tools which can be employed to stream data from Kafka. This blog covers the following methods:

  • Streaming with Kafka Connect
  • Streaming with Apache Beam

Streaming with Kafka Connect

Kafka Connect is an open-source component of Kafka. It is designed to connect Kafka with external systems such as databases, key-value stores, and file systems.

It allows users to stream data from Kafka straight into BigQuery with sub-minute latency through its underlying framework. Kafka Connect gives users the advantage of reusing existing connector implementations, so you don't need to build new integrations every time you move new data. Kafka Connect provides sink connectors that continuously consume data from Kafka topics and stream it to an external storage location within seconds, as well as source connectors that ingest entire databases and stream table updates to Kafka topics.

There is no built-in connector for BigQuery in Kafka Connect. Hence, you will need to use the open-source BigQuery sink connector developed by WePay. When using this connector, BigQuery tables can be auto-generated from the Avro schema seamlessly. The connector also helps with schema updates: as BigQuery streaming is backwards compatible, you can easily add new fields with default values and streaming will continue uninterrupted.

Using Kafka Connect, data can be streamed and ingested into BigQuery in real-time. This, in turn, gives users the ability to carry out analytics on the fly.
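As a rough illustration, a sink configuration for the WePay connector looks something like the sketch below. Property names differ between connector versions, so treat the keys as well as the topic, project, and dataset names here as assumptions to verify against the version you deploy.

    name=kafka-to-bigquery
    connector.class=com.wepay.kafka.connect.bigquery.BigQuerySinkConnector
    tasks.max=1
    # Placeholder topic, project, and dataset names.
    topics=page-clicks
    project=my-gcp-project
    defaultDataset=my_dataset
    keyfile=/path/to/service-account.json
    # Let the connector create tables from the registered Avro schema.
    autoCreateTables=true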

Limitations of streaming with Kafka Connect

  • In this method, data is partitioned only by processing time.

Streaming Data with Apache Beam

Apache Beam is an open-source, unified programming model for defining both batch and streaming data processing jobs that can run on a range of execution engines. The Apache Beam model abstracts away the complexity of parallel data processing, allowing you to focus on what your job needs to do rather than how it gets executed.

One of the major downsides of streaming with Kafka Connect is that it can only ingest data by processing time, which can lead to data arriving in the wrong partition. Apache Beam resolves this issue: because it understands event time as well as processing time, records can be routed to the partition corresponding to when the event actually occurred.

Beam has a supported distributed processing backend called Cloud Dataflow that executes your code as a cloud job, making it fully managed and auto-scaled. The number of workers is elastic, changing according to your current workload, and the cost of execution changes accordingly.
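For a feel of what such a pipeline looks like, here is a hedged sketch using the Beam Java SDK: KafkaIO reads from a Kafka topic and BigQueryIO appends rows to a BigQuery table. The broker address, topic, table reference, and column name are placeholder assumptions, and a production pipeline would add schema handling, windowing, and error handling.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.SimpleFunction;
    import org.apache.beam.sdk.values.KV;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class KafkaToBigQueryPipeline {
      public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        pipeline
            // Read key/value pairs from the placeholder "page-clicks" topic.
            .apply("ReadFromKafka", KafkaIO.<String, String>read()
                .withBootstrapServers("broker:9092")
                .withTopic("page-clicks")
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())
            // Convert each Kafka record into a BigQuery TableRow.
            .apply("ToTableRow", MapElements.via(new SimpleFunction<KV<String, String>, TableRow>() {
              @Override
              public TableRow apply(KV<String, String> record) {
                return new TableRow().set("payload", record.getValue());
              }
            }))
            // Append rows to an existing (partitioned) table.
            .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
                .to("my-gcp-project:my_dataset.kafka_events")
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        pipeline.run();
      }
    }

Passing --runner=DataflowRunner (along with the usual project and region options) on the command line is what hands execution over to Cloud Dataflow's managed workers.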

Limitations of Streaming Data with Apache Beam

  • Apache Beam incurs an extra cost for running managed workers
  • Apache Beam is not a part of the Kafka ecosystem.

Step 2: Ingesting Data into BigQuery

Before you start streaming in BigQuery, you need to check the following boxes:

  • Make sure you have write access to the dataset that contains your destination table, to prevent errors when streaming.
  • Check the quota policy for streaming data on BigQuery to ensure you are not in violation of any of the policies.
  • Ensure that billing is enabled for your GCP (Google Cloud Platform) account. Streaming is not available on the free tier of GCP, so to stream data into BigQuery you have to use the paid tier.

Now, let us discuss the methods to ingest our streamed data from Kafka to BigQuery. The following approaches are covered in this post:

  • Streaming with BigQuery API
  • Batch Loading into Google Cloud Storage (GCS)

Streaming with BigQuery API

The BigQuery API is a data platform that lets users manage, create, share, and query data. It supports streaming data directly into BigQuery, with a default quota of up to 100,000 rows per second per project.

Real-time data streaming through the BigQuery API costs $0.05 per GB. To make use of the BigQuery API, it has to be enabled on your account. To enable the API:

  • Ensure that you have a project created.
  • In the GCP Console, click on the hamburger menu, select APIs & Services, and click on Dashboard.
  • In the APIs & Services window, select Enable APIs and Services.
  • A search box will appear. Enter "BigQuery". Two results, BigQuery Data Transfer API and BigQuery API, will show up. Select and enable both of them.

With the BigQuery API enabled, the next step is to move the data from Apache Kafka into BigQuery through a stream processing framework such as Kafka Streams. Kafka Streams is an open-source library for building scalable streaming applications on top of Apache Kafka, and it lets users run their streaming code as a regular Java application. The pipeline reads from a Kafka topic, filters or transforms records with Kafka Streams, and writes them into BigQuery. This approach supports both the processing-time and event-time partitioning models.
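A hedged sketch of this pattern is shown below: a Kafka Streams topology consumes a topic and pushes each record to BigQuery through the insertAll streaming API of the Java client library. The topic, table, and column names are placeholders, and a real application would batch rows and retry failed inserts rather than writing one row at a time.

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.InsertAllRequest;
    import com.google.cloud.bigquery.InsertAllResponse;
    import com.google.cloud.bigquery.TableId;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;

    public class StreamsToBigQuery {
      public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId tableId = TableId.of("my_dataset", "kafka_events");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("page-clicks", Consumed.with(Serdes.String(), Serdes.String()))
            // Stream each record into BigQuery via the insertAll (streaming) API.
            .foreach((key, value) -> {
              InsertAllResponse response = bigquery.insertAll(
                  InsertAllRequest.newBuilder(tableId)
                      .addRow(Map.of("payload", value))
                      .build());
              if (response.hasErrors()) {
                System.err.println("Insert errors: " + response.getInsertErrors());
              }
            });

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka-to-bigquery-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
      }
    }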

Limitations of Streaming with BigQuery API

  • Though streaming with the BigQuery API gives you complete control over your records, you have to design a robust system of your own for it to scale successfully.
  • You have to handle all streaming errors and downsides independently.

Batch Loading Into Google Cloud Storage (GCS)

To use this technique you could make use of Secor, a tool designed to deliver data from Apache Kafka into object storage systems such as GCS and Amazon S3. From GCS, the data is then loaded into BigQuery using a load job, either manually via the BigQuery UI or through the bq command-line tool and client libraries.
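A load job can also be triggered programmatically. The hedged sketch below uses the BigQuery Java client to load newline-delimited JSON files from a placeholder GCS path (where a tool like Secor might have written them) into a placeholder table.

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FormatOptions;
    import com.google.cloud.bigquery.Job;
    import com.google.cloud.bigquery.JobInfo;
    import com.google.cloud.bigquery.LoadJobConfiguration;
    import com.google.cloud.bigquery.TableId;

    public class LoadFromGcs {
      public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Load newline-delimited JSON files from GCS into a BigQuery table.
        LoadJobConfiguration config = LoadJobConfiguration
            .newBuilder(TableId.of("my_dataset", "kafka_events"),
                "gs://my-bucket/secor/page-clicks/*.json")
            .setFormatOptions(FormatOptions.json())
            .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
            .build();

        Job job = bigquery.create(JobInfo.of(config)).waitFor();
        if (job == null || job.getStatus().getError() != null) {
          System.err.println("Load job did not complete successfully.");
        }
      }
    }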

Limitations of Batch Loading in GCS

  • Secor lacks support for the Avro input format, which forces you to always use a JSON-based input format.
  • This is a two-step process that can lead to latency issues. 
  • This technique does not stream data in real-time. This becomes a blocker in real-time analysis for your business. 
  • This technique requires a lot of maintenance to keep up with new Kafka topics and fields, since you have to manually update the schema of the BigQuery table to reflect these changes.

A Simpler Alternative to Stream Data from Kafka to BigQuery

Hevo, a data integration platform, can mask all these complexities and stream data from Kafka to BigQuery in real-time without you having to write a single line of code. Using Hevo's point-and-click interface, you can start streaming data in just a few minutes.

Here are the steps to move data from Kafka to BigQuery using Hevo:

  1. Configure Kafka Source by specifying Broker and Topic Names
  2. Configure the Google BigQuery Warehouse where the data needs to be streamed

And Voila! Hevo overcomes the limitations of the manual approaches, thereby enabling you to analyze data from Kafka streams in real-time.

In addition to Kafka, Hevo can help you load data from a wide array of data sources – Databases, SaaS applications, Cloud Storage, Custom Sources your business may use and so on.

Sign up for a 14-Day Free Trial with Hevo to experience a seamless, hassle-free data load experience from Kafka to BigQuery.
