Organizations today have access to a wide stream of data. Data is generated from recommendation engines, page clicks, internet searches, product orders, and more. It is necessary to have an infrastructure that would enable you to stream your data as it gets generated and carry out analytics on the go. To aid this objective, incorporating a data pipeline for moving data from Apache Kafka to BigQuery is a step in the right direction.
What is Apache Kafka?
Apache Kafka is an open-source distributed event streaming platform. It provides a reliable pipeline to process data generated from various sources, sequentially and incrementally. Kafka handles both online and offline data consumption because the ingested data is persisted on disk and replicated within the cluster to prevent data loss. Kafka runs as a distributed system in which multiple machines work together as a single cluster. Apache Kafka supports use cases such as:
- Publish and subscribe to streams of records.
- Store streams of records in a fault-tolerant way.
- Process streams of records as they occur.
- Provide a framework for building analytics logic across streams of data using Kafka Streams.
Kafka is usually used to build real-time data streaming pipelines and data streaming applications that transform or react to data streams.
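To make the publish and subscribe idea concrete, here is a toy, in-memory model of a single topic partition. It is not Kafka's API: the `Topic` class, its methods, and the group names are hypothetical. It only illustrates that records are appended to a log and that each consumer group tracks its own read position.

```python
from collections import defaultdict


class Topic:
    """Toy stand-in for one Kafka topic partition (hypothetical, not Kafka's API)."""

    def __init__(self):
        self.log = []                     # append-only record log
        self.offsets = defaultdict(int)   # consumer group -> next offset to read

    def publish(self, record):
        """Append a record and return its offset."""
        self.log.append(record)
        return len(self.log) - 1

    def poll(self, group, max_records=10):
        """Return unread records for a group and advance that group's offset."""
        start = self.offsets[group]
        batch = self.log[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch


clicks = Topic()
clicks.publish({"user": "a", "page": "/home"})
clicks.publish({"user": "b", "page": "/cart"})

first_read = clicks.poll("analytics")    # both records published so far
clicks.publish({"user": "a", "page": "/checkout"})
second_read = clicks.poll("analytics")   # only the record added since the last poll
audit_read = clicks.poll("audit")        # a new group starts from offset 0
print(len(first_read), len(second_read), len(audit_read))  # 2 1 3
```

Because the log is kept rather than deleted on read, the late-joining "audit" group still sees every record, which is the property that lets Kafka serve both real-time and offline consumers.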
What is Google BigQuery?
Google BigQuery is a scalable and fully managed data warehouse built by Google that runs super-fast SQL queries. Google BigQuery’s Architecture is built on top of Dremel technology. Dremel is Google’s interactive ad-hoc query system for the analysis of read-only nested data.
It analyses data on a massive scale and runs as a fully serverless system, so you do not have to manage any infrastructure and are free to focus mainly on analytics. Google BigQuery provides a partitioning model that allows you to choose how you want your ingested data to be queried. The partitioning model is based on the concepts below:
- Processing Time: This partition model is based on the time an event was observed, usually the ingestion date.
- Event Time: In this case, the table is partitioned based on one of the TIMESTAMP/DATE fields on the incoming record.
These partitions allow you to avoid expensive and time-consuming full scans, as you only pay for the period queried. BigQuery provides support for both batch and stream loading data ingestion methods.
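The difference between the two partitioning concepts can be sketched in a few lines. The example below assumes daily partitions addressed with BigQuery's `table$YYYYMMDD` partition decorator. The `orders` table, the `event_ts` field, and the helper names are hypothetical.

```python
from datetime import datetime, timezone


def partition_decorator(table, ts):
    """Return 'table$YYYYMMDD', which addresses one daily partition (UTC)."""
    return f"{table}${ts.astimezone(timezone.utc):%Y%m%d}"


def target_partition(table, record, ingested_at, mode="event"):
    """Pick the partition by event time (a field on the record) or processing time."""
    if mode == "event":
        ts = datetime.fromisoformat(record["event_ts"])  # hypothetical field name
    else:
        ts = ingested_at
    return partition_decorator(table, ts)


# An order placed just before midnight UTC but ingested just after it.
record = {"order_id": 17, "event_ts": "2024-01-14T23:58:00+00:00"}
ingested_at = datetime(2024, 1, 15, 0, 3, tzinfo=timezone.utc)

print(target_partition("orders", record, ingested_at, mode="event"))       # orders$20240114
print(target_partition("orders", record, ingested_at, mode="processing"))  # orders$20240115
```

The same record lands in different partitions depending on the model, so a query for January 14 would miss it under processing-time partitioning.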
Now that we have covered some background information concerning both Apache Kafka and Google BigQuery, next up let us look at the options we have to load data from Kafka to BigQuery.
What is the Importance of Connecting Apache Kafka to BigQuery?
Various organizations rely on the open-source streaming platform Kafka to build real-time data applications and pipelines. These organizations are also looking to modernize their IT landscape and adopt BigQuery to meet their growing analytics needs.
By establishing a connection from Kafka to BigQuery, these organizations can quickly activate and analyze data-derived insights as they happen, as opposed to waiting for a batch process to be completed.
Therefore, Kafka to BigQuery migration enables real-time streaming analytics use cases like dynamic recommendations, fraud detection, predictive maintenance, capacity planning, inventory, or fleet management.
Methods to Set up Kafka to BigQuery Connection
You can easily set up your Kafka to BigQuery connection using the following 2 methods:
Method 1: Using Hevo Data to Move Data from Kafka to BigQuery
Hevo Data, a No-code Data Pipeline, helps you directly transfer data from Kafka to BigQuery in a completely hassle-free & automated manner. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Hevo takes care of all your data preprocessing needs required to set up Kafka to BigQuery Integration and lets you focus on key business activities.
Hevo provides a one-stop solution for all Kafka use cases and collects the data stored in your Topics & Clusters. Moreover, since Google BigQuery has built-in support for nested and repeated columns, Hevo neither splits nor compresses the JSON data.
Here are the steps to move data from Kafka to BigQuery using Hevo:
- Authenticate Kafka Source: Configure Kafka as the source for your Hevo Pipeline by specifying Broker and Topic Names.
- Configure BigQuery Destination: Configure the Google BigQuery Data Warehouse account, where the data needs to be streamed, as your destination for the Hevo Pipeline.
If you want to learn more about how Hevo connects with Apache Kafka & Google BigQuery, you may click on the links here and check our informative documentation page for in-depth understanding!
Have a look at what makes Hevo an amazing Data Integration solution:
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
- Auto Schema Mapping: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data from Kafka files and maps it to the Google BigQuery destination schema.
- Quick Setup: Hevo, with its automated features, can be set up in minimal time. Moreover, with its simple and interactive UI, it is extremely easy for new customers to work on and perform operations.
- Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
With continuous Real-Time data movement, Hevo allows you to combine Kafka data along with your other data sources and seamlessly load it to BigQuery with a no-code, easy-to-setup interface. Hevo Data also offers live support and easy transformations, and has been built to keep up with your needs as your operation scales up. Try our 14-day full-feature access free trial!
Method 2: Using Custom Code to Move Data from Kafka to BigQuery
The steps to build a custom-coded data pipeline between Apache Kafka and BigQuery are divided into 2, namely:
Step 1: Streaming Data from Kafka
There are various methods and open-source tools which can be employed to stream data from Kafka. This blog covers the following methods:
Streaming with Kafka Connect
Kafka Connect is an open-source component of Kafka. It is designed by Confluent to connect Kafka with external systems such as databases, key-value stores, and file systems.
It allows users to stream data from Kafka straight into BigQuery with sub-minute latency through its underlying framework. Kafka Connect lets you reuse existing connector implementations, so you don't need to write new integration code when moving new data. Kafka Connect provides a 'SINK' connector that continuously consumes data from Kafka topics and streams it to an external storage location within seconds. It also has a 'SOURCE' connector that ingests entire databases and streams table updates to Kafka topics.
Kafka Connect does not ship with a built-in connector for Google BigQuery. Hence, you will need a third-party connector such as the open-source BigQuery sink connector from WePay. With this connector, Google BigQuery tables can be auto-generated from the Avro schema seamlessly. The connector also aids in dealing with schema updates. As Google BigQuery streaming is backward compatible, it enables users to easily add new fields with default values, and streaming will continue uninterrupted.
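As a rough illustration, a sink connector is usually registered by sending a JSON document with a `name` and a `config` map to the Kafka Connect REST API (`POST /connectors`). The sketch below only builds that document. The property names (`project`, `defaultDataset`, `keyfile`, `autoCreateTables`) are assumptions based on the WePay connector's documented options and differ between connector versions, so check them against the version you install. The project, dataset, topic, and file path values are hypothetical.

```python
import json


def bigquery_sink_config(name, topics, project, dataset, keyfile):
    """Build the JSON body for registering a BigQuery sink connector (a sketch)."""
    return {
        "name": name,
        "config": {
            # Connector class of the WePay BigQuery sink connector.
            "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
            "tasks.max": "1",
            "topics": ",".join(topics),       # Kafka topics to consume
            # The keys below are assumed names; verify for your connector version.
            "project": project,
            "defaultDataset": dataset,
            "keyfile": keyfile,
            "autoCreateTables": "true",       # create tables from the Avro schema
        },
    }


payload = bigquery_sink_config(
    name="kafka-to-bigquery",
    topics=["orders", "clicks"],
    project="my-gcp-project",                 # hypothetical project id
    dataset="kafka_events",                   # hypothetical dataset
    keyfile="/etc/secrets/bigquery-key.json", # hypothetical path
)
print(json.dumps(payload, indent=2))
```

Kafka Connect expects every value inside `config` to be a string, which is why the task count and the boolean flag are quoted.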
Using Kafka Connect, the data can be streamed and ingested into Google BigQuery in real-time. This, in turn, gives users the advantage to carry out analytics on the fly.
Limitations of Streaming with Kafka Connect
- In this method, data is partitioned only by the processing time.
Streaming Data with Apache Beam
Apache Beam is an open-source unified programming model for defining batch and stream data processing jobs that can run on any supported execution engine. The Apache Beam model helps abstract away the complexity of parallel data processing. This allows you to focus on what is required of your job, not how the job gets executed.
One of the major downsides of streaming with Kafka Connect is that it partitions data only by processing time, which can lead to data arriving in the wrong partition. Apache Beam addresses this issue: a Beam pipeline can read the event timestamp from each record and write the row to the matching partition, and it supports both batch and stream data processing.
Apache Beam has a supported distributed processing backend called Google Cloud Dataflow that executes your code as a cloud job, making it fully managed and auto-scaled. The number of workers is fully elastic: it changes according to your current workload, and the cost of execution changes accordingly.
Limitations of Streaming Data with Apache Beam
- Apache Beam incurs an extra cost for running managed workers.
- Apache Beam is not a part of the Kafka ecosystem.
Hevo supports both Batch Load & Streaming Load for the Kafka to BigQuery use case and provides a no-code, fully-managed & minimal maintenance solution for this use case.
Step 2: Ingesting Data to BigQuery
Before you start streaming in from Kafka to BigQuery, you need to check the following boxes:
- Make sure you have write access to the dataset that contains your destination table to prevent subsequent errors when streaming.
- Check the quota policy for streaming data on BigQuery to ensure you are not in violation of any of the policies.
- Ensure that billing is enabled for your GCP (Google Cloud Platform) account. This is because streaming is not available for the free tier of GCP, hence if you want to stream data into Google BigQuery you have to make use of the paid tier.
Now, let us discuss the methods to ingest our streamed data from Kafka to BigQuery. The following approaches are covered in this post:
Streaming with BigQuery API
The Google BigQuery API is a data platform for users to manage, create, share, and query data. It supports streaming data directly into Google BigQuery with a quota of up to 100K rows per project.
Real-time data streaming on Google BigQuery API costs $0.05 per GB. To make use of Google BigQuery API, it has to be enabled on your account. To enable the API:
- Ensure that you have a project created.
- In the GCP Console, click on the hamburger menu, select APIs and services, and click on Dashboard.
- In the APIs and services window, select Enable APIs and Services.
- A search box will appear. Enter Google BigQuery. Two search results, Google BigQuery Data Transfer and Google BigQuery API, will appear. Select both of them and enable them.
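Since streaming is billed by volume, it is worth estimating the bill before you send production traffic. The sketch below uses the $0.05 per GB price quoted above. The throughput figures are made up for illustration, and treating 1 GB as 10^9 bytes is an assumption, so check the current pricing page for exact units and rates.

```python
PRICE_PER_GB_USD = 0.05      # price quoted in this article; verify current pricing
BYTES_PER_GB = 1_000_000_000 # assumption: decimal gigabytes
SECONDS_PER_DAY = 86_400


def monthly_streaming_cost(rows_per_second, avg_row_bytes, days=30):
    """Estimate the streaming bill in USD for a steady workload."""
    total_bytes = rows_per_second * avg_row_bytes * SECONDS_PER_DAY * days
    return round(total_bytes / BYTES_PER_GB * PRICE_PER_GB_USD, 2)


# Hypothetical workload: 1,000 rows per second at about 1,000 bytes per row.
cost = monthly_streaming_cost(rows_per_second=1_000, avg_row_bytes=1_000)
print(f"${cost:.2f} per 30 days")  # $129.60 per 30 days
```

At this rate the pipeline moves 86.4 GB per day, or 2,592 GB over 30 days, which is where the $129.60 figure comes from.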
With Google BigQuery API enabled, the next step would be to move the data from Apache Kafka through a stream processing framework like Kafka Streams into Google BigQuery. Kafka Streams is an open-source library for building scalable streaming applications on top of Apache Kafka. Kafka Streams allows users to execute their code as a regular Java application. The pipeline reads records from a Kafka topic, optionally filters rows, and streams the result into BigQuery. It supports both processing time and event time partitioning models.
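Kafka Streams itself is a Java library, so the following Python sketch is not Kafka Streams code. It only illustrates the shape of the data at the last hop: consumed messages are filtered and turned into the request body of BigQuery's `tabledata.insertAll` method, where each row carries a `json` payload and an optional `insertId` used for best-effort de-duplication. The message dictionaries, field names, and the `build_insert_all_request` helper are hypothetical.

```python
import json


def build_insert_all_request(messages, keep):
    """Filter consumed Kafka messages and build an insertAll request body."""
    rows = []
    for m in messages:
        record = json.loads(m["value"])
        if not keep(record):
            continue  # drop rows the pipeline is not interested in
        rows.append({
            # topic-partition-offset is unique per message, so a retried
            # message reuses the same insertId.
            "insertId": f"{m['topic']}-{m['partition']}-{m['offset']}",
            "json": record,
        })
    return {"skipInvalidRows": False, "ignoreUnknownValues": False, "rows": rows}


# Hypothetical messages as they might come back from a Kafka consumer.
messages = [
    {"topic": "orders", "partition": 0, "offset": 41,
     "value": '{"order_id": 1, "amount": 25.0}'},
    {"topic": "orders", "partition": 0, "offset": 42,
     "value": '{"order_id": 2, "amount": 0.0}'},
]

request = build_insert_all_request(messages, keep=lambda r: r["amount"] > 0)
print(json.dumps(request, indent=2))
```

Deriving `insertId` from the Kafka coordinates rather than a random value means a replay after a consumer restart is less likely to create duplicate rows.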
Limitations of Streaming with BigQuery API
- Though streaming with the Google BigQuery API gives you complete control over your records, you have to design a robust system to enable it to scale successfully.
- You have to handle all streaming errors and downsides independently.
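To give a feel for what handling errors yourself involves, here is a minimal retry sketch. It assumes a `send` function (hypothetical) that posts a batch and returns the parsed response, in which failed rows are listed under `insertErrors` with their `index` in the batch, as in the `tabledata.insertAll` response. Only the failed rows are resent, with exponential backoff between attempts.

```python
import time


def insert_with_retry(send, rows, max_attempts=3, base_delay=0.5):
    """Resend failed rows with backoff; return the rows that never succeeded."""
    pending = list(rows)
    for attempt in range(max_attempts):
        response = send(pending)
        failed = {err["index"] for err in response.get("insertErrors", [])}
        if not failed:
            return []  # every pending row was accepted
        pending = [row for i, row in enumerate(pending) if i in failed]
        if attempt < max_attempts - 1:
            time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return pending  # caller decides: dead-letter topic, alert, etc.


# A fake sender that rejects the second row on the first call only.
calls = []

def flaky_send(batch):
    calls.append(len(batch))
    if len(calls) == 1:
        return {"insertErrors": [{"index": 1, "errors": [{"reason": "backendError"}]}]}
    return {}


leftover = insert_with_retry(flaky_send, [{"id": 1}, {"id": 2}, {"id": 3}], base_delay=0)
print(calls, leftover)  # [3, 1] []
```

A production system would also need to distinguish retryable errors from permanent ones, such as schema mismatches, which this sketch does not do.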
Batch Loading Into Google Cloud Storage (GCS)
To use this technique you could make use of Secor. Secor is a tool designed to deliver data from Apache Kafka into object storage systems such as GCS and Amazon S3. From GCS, you then load the data into Google BigQuery using a load job, manually via the BigQuery UI, or through the command-line tool in Google's Cloud Software Development Kit (SDK).
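For the load job route, the job is described by a small configuration document. The sketch below builds one in the shape used by the BigQuery REST API's `jobs.insert` method for newline-delimited JSON files. The bucket, project, dataset, and table names are hypothetical, and the helper name is made up for this example.

```python
import json


def gcs_load_job(project, dataset, table, uris):
    """Describe a BigQuery load job that appends JSON files from GCS to a table."""
    return {
        "configuration": {
            "load": {
                "sourceUris": list(uris),                  # wildcards are allowed
                "sourceFormat": "NEWLINE_DELIMITED_JSON",  # one JSON object per line
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
                "writeDisposition": "WRITE_APPEND",        # keep existing rows
            }
        }
    }


job = gcs_load_job(
    project="my-gcp-project",                              # hypothetical
    dataset="kafka_events",                                # hypothetical
    table="orders",                                        # hypothetical
    uris=["gs://example-kafka-archive/orders/*.json"],     # hypothetical bucket
)
print(json.dumps(job, indent=2))
```

Using `WRITE_APPEND` matters here because each batch exported from Kafka should add to the table rather than replace what earlier batches loaded.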
Limitations of Batch Loading in GCS
- Secor lacks support for the Avro input format, which forces you to always use a JSON-based input format.
- This is a two-step process that can lead to latency issues.
- This technique does not stream data in real-time. This becomes a blocker in real-time analysis for your business.
- This technique requires a lot of maintenance to keep up with new Kafka topics and fields. To apply these changes, you would need to manually update the schema in the Google BigQuery table.
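Part of that maintenance can at least be detected automatically. The sketch below compares an incoming JSON record with the column names you already know about and proposes BigQuery schema entries (`name`, `type`, `mode`) for any new fields. The type mapping is a simplification that covers only flat scalar values, and the function and field names are hypothetical.

```python
def propose_new_columns(known_columns, record):
    """Return schema entries for record fields missing from the known columns."""
    def bq_type(value):
        # bool must be checked before int because bool is a subclass of int.
        if isinstance(value, bool):
            return "BOOLEAN"
        if isinstance(value, int):
            return "INTEGER"
        if isinstance(value, float):
            return "FLOAT"
        return "STRING"  # fallback for strings and anything unrecognized

    missing = sorted(set(record) - set(known_columns))
    # New columns must be NULLABLE so that existing rows remain valid.
    return [{"name": f, "type": bq_type(record[f]), "mode": "NULLABLE"} for f in missing]


known = ["id", "name"]
incoming = {"id": 1, "name": "widget", "price": 9.5, "active": True}
print(propose_new_columns(known, incoming))
```

The output can be reviewed and then applied to the table, which turns a silent load failure into an explicit schema change.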
This article provided you with a step-by-step guide on how you can set up a Kafka to BigQuery connection using a custom script or using Hevo. However, there are certain limitations associated with the custom script method. You will need to implement it manually, which will consume your time & resources and is error-prone. Moreover, you need working knowledge of the backend tools to successfully implement the in-house data transfer mechanism.
Hevo Data provides an Automated No-code Data Pipeline that empowers you to overcome the above-mentioned limitations. Hevo caters to 100+ data sources (including 40+ free sources) and can seamlessly transfer your data from Kafka to BigQuery within minutes. Hevo’s Data Pipeline enriches your data and manages the transfer process in a fully automated and secure manner without having to write any code. It will make your life easier and make data migration hassle-free.
Want to take Hevo for a spin? Sign up for a 14-day free trial and experience the feature-rich Hevo suite firsthand.
Share your understanding of the Kafka to BigQuery Connection in the comments below!