Many organizations collect, process, and analyze large amounts of data in real time to derive meaningful insights. For high-performance real-time data streaming and advanced analytics, many of them rely on Apache Kafka and Databricks.

By migrating data from Apache Kafka to Databricks, you will get a highly scalable, low-latency, and fault-tolerant unified data platform for analyzing large datasets. Based on the analysis, you can respond to market conditions quickly and make timely decisions.

Let’s look into the detailed steps for migrating data from Apache Kafka to Databricks.

Kafka: An Overview


Kafka was initially designed as a distributed messaging queue at LinkedIn to facilitate activity tracking and real-time streaming across its various internal applications. It was later open-sourced and donated to the Apache Software Foundation, where it became one of the most active Apache projects.

Apache Kafka is now an open-source distributed event streaming platform. Thousands of organizations utilize Kafka to build high-performance real-time data pipelines and event-driven architectures to conduct streaming data analytics.

The key features of Apache Kafka include high throughput, high availability, scalability, and permanent storage. It utilizes a cluster of machines to deliver messages with latencies of less than two milliseconds. With Kafka, you can scale storage and processing resources up or down and rely on a fault-tolerant, durable cluster to store your data securely.

You can access Apache Kafka in several ways, including client libraries for languages such as Java or Python, command-line interface (CLI) tools, or the Kafka REST Proxy.
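
For illustration, here is a minimal sketch of the client-library approach, producing a single message with the Kafka Java client from Scala. The broker address, topic name, key, and payload are assumptions, and the snippet expects the kafka-clients library on the classpath.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerSketch {
  def main(args: Array[String]): Unit = {
    // Assumed broker address; replace with your own bootstrap servers.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Send one JSON-like event to a hypothetical "events" topic.
    producer.send(new ProducerRecord[String, String]("events", "user-42", """{"action":"click"}"""))
    producer.flush()
    producer.close()
  }
}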

Databricks: An Overview

Databricks is a data intelligence and unified analytics platform introduced by the creators of Apache Spark. Its goal is to bring AI to your data by unifying data science with engineering and business intelligence.

The Databricks Data Intelligence Platform is based on a Lakehouse architecture that allows everyone in your organization to derive insights from your data using natural language. The platform supports Extract, Transform, Load (ETL), data warehousing, and the development of AI applications on your data.

The Databricks Unified Analytics platform helps you quickly preprocess your data on a large scale, continuously train machine learning models, and deploy them for all your AI applications. It also helps you build generative AI applications on your data without sacrificing data privacy.

Methods for Apache Kafka to Databricks Migration

To migrate data from Apache Kafka to Databricks, you can either use Hevo Data or a custom method.

Method 1: Using Hevo Data to Migrate Data from Apache Kafka to Databricks

Hevo Data is a real-time ELT, no-code, and cost-effective data pipeline platform that automates flexible data pipelines tailored to your requirements. With integrations to 150+ data sources, Hevo Data helps you export data from sources, load it into destinations, and transform it for in-depth analysis.

Here are some features of Hevo Data:

  • Data Transformation: Hevo Data utilizes analyst-friendly data transformation features to streamline analytics tasks. By writing a Python-based Transformation script or using Drag-and-Drop Transformation blocks, you can clean, prepare, and standardize the data before loading it into your destination.  
  • Incremental Data Load: Hevo Data enables the transfer of modified data and ensures efficient bandwidth utilization at both pipeline ends.
  • Auto Schema Mapping: Hevo Data’s Auto Mapping feature helps eliminate the tedious task of schema management. This feature automatically detects the incoming data format and replicates it to the destination schema. You can choose either full or incremental mappings to meet your data replication needs.

Let’s look into the detailed steps to set up an Apache Kafka to Databricks pipeline with Hevo.

Step 1: Configuring Apache Kafka as Your Source

Before getting started, make sure that you have a running Apache Kafka server and access to its configuration file (server.properties).

Here are the steps to configure Apache Kafka as the source in your Hevo pipeline:

  1. Locate the Bootstrap Server Information
    • Open the server.properties file from your file system.
    • Copy the complete line after bootstrap.servers:

Example: bootstrap.servers: hostname1:9092, hostname2:9092

Copy hostname1:9092 and hostname2:9092 from the above line, where hostname1 and hostname2 are the bootstrap server hostnames and 9092 is the port number.

  2. Whitelist Hevo’s IP Addresses

To allow Hevo Data to connect to your Apache Kafka server, you must whitelist the Hevo IP address for your region in the Kafka server configuration file.

  • Go to your Kafka server configuration directory and open the Kafka server configuration file by using the following sample command:

sudo nano /usr/local/etc/config/server.properties

Your path to the Kafka server configuration file can be different.

  • Scroll down to the listeners section in the configuration file. If no such section exists, add it on a new line.
  • Add the following entry under the listeners section:

<protocol>://0.0.0.0:<port> or <protocol>://<hevo_ip>:<port>

Here, <protocol> is either SSL or PLAINTEXT, <hevo_ip> is your region’s Hevo IP address, and <port> is the same port number used by your bootstrap server (9092). For example, listeners=PLAINTEXT://0.0.0.0:9092 accepts plaintext connections on port 9092 from any IP address.

  • Save the Kafka server configuration file.
  3. Configure Apache Kafka Connection Settings

Follow the steps below to configure Apache Kafka as the source in your pipeline: 

  1. In the Navigation Bar, click PIPELINES.
  2. In the Pipelines List View, click + CREATE.
  3. Choose Kafka from the Select Source Type page, then select Apache Kafka from the Select your Kafka Variant page.
  4. When you are redirected to the Configure your Kafka Source page, fill in the following fields:
Apache Kafka to Databricks: Configuring Kafka Source page
  • Pipeline Name: A unique name for your pipeline, not exceeding 255 characters.
  • Bootstrap Server(s): The bootstrap server information you extracted from Apache Kafka earlier.

Example: hostname1:9092

  • Ingest Data From: You can choose any one of the following:
    • All Topics: This option allows you to ingest data from all topics and automatically include the newly created topics.
    • Specific Topics: Choose this option to manually specify a comma-separated list of topics. New topics are not added automatically.
    • Topics Matching a Pattern (Regex): Specify a regular expression (regex), and Hevo ingests data from all topics whose names match the given pattern.
  • Use SSL: If you want to use an SSL-encrypted connection, enable this option and specify the following:
    • CA File: This includes the SSL Server Certificate Authority (CA).
      • If this option is selected, Hevo Data will load up to 50 CA certificates from the attached CA file.
      • If not selected, Hevo Data will load only the first certificate.
    • Client Certificate: This file contains the client’s public key certificate.
    • Client Key: This file has the client’s private key file.
  5. Click TEST & CONTINUE.

To learn more about the Apache Kafka Source Connector, read Hevo Data’s Apache Kafka documentation.

Step 2: Configuring Databricks as Your Destination

Hevo Data allows you to load data from Kafka into a Databricks data warehouse hosted on any cloud platform, such as AWS, Azure, or GCP, using one of the following methods:

  • The Databricks Partner Connect
  • The Databricks Credentials
1. Set up Databricks as the Destination Using Databricks Partner Connect

Before getting into the configuration, start by ensuring the following prerequisites are in place:

  • An active Azure, AWS, or GCP cloud service account.
  • A Databricks workspace in your respective cloud service account.
  • You must allow connections from your region’s Hevo IP addresses to your workspace by enabling the IP access lists option with your cloud service provider.
  • The URL of your Databricks workspace should be in the format of https://<deployment name>.cloud.databricks.com.

Example: If the deployment name is dbc-westeros, the URL of your workspace would be https://dbc-westeros.cloud.databricks.com.

  • You must have a Team Collaborator or any administrator role (except the Billing Administrator role) in Hevo.

Let’s get into the detailed steps to configure Databricks as the destination using Databricks Partner Connect:

  1. Sign in to your Databricks account.
  2. Click on the Partner Connect option in the left navigation pane.
Apache Kafka to Databricks: Accessing Partner Connect from the Databricks interface
  3. From the Partner Connect page, click HEVO under the Data Ingestion section.
Apache Kafka to Databricks: Partner Connect Page
  4. A pop-up window called Connect to Partner appears on your screen. Choose the required options per your needs and click the Next button.
  5. Add your active email address in the Email field, and click the Connect to Hevo Data button.
  6. Sign in to your Hevo account or create a new one.
  7. After you log in, you are redirected to the Configure your Databricks Destination page, where you should fill in the following fields:
Apache Kafka to Databricks: Configure your Databricks Destination
  • Destination Name: A unique name for your destination, not exceeding 255 characters.
  • Schema Name: The name of the destination database schema (the default value is default).
  • Advanced Settings:
    • If the Populate Loaded Timestamp option is enabled, Hevo adds the __hevo_loaded_at column at the end of the destination table to denote the time at which the event was loaded.
    • If the Sanitize Table/Column Names option is enabled, Hevo removes non-alphanumeric characters and spaces from column and table names and replaces them with an underscore (_).
    • If the Create Delta Tables in External Location (Optional) option is enabled, Hevo creates external Delta tables at the {schema}/{table} path under the specified external location.
    • If the Vacuum Delta Tables option is enabled, Hevo runs the VACUUM command every weekend to remove uncommitted files and clean up the Delta tables.
    • If the Optimize Delta Tables option is enabled, Hevo runs OPTIMIZE queries every weekend to improve the data layout and query speed. A sketch of what these maintenance operations look like follows the steps below.
  8. Click the TEST CONNECTION button.
  9. Click the SAVE & CONTINUE button.
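
To make the Vacuum Delta Tables and Optimize Delta Tables settings more concrete, here is a minimal sketch of the kind of Delta maintenance they automate, assuming a Databricks Scala notebook (where spark is predefined) and a hypothetical table named demo_schema.kafka_events.

%scala

// OPTIMIZE compacts small files to improve the data layout and query speed.
spark.sql("OPTIMIZE demo_schema.kafka_events")

// VACUUM removes stale and uncommitted files; 168 hours matches the default 7-day retention.
spark.sql("VACUUM demo_schema.kafka_events RETAIN 168 HOURS")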

According to Hevo Data, Databricks Partner Connect is the recommended method to connect Kafka to Databricks. To read more about Databricks Partner Connect, refer to the Hevo Documentation for Databricks Partner Connect.

To learn about the configuration of Databricks as a destination using Databricks Credentials, read Connect Using Databricks Credentials.

With either of the two Hevo Data approaches, you can quickly ingest data from Kafka into Databricks and ensure a smooth data migration.

Learn more about Kafka Hadoop Integration.


Method 2: Migrate Data from Apache Kafka to Databricks Using a Custom Method

This method involves setting up Apache Kafka on AWS EC2 machines and connecting them with Databricks. To learn more about AWS EC2 machines, refer to the Amazon EC2 documentation.

Let’s look into the step-by-step process in detail:

Step 1: Create a New Virtual Private Cloud (VPC) in AWS

1. To create a new VPC, set its CIDR range so that it does not overlap with the Databricks VPC CIDR range.

Example: 

The VPC ID of the Databricks VPC is vpc-7f4c0d18, and its CIDR IP range is 10.205.0.0/16.

Apache Kafka to Databricks: Range of the Databricks VPC CIDR

The new VPC ID is vpc-8eb1faf7, and its CIDR IP range is 10.10.0.0/16.

Apache Kafka to Databricks: Range of the New VPC CIDR

2. Create a new Internet gateway by clicking Create internet gateway.

Apache Kafka to Databricks: Building a New Internet Gateway

3. Attach the new Internet gateway to the new VPC (vpc-8eb1faf7) and add it as a route in the VPC’s route table.

Apache Kafka to Databricks: Attaching the Internet Gateway to VPC

Step 2: Launch the AWS EC2 Instance in the New VPC 

In this step, launch an AWS EC2 instance in the new VPC (vpc-8eb1faf7).

Apache Kafka to Databricks: Launching EC2 Instance in the New VPC

Step 3: Install Kafka and ZooKeeper on the New EC2 Instance

1. Connect to the EC2 machine over Secure Shell (SSH) with a key pair using the following command:

ssh -i keypair.pem ec2-user@ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com

2. Download the Kafka archive and extract it using the following commands:

wget https://apache.claz.org/kafka/0.10.2.1/kafka_2.12-0.10.2.1.tgz
tar -zxf kafka_2.12-0.10.2.1.tgz

3. Start ZooKeeper using the following commands:

cd kafka_2.12-0.10.2.1

bin/zookeeper-server-start.sh config/zookeeper.properties

4. Modify the config/server.properties file and set advertised.listeners to the private IP address of the EC2 instance (10.10.143.166):

advertised.listeners=PLAINTEXT://10.10.143.166:9092

5. Start the Kafka broker using the following commands:

cd kafka_2.12-0.10.2.1

bin/kafka-server-start.sh config/server.properties

Step 4: Peering Two VPCs

1. Establish a new peering connection by clicking the Create Peering Connection button.

Apache Kafka to Databricks: Create a Peering Connection

2. Add the peering connection to the route tables of both your Databricks VPC and the new VPC for Apache Kafka.

  • In the Apache Kafka VPC, navigate to the route table and add a route to the Databricks VPC.
Apache Kafka to Databricks: Adding the route to the Databricks VPC in the Kafka VPC
  • In the Databricks VPC, navigate to the route table and add a route to the Kafka VPC.
Apache Kafka to Databricks: Adding route to the Kafka VPC in the Databricks VPC

To learn more, read VPC Peering.

Step 5: Accessing the Kafka Broker from a Databricks Notebook

1. Test whether you can reach the EC2 instance running the Kafka broker using telnet (for example, telnet 10.10.143.166 9092).

Apache Kafka to Databricks: Testing EC2 instance using telnet

2. SSH into the Kafka broker from a Databricks notebook using the following command:

%sh

ssh -i keypair.pem ec2-user@ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com

3. Create a new topic in the Kafka broker by producing messages to it with the console producer:

%sh

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wordcount < LICENSE

4. Read the data from the Databricks notebook using Structured Streaming:

%scala

import org.apache.spark.sql.functions._

val kafka = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "10.10.143.166:9092")
  .option("subscribe", "wordcount")
  .option("startingOffsets", "earliest")
  .load()

display(kafka)
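
As a follow-up, here is a hedged sketch of persisting the same stream into a Delta table so it can be queried later from Databricks; the checkpoint location and output path are assumptions for illustration.

%scala

// Cast the raw Kafka key/value bytes to strings before writing them out.
val events = kafka.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

// Write the stream to a Delta location; the checkpoint and output paths are hypothetical.
events.writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/checkpoints/wordcount")
  .outputMode("append")
  .start("/delta/wordcount_events")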

You have successfully integrated Apache Kafka and Databricks using an AWS EC2 instance.

Limitations of Custom Method for Kafka to Databricks Data Transfer

  • High Technical Expertise: Additional technical expertise in AWS EC2 Services, ZooKeeper, and Telnet commands is required to integrate Kafka into Databricks. 
  • Time-Consuming Process: This method is time-consuming because you need to create a Virtual Private Cloud for Kafka, launch AWS EC2 instances within the Kafka VPC, and add routes to the routing tables of both the Kafka VPC and the Databricks VPC.
  • Lack of Real-time Integration: This method does not support real-time integration capabilities, which can affect the data analytics tasks and decision-making process.

Use Cases for Apache Kafka to Databricks Migration

  • High Scalability: Databricks can handle fast-growing volumes of data without compromising performance because the platform is highly scalable.
  • Fault Tolerance: Databricks is resilient to failures, which keeps data processing and analytics tasks running without interruptions.

Conclusion

Apache Kafka to Databricks integration has revolutionized real-time data processing and analytics on vast amounts of data. You can integrate Apache Kafka with Databricks in many ways, but Hevo Data stands out because of its real-time, no-code data pipelines. To learn how Hevo Data and Databricks partner to automate data integration for the Lakehouse, read this Databricks Blog Post.

Visit our Website to Explore Hevo

Hevo can help you integrate your data from numerous sources and load them into a destination to analyze real-time data with a BI tool such as Tableau. It helps transfer data from source to a destination of your choice for free. It will make your life easier and data migration hassle-free. It is user-friendly, reliable, and secure. Check out the Hevo Pricing details here.  

Sign Up for a 14-day free trial with Hevo to start moving data from Kafka to Databricks right now!

Frequently Asked Questions (FAQs)

Q. When migrating from Kafka to Databricks using Hevo, how can I synchronize the new and updated Kafka topics with Databricks as a Destination connector?

With Hevo Data’s data replication feature, you can synchronize all new and updated records with Databricks based on the ingestion frequency. Upon creating a pipeline, Hevo performs an initial data synchronization in which all the available data in the Kafka topics is ingested into Databricks. After the initial synchronization, Hevo monitors for new and updated Kafka topics, and any updates are incrementally ingested into Databricks at the defined ingestion frequency.

Q. For a Kafka to Databricks Migration in Hevo Data, I want to create Delta Tables to store the Kafka topics in an external location. How can I locate an external location in the Databricks console?

If you can access the Databricks File System (DBFS), you can directly navigate to the left side of the Databricks console and click the Data option. On the top of the sliding bar, choose DBFS, and select or view the external path in which you want to create the Delta tables. For example, if the path is demo/default, the path to the external location can be demo/default/{schema}/{table}.

If you do not have access to the Databricks File System (DBFS), execute the following command in your Databricks Destination:

DESCRIBE TABLE EXTENDED <table_name>

Here, <table_name> is the name of the Delta table you want to create in the external location.
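
If you prefer to run this from a Scala notebook cell, here is a minimal sketch with a hypothetical table name; the Location row of the output shows where the table’s files live.

%scala

import org.apache.spark.sql.functions.col

// DESCRIBE TABLE EXTENDED returns one row per property; filter for the storage location.
val details = spark.sql("DESCRIBE TABLE EXTENDED demo_schema.kafka_events")
details.filter(col("col_name") === "Location").show(truncate = false)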

Sony Saji
Technical Writer, Hevo Data

Sony is a former Computer Science teacher turned technical content writer, specializing in crafting blogs and articles on topics such as ML, AI, Python Frameworks, and other emerging trends in Data Science and Analytics.
