Easily move your data from Kafka to Databricks to enhance your analytics capabilities. With Hevo’s intuitive pipeline setup, data flows in real time. Check out our 1-minute demo below to see the seamless integration in action!
Many organizations collect, process, and analyze large amounts of data in real time to derive meaningful insights. For high-performance real-time data streaming and advanced analytics, they rely on Apache Kafka and Databricks.
By migrating data from Apache Kafka to Databricks, you get a highly scalable, low-latency, and fault-tolerant unified data platform for analyzing large datasets. Based on that analysis, you can respond to market conditions quickly and make timely decisions.
Let’s look into the detailed steps for migrating data from Apache Kafka to Databricks.
Kafka: An Overview
Kafka was initially designed as a distributed messaging queue at LinkedIn to facilitate activity tracking and real-time streaming across its various internal applications. It was later open-sourced through the Apache Software Foundation and, thanks in large part to its low latency, has become one of the most active Apache projects.
Apache Kafka is now an open-source distributed event streaming platform. Thousands of organizations utilize Kafka to build high-performance real-time data pipelines and event-driven architectures to conduct streaming data analytics.
The key features of Apache Kafka include high throughput, high availability, scalability, and permanent storage. It uses a cluster of machines to deliver messages with latencies as low as two milliseconds. With Kafka, you can scale storage and processing resources up or down and rely on a fault-tolerant, durable cluster to store your data securely.
You can access Apache Kafka in several ways, including client libraries (for example, in Java or Python), command-line interface (CLI) tools, or the Kafka REST Proxy.
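For instance, here is a minimal sketch of publishing a single message with the Kafka client library from Scala; the broker address and topic name are placeholders rather than values from this guide, and the snippet assumes the org.apache.kafka:kafka-clients dependency is available:
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Placeholder broker address and string serializers for simple text messages.
val props = new Properties()
props.put("bootstrap.servers", "hostname1:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

// Send one record to a hypothetical demo-topic and close the producer.
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("demo-topic", "key-1", "hello from the client library"))
producer.close()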
Databricks: An Overview
Databricks is a data intelligence and unified analytics platform introduced by the creators of Apache Spark. Its goal is to bring AI to your data by unifying data science with engineering and business intelligence.
The Databricks Data Intelligence Platform is based on a lakehouse architecture that allows everyone in your organization to derive insights from your data using natural language. You can do this by performing extract, transform, load (ETL), data warehousing, and AI application development on your data.
The Databricks Unified Analytics platform helps you quickly preprocess your data on a large scale, continuously train machine learning models, and deploy them for all your AI applications. It will help you build a generative AI application on your data without sacrificing data privacy.
Methods for Apache Kafka to Databricks Migration
To migrate data from Apache Kafka to Databricks, you can either use Hevo Data or a custom method.
Method 1: Using Hevo Data to Migrate Data from Apache Kafka to Databricks
Hevo Data is a real-time, no-code, and cost-effective ELT data pipeline platform that automates data pipelines flexible to your requirements. With integrations to 150+ data sources, Hevo Data helps you export data from sources, load it into destinations, and transform it for in-depth analysis.
Let’s look into the detailed steps to set up an Apache Kafka to Databricks pipeline in Hevo.
Step 1: Configuring Apache Kafka as Your Source
Before getting started, make sure that the following prerequisites are in place:
Here are the steps to configure Apache Kafka as the source in your Hevo pipeline:
- Locate the Bootstrap Server Information
- Locate the server.properties file in your file system and open it.
- Copy the complete line after bootstrap.servers:
Example: bootstrap.servers=hostname1:9092,hostname2:9092
Copy hostname1:9092 and hostname2:9092 from the above line; each entry is a bootstrap server hostname followed by its port number.
- Whitelist Hevo’s IP Addresses
To allow Hevo Data to connect to your Apache Kafka server, you must whitelist the Hevo IP address for your region in the Kafka server configuration file.
- Go to your Kafka server configuration directory and open the Kafka server configuration file by using the following sample command:
sudo nano /usr/local/etc/config/server.properties
The path to your Kafka server configuration file may be different.
- Scroll down to the listeners section in the configuration file. If no such section exists, add one on a new line.
- Add the following entry under the listeners section:
<protocol>://0.0.0.0:<port> or <protocol>://<hevo_ip>:<port>
Here, <protocol> is either SSL or PLAINTEXT, <hevo_ip> is your region’s Hevo IP address, and <port> is the same port number used by your bootstrap server (9092).
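For example, assuming a PLAINTEXT listener on the default port, the resulting entry in server.properties might look like this:
listeners=PLAINTEXT://0.0.0.0:9092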
- Save the Kafka server configuration file.
- Configure Apache Kafka Connection Settings
Follow the steps below to configure Apache Kafka as the source in your pipeline:
- In the Navigation Bar, click PIPELINES.
- In the Pipelines List View, click + CREATE.
- Choose Kafka from the Select Source Type page, then select Apache Kafka from the Select your Kafka Variant page.
- When you are redirected to the Configure your Kafka Source page, fill in the following fields:
- Pipeline Name: A unique name for your pipeline, not exceeding 255 characters.
- Bootstrap Server(s): The bootstrap server information you extracted from Apache Kafka earlier.
Example: hostname1:9092
- Ingest Data From: You can choose any one of the following:
- All Topics: This option ingests data from all topics and automatically includes newly created topics.
- Specific Topics: Choose this option to manually specify a comma-separated list of topics. New topics are not added automatically.
- Topics Matching a Pattern (Regex): Specify a regular expression (regex), and Hevo selects all topics whose names match the pattern; for example, a pattern such as orders_.* would match topics like orders_us and orders_eu.
- Use SSL: If you want to use an SSL-encrypted connection, enable this option and specify the following:
- CA File: The file containing the SSL Server Certificate Authority (CA) certificate.
- If this option is selected, Hevo Data loads up to 50 CA certificates from the attached CA file.
- If not selected, Hevo Data loads only the first certificate.
- Client Certificate: The file containing the client’s public key certificate.
- Client Key: The file containing the client’s private key.
- Click TEST & CONTINUE.
To learn more about the Apache Kafka Source Connector, read Hevo Data’s Apache Kafka Documentation.
Step 2: Configuring Databricks as Your Destination
Hevo Data allows you to load data from Kafka into a Databricks data warehouse hosted on a cloud platform such as AWS, Azure, or GCP using one of the following ways:
- The Databricks Partner Connect
- The Databricks Credentials
1. Set up Databricks as the Destination Using the Databricks Partner Connect
Before getting into the configuration, start by ensuring the following prerequisites are in place:
- An active Azure, AWS, or GCP cloud service account.
- A Databricks workspace in your respective cloud service account.
- Connections from your region’s Hevo IP addresses to your workspace must be allowed. If you use the IP access lists option with your cloud service provider, add Hevo’s IP addresses to the list.
- The URL of your Databricks workspace should be in the format https://<deployment name>.cloud.databricks.com.
Example: If the deployment name is dbc-westeros, the URL of your workspace would be https://dbc-westeros.cloud.databricks.com.
- You must have a Team Collaborator or any administrator role (except the Billing Administrator role) in Hevo.
Let’s get into the detailed steps to configure Databricks as the destination using Databricks Partner Connect:
- Sign in to your Databricks account.
- Click on the Partner Connect option in the left navigation pane.
- From the Partner Connect page, click HEVO under the Data Ingestion section.
- A pop-up window called Connect to Partner will appear on your screen. You can choose the required options per your needs and click on the Next button.
- Add your active email address in the Email field, and click the Connect to Hevo Data button.
- Sign in to your Hevo account or create a new one.
- After you log in, you are redirected to the Configure your Databricks Destination page, where you should fill in the following fields:
- Destination Name: A unique destination name, not exceeding 255 characters.
- Schema Name: The name of the destination database schema. The default value is default.
- Advanced Settings:
- If the Populate Loaded Timestamp option is enabled, Hevo adds the __hevo_loaded_at column at the end of the destination table to record the time at which each event was loaded.
- If the Sanitize Table/Column Names option is enabled, Hevo removes non-alphanumeric characters and spaces from column and table names and replaces them with an underscore (_).
- If the Create Delta Tables in External Location (Optional) option is enabled, Hevo creates external Delta tables under the /{schema}/{table} path.
- If the Vacuum Delta Tables option is enabled, Hevo runs the VACUUM command every weekend to remove uncommitted files and clean up the Delta tables.
- If the Optimize Delta Tables option is enabled, Hevo runs OPTIMIZE queries every weekend to improve the data layout and query speed (see the illustrative snippet after this list).
- Click on the TEST CONNECTION button.
- Click on the SAVE & CONTINUE button.
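The Vacuum and Optimize settings above correspond to standard Delta Lake maintenance commands. As an illustration only (the table name below is a placeholder, and this is not the exact job Hevo schedules), you could run the same maintenance manually from a Databricks notebook:
%scala
// Compact small files to improve data layout and query speed.
spark.sql("OPTIMIZE my_schema.my_kafka_table")
// Remove stale, uncommitted files beyond the default retention period.
spark.sql("VACUUM my_schema.my_kafka_table")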
According to Hevo Data, Databricks Partner Connect is the recommended method to connect Kafka to Databricks. To read more about Databricks Partner Connect, refer to the Hevo Documentation for Databricks Partner Connect.
To learn about the configuration of Databricks as a destination using Databricks Credentials, read Connect Using Databricks Credentials.
With either of these two Hevo Data approaches, you can quickly ingest data from Kafka into Databricks and ensure a smooth data migration.
Learn more about Kafka Hadoop Integration.
Method 2: Migrate Data from Apache Kafka to Databricks Using a Custom Method
This method involves setting up Apache Kafka on AWS EC2 machines and connecting them to Databricks. To learn more about AWS EC2 machines, refer to the Amazon EC2 documentation.
Let’s look into the step-by-step process in detail:
Step 1: Create a New Virtual Private Cloud (VPC) in AWS
1. Create a new VPC, setting its CIDR range so that it does not overlap with the CIDR range of the Databricks VPC.
Example:
The VPC ID of the Databricks is vpc-7f4c0d18, and its CIDR IP range is 10.205.0.0/16.
The new VPC ID is vpc-8eb1faf7, and its CIDR IP range is 10.10.0.0/16.
2. Create a new internet gateway by clicking Create Internet gateway.
3. Attach the created internet gateway to the new VPC’s route table, where the ID of the new VPC is vpc-8eb1faf7.
Step 2: Launch the AWS EC2 Instance in the New VPC
In this step, launch an AWS EC2 instance in the new VPC (vpc-8eb1faf7).
Step 3: Install Kafka and ZooKeeper on the New EC2 Instance
1. Connect to the EC2 machine over Secure Shell (SSH) with a key pair using the following command:
ssh -i keypair.pem ec2-user@ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com
2. Download the Kafka archive and extract it using the following command:
wget https://apache.claz.org/kafka/0.10.2.1/kafka_2.12-0.10.2.1.tgz
tar -zxf kafka_2.12-0.10.2.1.tgz
3. Start ZooKeeper using the following commands:
cd kafka_2.12-0.10.2.1
bin/zookeeper-server-start.sh config/zookeeper.properties
4. Modify the config/server.properties file and set the advertised listener to the private IP address of the EC2 instance (10.10.143.166):
advertised.listeners=PLAINTEXT://10.10.143.166:9092
5. Start the Kafka broker using the following commands:
cd kafka_2.12-0.10.2.1
bin/kafka-server-start.sh config/server.properties
Step 4: Peering Two VPCs
1. Establish a new peering connection by clicking the Create Peering Connection button.
2. Add the peering connection to the route tables of your Databricks VPC and the new Kafka VPC:
- In the route table of the Apache Kafka VPC, add a route to the Databricks VPC.
- In the route table of the Databricks VPC, add a route to the Kafka VPC.
To learn more, read VPC Peering.
Step 5: Accessing the Kafka Broker from a Databricks Notebook
1. Test whether you can reach the EC2 instance running the Kafka broker by using telnet.
2. SSH into the Kafka broker using the following command:
%sh
ssh -i keypair.pem ec2-user@ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com
3. Create the wordcount topic in the Kafka broker by producing messages to it with the console producer (here, the LICENSE file is piped into the topic):
%sh
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wordcount < LICENSE
4. Read the data from a Databricks notebook:
%scala
import org.apache.spark.sql.functions._

val kafka = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "10.10.143.166:9092")
  .option("subscribe", "wordcount")
  .option("startingOffsets", "earliest")
  .load()

display(kafka)
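Once the stream is visible, a common next step is to persist it instead of only displaying it. The sketch below is a hypothetical follow-up rather than part of the original walkthrough; the checkpoint path and table name are placeholders:
%scala
// Cast the raw Kafka key/value bytes to strings and append them to a Delta table.
val events = kafka.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
events.writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/checkpoints/wordcount")
  .outputMode("append")
  .toTable("wordcount_raw")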
You have successfully integrated Apache Kafka and Databricks using an AWS EC2 instance.
Limitations of Custom Method for Kafka to Databricks Data Transfer
- High Technical Expertise: Integrating Kafka with Databricks this way requires additional expertise in AWS EC2, ZooKeeper, and networking tools such as telnet.
- Time-Consuming Process: This method is time-consuming because you need to create a Virtual Private Cloud for both Kafka and Databricks. You must then launch AWS EC2 instances within the Kafka VPC and add routes to the route tables of both the Kafka VPC and the Databricks VPC.
- Lack of Real-time Integration: This method does not support real-time integration capabilities, which can affect the data analytics tasks and decision-making process.
Use Cases for Apache Kafka to Databricks Migration
- High Scalability: Databricks can handle fast-growing volumes of data without compromising performance.
- Fault Tolerance: Databricks is resilient to failures, so your data processing and analytics tasks can continue without interruption.
Conclusion
Apache Kafka to Databricks integration has transformed real-time data processing and analytics on vast amounts of data. You can integrate Apache Kafka with Databricks in many ways, but Hevo Data stands out because of its real-time, no-code data pipelines. To learn more, read the Databricks blog post Hevo Data and Databricks Partner to Automate Data Integration for the Lakehouse.
Visit our Website to Explore Hevo
Hevo can help you integrate data from numerous sources and load it into a destination where you can analyze real-time data with a BI tool such as Tableau. It helps you transfer data from a source to a destination of your choice for free, making your life easier and data migration hassle-free. It is user-friendly, reliable, and secure. Check out the Hevo pricing details here.
Sign Up for a 14-day free trial with Hevo to start moving data from Kafka to Databricks right now!
Frequently Asked Questions (FAQs)
When migrating from Kafka to Databricks using Hevo, how can I synchronize the new and updated Kafka topics with Databricks as a Destination connector?
With Hevo Data’s data replication feature, you can synchronize all new and updated records with Databricks based on the ingestion frequency. Upon creating a pipeline, Hevo performs an initial data synchronization in which all the available data in the Kafka topics is ingested into Databricks. After the initial synchronization, Hevo monitors for new and updated Kafka topics, and any new or updated data is incrementally ingested into Databricks at the defined ingestion frequency.
For a Kafka to Databricks Migration in Hevo Data, I want to create Delta Tables to store the Kafka topics in an external location. How can I locate an external location in the Databricks console?
If you have access to the Databricks File System (DBFS), navigate to the left side of the Databricks console and click the Data option. At the top of the sliding panel, choose DBFS, and select or view the external path in which you want to create the Delta tables. For example, if the path is demo/default, the path to the external location would be demo/default/{schema}/{table}.
If you do not have access to the Databricks File System (DBFS), execute the following command in your Databricks destination:
DESCRIBE TABLE EXTENDED <table_name>
Here, <table_name> is the name of the Delta table you want to create in the external location.
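If you prefer to check this from a notebook cell, a hypothetical equivalent in Scala (with a placeholder table name) is shown below; the Location row in the output reports the table’s external path:
%scala
// Inspect the table metadata; the Location row shows where the Delta files live.
display(spark.sql("DESCRIBE TABLE EXTENDED my_schema.my_table"))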
Sony is a technical writer with over six years of experience, including three years as a writer and three years as a teacher. She leverages her Master’s degree in Computer Science to craft engaging and informative articles that span a broad spectrum of topics within data science, machine learning, and AI. Her dedication to excellence and passion for education are evident in her numerous published works, enlightening and empowering data professionals.