You are going about your day setting up and operating your organization’s data infrastructure and preparing it for further analysis. Suddenly, you get a request from one of your team members to replicate data from Kafka to Azure.
We are here to help you out with this requirement. You can transfer data from Kafka to Azure using Apache Kafka Connect, or you can pick an automated tool to do the heavy lifting. This article provides a step-by-step guide to both approaches.
How to Connect Kafka to Azure?
To replicate data from Kafka to Azure, you can either use Apache Kafka Connect or an automated tool.
Export Kafka to Azure using Apache Kafka Connect
The Apache Kafka Connect framework facilitates the connection and transfer of data between a Kafka cluster and external systems like HDFS, MySQL, and file systems. This guide demonstrates how to use Kafka Connect with Azure Event Hubs.
Making an Event Hubs Namespace
An Event Hubs namespace is required to send and receive messages from the Event Hubs service. You should also grab the Event Hubs connection string and fully qualified domain name (FQDN) for later use.
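If you prefer working from the command line, the Azure CLI can create the namespace and print its connection string. The sketch below is illustrative only: the resource group, namespace, and region names (kafka-to-azure-rg, my-eh-namespace, eastus) are placeholders you should replace with your own, and the Kafka endpoint is only available on the Standard tier or higher.
# create a resource group and an Event Hubs namespace (Standard tier or higher exposes the Kafka endpoint)
az group create --name kafka-to-azure-rg --location eastus
az eventhubs namespace create --name my-eh-namespace --resource-group kafka-to-azure-rg --location eastus --sku Standard
# print the connection string of the default RootManageSharedAccessKey authorization rule
az eventhubs namespace authorization-rule keys list --resource-group kafka-to-azure-rg --namespace-name my-eh-namespace --name RootManageSharedAccessKey --query primaryConnectionString --output tsv
The FQDN you will need later takes the form {namespace}.servicebus.windows.net.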
Cloning the example project
Clone the Azure Event Hubs for Kafka repository and navigate to the tutorials/connect subdirectory:
git clone https://github.com/Azure/azure-event-hubs-for-kafka.git
cd azure-event-hubs-for-kafka/tutorials/connect
Configuring Kafka Connect for Event Hubs
Minimal reconfiguration is needed to redirect Kafka Connect from Kafka to Event Hubs. The following connect-distributed.properties sample shows how to configure a Connect worker to authenticate and communicate with the Kafka endpoint on Event Hubs:
bootstrap.servers={YOUR.EVENTHUBS.FQDN}:9093 # e.g. namespace.servicebus.windows.net:9093
group.id=connect-cluster-group
# connect internal topic names, auto-created if not exists
config.storage.topic=connect-cluster-configs
offset.storage.topic=connect-cluster-offsets
status.storage.topic=connect-cluster-status
# internal topic replication factors - auto 3x replication in Azure Storage
config.storage.replication.factor=1
offset.storage.replication.factor=1
status.storage.replication.factor=1
rest.advertised.host.name=connect
offset.flush.interval.ms=10000
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
# required EH Kafka security settings
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="{YOUR.EVENTHUBS.CONNECTION.STRING}";
producer.security.protocol=SASL_SSL
producer.sasl.mechanism=PLAIN
producer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="{YOUR.EVENTHUBS.CONNECTION.STRING}";
consumer.security.protocol=SASL_SSL
consumer.sasl.mechanism=PLAIN
consumer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="{YOUR.EVENTHUBS.CONNECTION.STRING}";
plugin.path={KAFKA.DIRECTORY}/libs # path to the libs directory within the Kafka release
Running Kafka Connect
In this stage, a Kafka Connect worker is launched locally in distributed mode and the cluster state is maintained via Event Hubs.
- Step 1: Save the connect-distributed.properties file shown above locally. Be sure to replace every value enclosed in curly braces with the values for your Event Hubs namespace.
- Step 2: Navigate to the location of your Kafka release on your machine.
- Step 3: Run ./bin/connect-distributed.sh /PATH/TO/connect-distributed.properties. Once you see “INFO Finished starting connectors and tasks,” the Connect worker REST API is ready for use; a quick check is shown below.
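As a quick sanity check (not part of the original walkthrough), you can hit the worker’s REST API: the root endpoint returns the worker version, and /connector-plugins lists the plugins discovered on plugin.path.
# confirm the Connect worker is reachable and see its version
curl -s http://localhost:8083/
# list discovered connector plugins; FileStreamSource and FileStreamSink should appear if the connect-file jar is present in libs
curl -s http://localhost:8083/connector-plugins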
Creating Connectors
This section walks you through spinning up the FileStreamSource and FileStreamSink connectors.
- Step 1: Create a directory for the data files used for input and output.
mkdir ~/connect-quickstart
- Step 2: Generate two files: one containing seed data that the FileStreamSource connection reads, and the other containing destination data for our FileStreamSink connector.
seq 1000 > ~/connect-quickstart/input.txt
touch ~/connect-quickstart/output.txt
- Step 3: Create a connector for FileStreamSource. Make sure to replace the curly braces with the path to your home directory.
curl -s -X POST -H "Content-Type: application/json" --data '{"name": "file-source","config": {"connector.class":"org.apache.kafka.connect.file.FileStreamSourceConnector","tasks.max":"1","topic":"connect-quickstart","file": "{YOUR/HOME/PATH}/connect-quickstart/input.txt"}}' http://localhost:8083/connectors
After running the above command, you should see the Event Hub connect-quickstart on your Event Hubs instance.
- Step 4: Verify the source connector’s status.
curl -s http://localhost:8083/connectors/file-source/status
To confirm that events have arrived in the connect-quickstart topic, use Service Bus Explorer (or the console consumer shown below).
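Alternatively, if you would rather verify from the command line, a plain Kafka console consumer pointed at the Event Hubs Kafka endpoint works too. This sketch assumes a client.properties file of your own containing the same security.protocol, sasl.mechanism, and sasl.jaas.config values used in the Connect configuration above.
# consume the connect-quickstart topic directly from the Event Hubs Kafka endpoint
./bin/kafka-console-consumer.sh --bootstrap-server {YOUR.EVENTHUBS.FQDN}:9093 --topic connect-quickstart --from-beginning --consumer.config client.properties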
- Step 5: Create a FileStreamSink connector. Again, make sure to replace the curly braces with the path to your home directory.
curl -X POST -H "Content-Type: application/json" --data '{"name": "file-sink", "config": {"connector.class":"org.apache.kafka.connect.file.FileStreamSinkConnector", "tasks.max":"1", "topics":"connect-quickstart", "file": "{YOUR/HOME/PATH}/connect-quickstart/output.txt"}}' http://localhost:8083/connectors
- Step 6: Check the status of the sink connector.
curl -s http://localhost:8083/connectors/file-sink/status
- Step 7: Confirm that the data has been replicated between the two files and that it is identical in both.
# read the file
cat ~/connect-quickstart/output.txt
# diff the input and output files
diff ~/connect-quickstart/input.txt ~/connect-quickstart/output.txt
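Once you are done experimenting, you may want to clean up. The connectors can be removed through the same REST API; note that Kafka Connect’s internal topics (connect-cluster-configs, connect-cluster-offsets, connect-cluster-status) and the connect-quickstart Event Hub persist in your namespace until you delete them, for example from the Azure portal.
# remove the demo connectors from the Connect worker
curl -X DELETE http://localhost:8083/connectors/file-source
curl -X DELETE http://localhost:8083/connectors/file-sink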
The challenge arises when your business teams require fresh data from multiple reports every few hours. Because this data comes in various formats, it must be cleaned and standardized before they can make sense of it. You end up devoting substantial engineering bandwidth to building new data connectors, monitoring them for changes, and fixing pipelines ad hoc to ensure replication with zero data loss. These additional tasks can consume forty to fifty percent of the time you could have spent on your primary engineering objectives.
How about you focus on more productive tasks than repeatedly writing custom ETL scripts, downloading, cleaning, and uploading CSV files? This sounds good, right?
In that case, you can…
Automate the Data Replication process using a No-Code Tool
Let’s get into the details of how a fully automated data pipeline helps you connect Kafka to Azure. It enables you to set up zero-code and zero-maintenance data pipelines that just work. By using an automated data pipeline to simplify your data replication needs, you can leverage its salient features:
- Fully Managed: You don’t need to dedicate any time to building your pipelines. You can monitor all the processes in your pipeline, thus giving you complete control over it.
- Data Transformation: An automated data pipeline provides a simple interface to cleanse, modify, and transform your data through drag-and-drop features and Python scripts. It can accommodate multiple use cases with its pre-load and post-load transformation capabilities.
- Faster Insight Generation: It offers near real-time data replication, so you have access to real-time insight generation and faster decision-making.
- Schema Management: With the auto schema mapping feature, all source schemas are automatically detected and mapped to the destination schema.
- Scalable Infrastructure: With the increase in the number of sources and volume of data, it can automatically scale horizontally, handling millions of records per minute with minimal latency.
Azure is an upcoming destination for Hevo Data.
What can you hope to achieve by replicating data from Kafka to Azure?
Here are a few questions your data analysts can answer by replicating data from Kafka to Azure:
- Which message would move a customer through the lifecycle?
- Which channels have a good ROAS and are worth investing in?
- Which copy and channels are most effective for your target market?
- How can you make your web page more conversion-driven?
- How does your customer LTV change as a result of various targeting, creatives, or products?
Key Takeaways
These data requests from your marketing and product teams can be effectively fulfilled by replicating data from Kafka to Azure. If data replication must occur every few hours, you will have to switch to a custom data pipeline. Instead of spending months developing and maintaining such data integrations, you can enjoy a smooth ride with Hevo’s 150+ plug-and-play integrations (including 40+ free sources such as Kafka).
The main benefit of using a data pipeline for Kafka to Azure replication is replicable patterns. Others are trust in the accuracy of the data, agility and flexibility, and confidence in the pipeline’s security. Consider your priorities and choose the option that best fits your requirements.
Visit our Website to Explore Hevo
Saving countless hours of manual data cleaning and standardizing, Hevo’s pre-load data transformations get it done in minutes via a simple drag-and-drop interface or your custom Python scripts. There is no need to go to your data warehouse for post-load transformations: you can run complex SQL transformations from the comfort of Hevo’s interface and get your data into its final, analysis-ready form. We are happy to announce that we have launched Azure Synapse as a destination.
Want to take Hevo for a ride? Sign Up for a 14-day free trial and simplify your data replication process. Check out the pricing details to understand which plan fulfills all your business needs.