Kafka Replication: Easy Guide 101

on Data Driven, Data Integration, Data Replication • December 7th, 2021

Are you looking to set up Kafka Replication? Don’t worry, we have you covered. This blog will act as your guide in understanding how Kafka Replication works and how you can configure it easily.


What is Kafka?


Kafka is a stream-based, distributed message broker that receives messages from publishers and distributes them to subscribers. Kafka stores messages across physically distributed locations, processes streams of records, and responds to events as they occur.

To reduce the overhead of network round trips, Kafka groups messages together into a “Message Set” abstraction. This leads to larger network packets, larger sequential disk operations, and contiguous memory blocks, allowing Kafka to turn a bursty stream of random message writes into linear writes. Kafka is used for Event Processing, Real-Time Monitoring, Log Aggregation, and Queuing.
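
For instance, on the producer side you can encourage larger Message Sets by tuning linger.ms and batch.size. Below is a minimal Java sketch; the broker address and the “example-topic” topic name are placeholder assumptions you would replace with your own:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Wait up to 10 ms so more records can be grouped into one batch (Message Set)
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
        // Allow batches of up to 64 KB before forcing a send
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("example-topic", "key-" + i, "value-" + i));
            }
        } // close() flushes any remaining batched records
    }
}

Here, linger.ms trades a few milliseconds of latency for larger batches; the right values depend on your throughput and latency budget.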

In the next sections, you will understand Data Organization in Kafka and also learn about Kafka Replication in detail.

Simplify ETL with Hevo’s No-code Data Pipelines

Hevo Data, a No-Code Data Pipeline, helps to transfer data from 100+ sources including 30+ Free Sources to your desired data warehouse/destination and visualize it in a BI Tool. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination. It allows you to focus on key business needs and perform insightful analysis using various BI tools such as Power BI, Tableau, etc. 

GET STARTED WITH HEVO FOR FREE

Check out what makes Hevo amazing:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.

Simplify your data analysis with Hevo today!

SIGN UP HERE FOR A 14-DAY FREE TRIAL!

Data Organization in Kafka 

Kafka manages data in logically separate Topics. A Topic is a collection of semantically similar records; for example, the location data of all parcels in transit could form a Topic.

The records within a Topic are stored in partitions, where each partition can be stored on a separate machine, enabling parallel reads and higher availability. The number of partitions in a Topic must be declared at the time of Topic creation (it can be increased later, but never decreased). A low number of partitions eases distributed clustering, while a higher number of partitions per Topic improves throughput at the cost of a higher risk of unavailability and higher end-to-end latency.
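
As an illustration, here is a minimal sketch that creates a Topic with 6 partitions and a replication factor of 3 using Kafka's Java AdminClient; the broker address and the “parcel-locations” topic name are placeholder assumptions:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3 (replication is covered in the next section)
            NewTopic topic = new NewTopic("parcel-locations", 6, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}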


Each message in a partition is assigned a unique integer value called an Offset. Kafka guarantees that Offset i will always be processed before Offset i+1. Within a partition, all messages are stored in sorted order, based on each message’s Offset. This arrangement creates what is called a “Write-Ahead Log”.
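
A consumer observes this ordering directly. The following minimal sketch prints each record’s partition and Offset; the broker address, group id, and topic name are placeholder assumptions:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-demo");             // placeholder group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("parcel-locations")); // placeholder topic
            while (true) { // consume indefinitely, as a long-running consumer would
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Within each partition, offsets arrive in strictly increasing order
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}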

Now that you have understood how data is organized in Kafka, let’s discuss what Kafka Replication is in the next section.

What is Kafka Replication?


In this section, you will understand Kafka Replication. In addition, you will learn about how Zookeeper helps in Kafka Replication.

In Kafka parlance, Kafka Replication means having multiple copies of the data, spread across multiple servers/brokers. This helps in maintaining high availability in case one of the brokers goes down and is unavailable to serve the requests. Before we discuss methods to achieve useful Kafka Replication, let’s familiarize ourselves with some key concepts and terminology. 

Kafka Replication operates at the partition level: copies of a partition are maintained on multiple broker instances using the partition’s Write-Ahead Log. Among all the replicas of a partition, Kafka designates one as the “Leader”; all the other replicas are followers, which strive to stay “in-sync” with the Leader.
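
You can inspect which broker currently leads each partition, the full replica list, and the in-sync replicas via the Java AdminClient. A minimal sketch, again with a placeholder broker address and topic name:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class DescribeReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin
                    .describeTopics(Collections.singleton("parcel-locations")) // placeholder topic
                    .all().get()
                    .get("parcel-locations");
            for (TopicPartitionInfo partition : description.partitions()) {
                // Leader broker, full replica list, and current in-sync replicas (ISR)
                System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                        partition.partition(), partition.leader(),
                        partition.replicas(), partition.isr());
            }
        }
    }
}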

The Leader is responsible for receiving as well as sending data for that partition. The total number of replicas, including the Leader, constitutes the Replication Factor. To maintain these clusters and the Topics/partitions within them, Kafka relies on a centralized coordination service called ZooKeeper.

ZooKeeper takes care of synchronization between the distributed brokers and manages configuration, control, and naming. The ZooKeeper Atomic Broadcast (ZAB) protocol is the brain of the whole system. Each replica, or node, sends a “Keep-Alive” message (called a heartbeat) to ZooKeeper at regular intervals, informing ZooKeeper that it is alive and functional. If ZooKeeper does not receive this heartbeat within the designated, configurable time (6000 ms by default), it assumes the node is dead, and if that node was a Leader, a new Leader election takes place.

This timeout is controlled by the parameter zookeeper.session.timeout.ms, which defaults to 6000 milliseconds. In addition, a node must not have a substantial backlog of messages that it has not yet received from the Leader and processed, i.e., the difference between the Leader’s Offset and the replica’s Offset must stay below a prescribed limit.

The parameter replica.lag.max.messages decides the allowed difference between the replica’s Offset and the Leader’s Offset. If the difference reaches replica.lag.max.messages, the node is considered to be lagging behind and is removed from the list of in-sync nodes by the Leader. (Note that from Kafka 0.9.0 onwards, replica.lag.max.messages was removed, and lag is judged purely by time via replica.lag.time.max.ms.)

Hence, a node is considered alive by Kafka if and only if it meets the following two conditions:

  • The node must be able to maintain its session with ZooKeeper via ZooKeeper’s heartbeat mechanism.
  • If it is a follower, it must replicate the writes happening on the Leader and not fall “too far” behind.

All nodes that are alive and in-sync form the In-Sync Replica set (ISR). Once all the in-sync nodes have applied a message to their respective logs, the message is considered committed and is then made available to consumers. This way, Kafka guarantees that a committed message will not be lost, as long as there is at least one alive and in-sync replica at all times. An out-of-sync node is allowed to rejoin the ISR once it has fully re-synced, even if it lost some data due to its crash.
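
On the producer side, you can tie delivery to this commit semantic by setting acks=all, so that a send is acknowledged only after the full ISR has the record. A minimal sketch; the broker address, topic, key, and value are placeholder assumptions:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class CommittedProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // "all": the Leader waits for the full ISR to replicate the record before acknowledging
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            RecordMetadata meta = producer
                    .send(new ProducerRecord<>("parcel-locations", "parcel-42", "in transit"))
                    .get(); // blocks until the record is committed to the ISR
            System.out.printf("committed at partition=%d offset=%d%n", meta.partition(), meta.offset());
        }
    }
}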

Some other important parameters to be configured are:

  • min.insync.replicas: Specifies the minimum number of replicas that must acknowledge a write (when the producer uses acks=all) for the write to be considered successful; a sketch for setting this on a Topic follows this list.
  • offsets.retention.check.interval.ms: The frequency at which to check for stale Offsets.
  • offsets.topic.segment.bytes: This should be kept relatively small in order to facilitate faster Log Compaction and cache loads.
  • replica.lag.time.max.ms: If a follower has not consumed up to the end of the Leader’s log, or has not sent fetch requests, for at least this much time, it is removed from the ISR.
  • replica.fetch.wait.max.ms: The maximum wait time for each fetch request issued by follower replicas; it must be less than replica.lag.time.max.ms to avoid the ISR shrinking for low-throughput Topics.
  • transaction.max.timeout.ms: If a client requests a transaction timeout greater than this value, the request is rejected, so that a single client cannot stall other consumers.
  • zookeeper.session.timeout.ms: The ZooKeeper session timeout, discussed above.
  • zookeeper.sync.time.ms: How far a ZooKeeper follower can be behind the ZooKeeper leader; setting this too high can result in an ISR that contains potentially many out-of-sync nodes.
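
These settings live in the broker’s server.properties, and the Topic-level ones can also be changed at runtime. As an example, here is a minimal sketch that sets min.insync.replicas on an existing Topic via the Java AdminClient’s incrementalAlterConfigs; the broker address and topic name are placeholder assumptions:

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetMinInsyncReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "parcel-locations");
            // Require at least 2 replicas to acknowledge each write when the producer uses acks=all
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("min.insync.replicas", "2"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, Collections.singletonList(op))).all().get();
        }
    }
}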

Conclusion

To summarize, we have discussed how data is organized in Kafka and how Kafka Replication keeps that data highly available. If you are looking for scalability and fault tolerance from your Kafka deployment, replication is the mechanism that delivers it, and if you are comfortable configuring Kafka manually, you can use the parameters discussed above to tune Kafka Replication for your workload.

VISIT OUR WEBSITE TO EXPLORE HEVO

Hevo Data, a No-code Data Pipeline, helps you transfer data from a source of your choice in a fully automated and secure manner without having to write any code. Hevo, with its strong integration with 100+ sources & BI tools, allows you to not only export & load data but also transform & enrich your data & make it analysis-ready in a jiffy.

Want to take Hevo for a spin? 

SIGN UP and experience the feature-rich Hevo suite first hand. You can also have a look at our unbeatable pricing that will help you choose the right plan for your business needs!

Have any further queries about Kafka Replication? Get in touch with us in the comments section below.
