Kafka Replication: Easy Guide 101


Kafka Replication_Fi

Organizations employ data-driven ways to operate their business operations efficiently in order to stay ahead of their competitors. Lagging applications or websites can stifle business expansion. Advanced-Data Pipelining and Data Streaming microservices are required to provide high performance, scalability, and flexibility to business applications.

Apache Kafka is an event streaming platform and a pub-sub system that allows users to read and publish data more easily. It is used by companies to distribute events at high throughput. Developers can use Kafka Data Pipeline to stream data in real-time from source to target with high throughput.

Are you looking to set up Kafka Replication? Don’t worry, we have you covered. This blog will act as your guide in understanding how Kafka Replication works and how you can configure it easily.

Table of Contents

What is Kafka?

Kafka Logo | Hevo Data
Image Source

Kafka is a stream-based, distributed message broker software that receives messages from publishers and distributes them to subscribers. Kafka stores messages in physically distributed Locations, Processes, Streams, and Response to events. 

To reduce the overhead of network round trips, Kafka groups messages together forming the “Message Set” abstraction, which leads to larger Network Packets, larger Sequential Disk operations, contiguous Memory Blocks, etc., allowing Kafka to turn a bursty stream of random message writes into linear writes. Kafka is used for Event Processing, Real-Time Monitoring, Log Aggregation, and Queuing. 

Key Features of Kafka

Apache Kafka is widely popular because of its capabilities that maintain availability, simplify scaling, and allow it to handle massive volumes, among other things. Take a look at some of the powerful features it provides:

  • Extensibility: Since Kafka’s recent popularity, various other software programs have developed connectors. This facilitates the creation of new features, such as integration with other programs. See how you can use Kafka to interface with Redshift and Salesforce.
  • Log Aggregation: Data recording from multiple system components must be centralized to a single area because a modern system is typically distributed. Kafka frequently acts as a single source of truth by centralizing data from all sources, regardless of shape or volume.
  • Stream Processing: It is Kafka’s main skill. It allows him to perform real-time calculations on Event Streams. From real-time data processing to dataflow programming, Kafka ingests, stores, and analyses stream of data as they are created.
  • Scalable: Kafka’s partitioned log model distributes data over multiple servers, allowing it to extend beyond the capacity of a single server.
  • Fast: Kafka decouples data streams, resulting in extremely low latency and great throughput.
  • Metrics and Monitoring: Kafka is frequently used to track operational data for metrics and monitoring. This requires gathering data from several apps and consolidating it into consolidated feeds with real-time measurements.
  • Durable: Data is written to a disc, and partitions are spread and duplicated across multiple servers to ensure durability. This protects data from server failure and makes it fault-tolerant and long-lasting.
  • Fault-Tolerant: The Kafka cluster can handle failures in the master and database. It’s capable of restarting the server on its own.

In the next sections, you will understand Data Organization in Kafka and also learn about Kafka Replication in detail.

Simplify ETL with Hevo’s No-code Data Pipelines

Hevo Data, a No-Code Data Pipeline, helps to transfer data from 100+ sources including 40+ Free Sources to your desired data warehouse/ destination and visualize it in a BI Tool. Hevo is fully managed and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination. It allows you to focus on key business needs and perform insightful analysis using various BI tools such as Power BI, Tableau, etc. 


Check out what makes Hevo amazing:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.

Simplify your data analysis with Hevo today!


Data Organization in Kafka 

Kafka manages data in logically separate Topic. A Topic is a collection of semantically similar records. e.g. Locational data of all parcels in transit can form a Topic. 

The records within a Topic are stored in partitions where each partition can be stored in a separate machine, easing parallel reads and availability. The number of partitions in a Topic must be declared at the time of Topic creation. A low number of partitions eases Distributed Clustering, and a higher number of partitions per topic will lead to improved Throughput but a higher risk of Unavailability and higher end-to-end Latency. 

kafka replication - data organisation | Hevo Data
Image Source

Each message in a partition is assigned a unique integer value called Offset. Kafka assures that Offset i will always be processed before offset i+1. Within a partition, all messages are stored in a sorted manner, based on each message’s Offset. This arrangement creates what is called a “Write-Ahead Log“.

Now, that you have understood how the data is organized in Kafka, let’s discuss what is Kafka Replication in the next section.

What is Kafka Replication?

kafka replication zookeper
Image Source

In this section, you will understand Kafka Replication. In addition, you will learn about how Zookeeper helps in Kafka Replication.

In Kafka parlance, Kafka Replication means having multiple copies of the data, spread across multiple servers/brokers. This helps in maintaining high availability in case one of the brokers goes down and is unavailable to serve the requests. Before we discuss methods to achieve useful Kafka Replication, let’s familiarize ourselves with some key concepts and terminology. 

Kafka Replication is allowed at the partition level, copies of a partition are maintained at multiple broker instances using the partition’s Write-Ahead Log. Amongst all the replicas of a partition, Kafka designates one of them as the “Leader” partition and all other partitions are followers or “in-sync” partitions. 

The Leader is responsible for receiving as well as sending data, for that partition. The total number of replicas including the leader constitute the Replication factor. To maintain these clusters and the topics/partitions within, Kafka has a centralized service called the Zookeeper

Zookeeper takes care of the synchronization between the distributed clusters and manages the configurations, controlling and naming. Zookeeper Atomic Broadcast (ZAB) protocol is the brain of the whole system. Each replica or Node, sends a “Keep-Alive” message to Zookeeper at regular intervals, thereby informing the Zookeeper that it’s alive and functional. If the Zookeeper does not receive this Keep-Alive message ( called heartbeat) within the designated configurable time ( 6000ms, by default), it assumes that the node is dead and if this node was a leader, a new leader election takes place. 

The parameter zookeeper.session.timeout.ms milliseconds is set to 6000 by default. Also, the node must not have a substantial backlog of messages that it did not receive from the Leader and did not process, i.e., the difference between the Leader’s Offset and Replica’s Offset must be less than a prescribed limit. 

The parameter replica.lag.max.messages, decides the allowed difference between Replica’s Offset and Leader’s Offset. If this difference is more than replica.lag.max.messages-1, then the node is considered lagging behind and is removed from the list of in-sync nodes, by the leader. 

Hence, a node is considered alive by Kafka if and only if, it meets the following two conditions:

  •  A node must be able to maintain its session with the ZooKeeper via ZooKeeper’s heartbeat mechanism. 
  •  If it is a follower it must replicate the writes happening on the leader and not fall “too far” behind. 

All nodes that are alive and in-sync, form the In-Sync Replica Set (ISR). Now, if all the in-sync nodes have applied a message to their respective logs, this message is considered committed and is then sent to the consumers. This way, Kafka guarantees that a committed message will not be lost, as long as there is at least one alive and in sync replica, at all times.
An out-of-sync node is allowed to rejoin the ISR if it can re-sync fully again, even if it lost some data due to its crash.

Some other important parameters to be configured are:

  • min.insync.replicas: Specifies the minimum number of replicas that must acknowledge a write for the write to be considered successful. 
  • offsets.retention.check.interval.ms: Frequency at which to check for stale Offsets.
  • offsets.topic.segment.bytes: This should be kept relatively small in order to facilitate faster Log Compaction and Cache Loads.
  • replica.lag.time.max.ms: If the follower has not consumed the Leaders log OR sent fetch requests, for at least this much time, it is removed from the ISR.
  • replica.fetch.wait.max.ms: Max wait time for each fetcher request issued by follower replicas, must be less than the replica.lag.time.max.ms to avoid shrinking of ISR.
  • transaction.max.timeout.ms: In case a client requests a timeout greater than this value, it’s not allowed so as to not stall other consumers. 
  • zookeeper.session.timeout.ms: Zookeeper session timeout.
  • zookeeper.sync.time.ms: How far a follower can be behind a Leader, setting this too high can result in an ISR that has potentially many out-of-sync nodes.

Kafka Replication can be a tiresome task without the right set of tools. Hevo’s Data Replication & Integration platform empowers you with everything you need to have a smooth Data Collection, Processing, and Replication experience. It helps you transfer data from a source of your choice without writing any code.

Why do Replicas Lag Behind?

For a variety of reasons, a follower clone may lag behind the leader.

  • Slow Replica: A replica may be unable to keep up with the rate at which the leader receives new messages if the rate at which the leader receives messages is greater than the rate at which the replica copies messages, resulting in an IO bottleneck.
  • Stuck Replica: If a replica has ceased requesting fresh messages from the leader owing to factors such as a dead replica or GC blocking the replica (Garbage collector).
  • Bootstrapping Replica: When the user increases the topic’s replication factor, the new follower replicas are out of sync until they catch up to the leader’s log.

How do you determine that a replica is lagging?

In all circumstances, this model for detecting out-of-sync stuck clones works well. It keeps track of how long a follower duplicate has been alive without sending a fetch request to the leader. The model for detecting out-of-sync slow replicas using the number of messages, on the other hand, only works well if these parameters are set for a single topic or multiple topics with similar traffic patterns, but it doesn’t scale to the variety of workloads across all topics in a production cluster.

For example, if foo receives data at a rate of 2 msg/sec and a single batch received on the leader rarely exceeds 3 messages, replica.lag.max.messages for that topic can be set to 4. Since the follower logs will be no more than three messages behind the leader’s logs after the largest batch is appended to the leader and before the follower replicas duplicate those messages. At the same time, you want the leader to remove the sluggish follower replica and prevent the message write latency from growing if the follower replicas for topic foo start lagging behind the leader by more than 3 messages.

The purpose of replica.lag.max.messages is to be able to detect replicas that are out of sync with the leader on a regular basis. Let’s say traffic on the same subject grows naturally or as a result of a spike, and the producer sends a batch of 4 messages, which is equal to the specified number for replica.lag.max.messages=4. Both follower replicas will be moved out of the ISR at that point since they are out of sync with the leader.

However, because both follower replicas are still alive, in the next retrieve request, they will catch up to the leader’s log end offset and be added back to the ISR. If the producer continues to transmit a huge batch of messages to the leader, the process will repeat. This illustrates the situation in which follower replicas shuttle in and out of the ISR, causing false warnings.
This is the root of the replica.lag.max.messages issue. It defines replication configurations using a value that the user must guess and is unsure of at the time of configuration.


To summarize, we have discussed the importance of Kafka and how to use it to its optimal efficiency.  If you’re looking to enhance the scalability, fault-tolerance, and other features for an optimized Kafka Replication, this is a combination that you must implement. If you’re comfortable with manually configuring the Kafka Replication, you can follow the above-mentioned steps for Kafka Replication.


Hevo Data, a No-code Data Pipeline, helps you transfer data from a source of your choice in a fully automated and secure manner without having to write the code repeatedly. Hevo, with its strong integration with 100+ sources & BI tools, such as Kafka allows you to not only export & load data but also transform & enrich your data & make it analysis-ready in a jiff.

Want to take Hevo for a spin? 

SIGN UP and experience the feature-rich Hevo suite first hand. You can also have a look at our unbeatable pricing that will help you choose the right plan for your business needs!

Have any further queries about Kafka Replication? Get in touch with us in the comments section below.

Pratik Dwivedi
Technical Content Writer, Hevo Data

Pratik Dwivedi is a seasoned author specializing in data industry topics, including data analytics, machine learning, AI, big data, and business intelligence. With over 18 years of experience in system analysis, design, and implementation, including 8 years in a Techno-Managerial role, Pratik has successfully managed international clients and led small to medium-sized teams and projects. He excels in creating engaging content that informs and inspires.

No-Code Data Pipeline for all your Data