Apache Kafka is an open-source event streaming platform that collects, stores, organizes, and manages real-time data flowing through Kafka servers. Because Kafka clusters can stream trillions of records per day, topics that receive many updates for the same keys can grow very large. To keep such topics manageable, you can configure a Kafka compacted topic, in which Kafka automatically removes older records for a key while retaining the most recent one.

In this article, you will gain information about Kafka Compacted Topic. You will also gain a holistic understanding of Apache Kafka, its key features, Kafka Topics, Partition, and the steps for creating a Kafka Compacted Topic to implement the process of Log Compaction in Kafka. Read along to find out in-depth information about Kafka Compacted Topics.

Prerequisites

  • A fundamental understanding of real-time data streaming
  • Basic understanding of Apache Kafka

What is Apache Kafka?

Apache Kafka was originally developed at LinkedIn to monitor activity stream data and operational metrics such as CPU, I/O usage, and request timings. In early 2011, it was open-sourced through the Apache Software Foundation. Apache Kafka is a distributed event streaming platform written in Java and Scala. It is a publish-subscribe (pub-sub) messaging solution used to build real-time streaming data pipelines and applications that react to data streams.

Kafka deals with Real-Time volumes of data and swiftly routes it to various consumers. It provides seamless integration between the information of producers and consumers without obstructing the producers and without revealing the identities of consumers to the producers. 

There are three main components in Kafka architecture: Kafka producers, brokers (servers), and consumers. Kafka producers publish or write real-time messages to Kafka servers, while Kafka consumers fetch or read messages from Kafka servers. Since external applications use Kafka to write (publish) and read (subscribe to) real-time messages, it is also called a publish-subscribe messaging service. Today, because of its rich features and functionalities, Kafka is used by prominent companies like Spotify, Netflix, Airbnb, and Uber for implementing real-time streaming services.

Kafka Core concepts: 

  • Producer: An application that sends data (message records) to the Kafka server.
  • Consumer: An application that receives data from the Kafka server in the form of message records.
  • Broker: A Kafka Server that acts as an agent/broker for message exchange.
  • Cluster: A collection of computers that each run one instance of the Kafka broker.
  • Topic: A named stream of records to which producers write and from which consumers read.
  • ZooKeeper: A coordination service that Kafka uses to store shared cluster metadata, such as broker and topic information.

Key Features of Apache Kafka 

Apache Kafka provides the following features for messaging and stream processing, which enable real-time data storage and analysis.

  • Persistent messaging: To derive real value from big data, information loss cannot be tolerated. Apache Kafka is designed with O(1) disk structures that deliver constant-time performance even with very high volumes of stored messages (in the TBs).
  • High throughput: Kafka was designed to work with large amounts of data and support millions of messages per second.
  • Distributed event streaming platform: Apache Kafka partitions messages across Kafka servers and distributes consumption over a cluster of consumer systems while preserving per-partition ordering semantics.
  • Real-time solutions: Messages created by producer threads should be instantly available to consumer threads. This characteristic is essential in event-based systems like Complex Event Processing (CEP).

What are Kafka Topics?

In Kafka servers, real-time messages are stored in topics, and each topic is divided into partitions. A single Kafka topic can have multiple partitions, which lets Kafka scale a topic across several servers or brokers and, when the partitions are replicated, tolerate broker failures, as shown in the image below.

Partitions in Kafka are nothing but the collection of segments inside which real-time messages are separately stored in a sequenced or ordered format, as shown below. The Kafka topic partition follows the append-only mechanism that organizes the incoming messages with respect to their arrival time.
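
The append-only behavior described above can be illustrated with a small sketch. This is a toy in-memory model, not Kafka’s storage format; the class name PartitionLog and its methods are hypothetical names chosen for illustration. Kafka offsets start at 0, whereas the figures in this article count from 1.

```python
class PartitionLog:
    """Toy in-memory model of one topic partition (not Kafka's on-disk format)."""

    def __init__(self):
        self._records = []  # list of (offset, key, value), in arrival order

    def append(self, key, value):
        """Append a record at the end of the log and return its offset."""
        offset = len(self._records)  # offsets grow by one per appended record
        self._records.append((offset, key, value))
        return offset

    def read(self, from_offset):
        """Return all records at or after from_offset, in offset order."""
        return self._records[from_offset:]


log = PartitionLog()
log.append("p3", "10$")
log.append("p5", "7$")
print(log.read(0))
```

Because records are only ever added at the end, a consumer can resume reading from any offset it has remembered.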

Furthermore, Kafka producers publish messages into Kafka topics in a key-value pair format. Inside Kafka topics, the topic partition always contains records or messages in the form of key-value pairs. In the Key-value pair format, the key is a parameter used for record partitioning while the value represents the specific data produced by Kafka producers.
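
As a rough sketch of how a key can determine the partition, the snippet below hashes the key and takes the result modulo the partition count. Kafka’s default partitioner uses a murmur2 hash for keyed records; the CRC32 hash and the function name choose_partition here are stand-ins assumed for illustration only.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index.

    Illustrative only: uses CRC32 instead of the murmur2 hash that
    Kafka's default partitioner applies to keyed records.
    """
    return zlib.crc32(key) % num_partitions


records = [(b"p3", "10$"), (b"p5", "7$"), (b"p3", "11$")]
for key, value in records:
    print(key.decode(), value, "-> partition", choose_partition(key, 3))
```

The point of the sketch is that records with the same key always land in the same partition, which is what allows compaction to work per partition.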

For example, consider the following image where the Offset parameter represents the sequential order of messages according to the arrival time. The Key and Value parameters represent the respective Key-Value pair through which Kafka producers published messages inside Kafka topics.

Using the partitioning mechanism described above, Kafka producers continuously publish real-time messages into Kafka topics inside Kafka servers. Kafka consumers fetch real-time messages from the respective Kafka servers according to their business needs or use cases. However, Kafka applies a retention policy that determines how long real-time messages are stored inside Kafka servers.

Apache Kafka’s retention policy is of two types:

  • Time-based retention
  • Size-based retention.

Time-based retention defines how long published messages are stored inside Kafka servers. The retention period is configured on the broker or per topic, not by producers. The default retention period in Kafka is seven days. You can increase or decrease it based on your preferences and use cases.

Size-based retention is the policy by which Kafka servers retain real-time messages only up to a certain storage limit per partition. Once the limit is reached, the oldest messages are removed from the Kafka server. There is also a third way of retaining real-time messages in the Kafka server for a more extended period of time. This form of message retention is called Log Compaction, which is one of the message cleanup policies of Kafka.
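
The two retention types can be sketched as follows. This is a deliberately simplified model under stated assumptions: the Segment class, the apply_retention function, and the exact deletion rule are hypothetical, and the real broker applies additional rules when deleting log segments. A retention_bytes value of -1 is treated as "no size limit", mirroring Kafka’s default for retention.bytes.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    last_modified_ms: int  # when the segment was last written to
    size_bytes: int


def apply_retention(segments, now_ms, retention_ms, retention_bytes=-1):
    """Return the segments that survive, given oldest-first input.

    Simplified model: real brokers apply further rules per segment.
    """
    # Time-based retention: drop segments older than the retention period.
    kept = [s for s in segments if now_ms - s.last_modified_ms <= retention_ms]
    # Size-based retention: drop the oldest segments until under the limit.
    if retention_bytes >= 0:
        while kept and sum(s.size_bytes for s in kept) > retention_bytes:
            kept.pop(0)
    return kept


segments = [Segment(0, 100), Segment(1000, 100), Segment(2000, 100)]
print(apply_retention(segments, now_ms=2500, retention_ms=2000))
print(apply_retention(segments, now_ms=2500, retention_ms=2000, retention_bytes=150))
```

Note that both policies discard data purely by age or volume, without looking at record keys. That is the gap log compaction fills.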

What are Compacted Topics in Apache Kafka?

In Apache Kafka, compacted topics are a specialized type of topic designed to store the most recent value for each unique key within the topic. This mechanism is crucial for scenarios where maintaining the latest state for specific keys is paramount.

How Does Compaction Work?

Head and Tail: The log of a compacted topic partition consists of two distinct sections:

  • Head: The newer portion of the log where new messages are continuously appended. This section has not been compacted yet and may contain duplicate keys.
  • Tail: The older portion of the log that has already been compacted. After a cleaning pass, it holds only the most recent value seen for each key.

Compaction Process:

  • Kafka tracks the ratio of “dirty” data (the uncompacted head, which may contain duplicate keys) to the total data within the partition log.
  • When this ratio exceeds the threshold set by min.cleanable.dirty.ratio (0.5, that is 50%, by default), the partition becomes eligible for compaction.
  • During compaction, a background cleaner thread scans the head to find the latest offset for each key, then rewrites the older segments without the records that have been superseded.
  • The surviving records keep their original offsets, and the cleaned portion becomes part of the tail. The active segment, which is still being written to, is not compacted.
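
The trigger condition above can be sketched as a small calculation. The function names are hypothetical, and the byte counts are made-up inputs. Only the threshold semantics (compaction becomes eligible when the dirty ratio exceeds min.cleanable.dirty.ratio, 0.5 by default) come from the description above.

```python
def dirty_ratio(clean_bytes, dirty_bytes):
    """Fraction of the log that has not been compacted yet."""
    total = clean_bytes + dirty_bytes
    return dirty_bytes / total if total > 0 else 0.0


def is_eligible_for_compaction(clean_bytes, dirty_bytes,
                               min_cleanable_dirty_ratio=0.5):
    """True when the dirty ratio exceeds the configured threshold."""
    return dirty_ratio(clean_bytes, dirty_bytes) > min_cleanable_dirty_ratio


# Dirty ratio 0.4: below the default threshold, so no compaction yet.
print(is_eligible_for_compaction(clean_bytes=600, dirty_bytes=400))
# Dirty ratio 0.7: above the default threshold, so the log is eligible.
print(is_eligible_for_compaction(clean_bytes=300, dirty_bytes=700))
```

Lowering the threshold makes compaction run more often at the cost of more cleaner I/O, which is the trade-off behind this setting.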

What is the Log Compaction process in Kafka?

Unlike time-based and size-based retention, where Kafka removes messages once the time or size limit is reached regardless of their content, log compaction decides what to remove based on the record key. In other words, with log compaction, Kafka selectively removes messages from each topic partition when a newer record with the same key exists.

Usually, Kafka producers publish messages into Kafka topics in the form of key-value pairs, where each key has its respective value or message. In some cases, Kafka producers continuously publish new messages under the same key in a topic partition log. Consequently, the same key appears several times in the partition with different values, which makes it harder for Kafka consumers to determine the current value for a key. To avoid this and keep only relevant data on Kafka servers, Kafka brokers run a log compaction process that selectively removes old messages once the topic partition has received a newer record under the same key.

For example, consider the above-given image. Initially, the topic partition holds offsets 1, 2, 3, and 4 with the keys p3, p5, p3, and p6, so there are three distinct keys and p3 appears twice. The keys p5 and p6 have the values 7$ and 25$, while the p3 key has two different values, 10$ and 11$. In such cases, log compaction removes the older values of a repeated key based on the offset order within a single topic partition. The log compaction process retains at least the last known or latest record of each repeated key in a topic partition inside Kafka servers.

After log compaction runs in the background thread, the topic partition resembles the above image. The log compaction process removed the old record 10$ of the p3 key at offset 1 while retaining the latest record 11$ of the same key.

At this stage, assume a new set of Kafka producers appends another set of real-time messages to the same Kafka partition with the duplicate keys p5 and p6. As shown in the above image, the new messages are appended to the end of the partition log, and the offset count increases accordingly in the topic partition. The background thread of Kafka now detects the repeated values under the same keys and is ready to run the log compaction process.

After log compaction runs, the topic partition resembles the image given above. Before compaction, the topic partition holds three keys, p3, p5, and p6, where p5 and p6 appear three times and twice, respectively. With log compaction, the older values with duplicate keys are removed while the most recently arrived message for each key is retained. Finally, the retained values are 11$, 12$, and 17$, with the unique keys p3, p6, and p5, respectively. With this log compaction process, Kafka guarantees that at least the last state of each key present in the Kafka partition is retained inside Kafka servers.
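
The walkthrough above can be reproduced with a short sketch that keeps only the highest-offset record for each key. The compact function is a hypothetical helper, not Kafka’s log cleaner. The first four records come from the example. The offsets 5 and 6 for the appended p6 and p5 records are assumptions, and the intermediate p5 record is omitted because its value is not stated in the text.

```python
def compact(log):
    """Keep only the latest record per key, preserving offset order.

    log is a list of (offset, key, value) tuples sorted by offset.
    """
    latest_offset = {}
    for offset, key, _value in log:
        latest_offset[key] = offset  # later offsets overwrite earlier ones
    # Surviving records keep their original offsets.
    return [rec for rec in log if latest_offset[rec[1]] == rec[0]]


initial = [(1, "p3", "10$"), (2, "p5", "7$"), (3, "p3", "11$"), (4, "p6", "25$")]
print(compact(initial))

# Assumed offsets for the appended records (not given in the article).
appended = initial + [(5, "p6", "12$"), (6, "p5", "17$")]
print(compact(appended))
```

The first call drops the p3 record with the value 10$, and the second call leaves 11$, 12$, and 17$ for p3, p6, and p5, matching the result described above.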

Creating a Kafka Compacted Topic for Log Compaction

Log compaction is enabled through topic configuration, so you can set it up by running commands in a CLI (Command Line Interface) tool when you create the topic. Before creating a Kafka log compacted topic, you have to set up and start the Kafka environment.

  • Step 1: Start the ZooKeeper and Kafka server instances by executing the following commands.
      1. To start the ZooKeeper instance:
         bin/zookeeper-server-start.sh config/zookeeper.properties
      2. To start the Kafka server:
         bin/kafka-server-start.sh config/server.properties
  • Step 2: With the Kafka server and ZooKeeper instance running, open a new command prompt and run the following command to create a Kafka compacted topic. On Kafka 3.0 and later, the --zookeeper option has been removed from kafka-topics, so pass --bootstrap-server localhost:9092 instead.
    kafka-topics --create --zookeeper zookeeper:2181 --topic latest-product-price --replication-factor 1 --partitions 1 --config "cleanup.policy=compact" --config "delete.retention.ms=100" --config "segment.ms=100" --config "min.cleanable.dirty.ratio=0.01"
  • Step 3: The above command creates a new Kafka topic named latest-product-price with a single partition. The parameter cleanup.policy=compact enables log compaction for the topic, while delete.retention.ms=100 sets how long (in milliseconds) deletion markers are retained after cleanup. The low values for segment.ms and min.cleanable.dirty.ratio make segments roll and become eligible for cleaning quickly, so that compaction can be observed in a test. Next, you have to produce some records in the newly created Kafka compacted topic to check whether log compaction works correctly.
  • Step 4: Start a new producer console and consumer console in Kafka by executing the following commands. In newer Kafka versions, the --broker-list option of the console producer is deprecated in favor of --bootstrap-server.
      1. To start a producer console:
         kafka-console-producer --broker-list localhost:9092 --topic latest-product-price --property parse.key=true --property key.separator=:
      2. To start a consumer console:
         kafka-console-consumer --bootstrap-server localhost:9092 --topic latest-product-price --property print.key=true --property key.separator=: --from-beginning
  • Step 5: In the newly created producer console, publish the following messages in the form of key-value pairs, as shown in the below image.
  • Step 6: The above messages are then consumed in the consumer console. The consumed messages will resemble the below-given image.
  • Step 7: On examining the above output, you can confirm that log compaction is applied to the topic partition. Kafka removed the old values of the duplicated keys while retaining the latest values, so that keys such as p3, p6, and p5 each have a single value inside the topic partition.
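
The deletion markers mentioned in Step 3 are records with a null value (often called tombstones) that tell Kafka to remove a key entirely. The sketch below models how such a marker might be handled. It is a simplified assumption-based model: the function name is hypothetical, and it measures the marker’s age from the record timestamp, whereas the real cleaner tracks this differently.

```python
def compact_with_tombstones(log, now_ms, delete_retention_ms):
    """Compact a log and drop expired deletion markers.

    log is a list of (offset, key, value, timestamp_ms) sorted by offset.
    A value of None is a deletion marker (tombstone) for that key.
    Simplified: marker age is taken from the record timestamp.
    """
    latest_offset = {}
    for offset, key, _value, _ts in log:
        latest_offset[key] = offset
    kept = []
    for offset, key, value, ts in log:
        if latest_offset[key] != offset:
            continue  # superseded by a newer record with the same key
        if value is None and now_ms - ts > delete_retention_ms:
            continue  # expired marker: the key disappears entirely
        kept.append((offset, key, value, ts))
    return kept


log = [
    (0, "p3", "11$", 1_000),
    (1, "p5", "17$", 1_050),
    (2, "p3", None, 1_100),  # deletion marker for key p3
]
# Within the retention window, the marker is still visible to consumers.
print(compact_with_tombstones(log, now_ms=1_150, delete_retention_ms=100))
# After the window, the marker is removed and p3 is gone.
print(compact_with_tombstones(log, now_ms=1_500, delete_retention_ms=100))
```

Keeping the marker for a while gives consumers that are slightly behind a chance to see that the key was deleted.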

Conclusion 

In this article, you have learned about Kafka Compacted Topics. This article also provided information on Apache Kafka, its key features, Kafka Topics, Partitions, and the steps for creating a Kafka Compacted Topic to implement Log Compaction in Kafka. Besides compaction, Kafka offers the delete cleanup policy, which covers the time-based and size-based retention described above, and both policies can be combined on a single topic. With a fundamental understanding of log compaction as a base, you can explore these options next.

Frequently Asked Questions

1. What is a compacted topic in Kafka?

Compacted Topics in Kafka are a type of topic where log compaction is applied. Log compaction ensures that the latest value for each key is retained in the topic, even if older messages are deleted.

2. What triggers compaction in Kafka?

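Compaction is triggered when the ratio of dirty (not yet compacted) data in a partition log exceeds min.cleanable.dirty.ratio, which is 0.5 by default. In newer Kafka versions, max.compaction.lag.ms can also force compaction of records that have remained uncompacted for longer than the configured time.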

3. Is Kafka topic push or pull?

Kafka uses a pull model for consumers (they pull messages) and a push model for producers (they push messages).

Ishwarya M
Technical Content Writer, Hevo Data

Ishwarya is a skilled technical writer with over 5 years of experience. Having worked extensively with B2B SaaS companies in the data industry, she channels her passion for data science into producing informative content that helps individuals understand the complexities of data integration and analysis.