Apache Kafka is an open-source event streaming platform that collects, stores, and processes real-time data. Because Kafka clusters can handle trillions of records per day, topic logs grow continuously and can consume large amounts of storage. One way to keep that growth in check is a Kafka compacted topic: Kafka's log cleaner removes older records that share a key with a newer record, so that each key retains at least its latest value.
In this article, you will learn about Kafka Compacted Topics. You will also get an overview of Apache Kafka, its key features, Kafka Topics and Partitions, and the steps for creating a Kafka Compacted Topic to put Log Compaction into practice.
Prerequisites
- A fundamental understanding of real-time data streaming
- Basic understanding of Apache Kafka
What is Apache Kafka?
Apache Kafka was originally developed at LinkedIn to address their need for Monitoring Activity Stream Data and Operational Metrics such as CPU, I/O usage, and request timings. Subsequently, in early 2011, it was Open-Sourced through the Apache Software Foundation. Apache Kafka is a Distributed Event Streaming Platform written in Java and Scala. It is a Publish-Subscribe (pub-sub) Messaging Solution used to create Real-Time Streaming Data Pipelines and applications that adapt to the Data Streams.
Kafka handles large volumes of real-time data and routes it quickly to many consumers. It decouples producers from consumers: producers are not blocked by consumers, and they do not need to know who the consumers are.
There are three main components in Kafka architecture: producers, brokers (servers), and consumers. Producers publish (write) real-time messages to Kafka servers, while consumers fetch (read) messages from them. Because external applications use Kafka to publish and subscribe to real-time messages, it is also called a publish-subscribe messaging service. Today, because of its rich features and functionalities, Kafka is used by prominent companies such as Spotify, Netflix, Airbnb, and Uber to implement real-time streaming services.
Kafka Core concepts:
- Producer: An application that sends data (message records) to the Kafka server.
- Consumer: An application that receives data from the Kafka server in the form of message records.
- Broker: A Kafka Server that acts as an agent/broker for message exchange.
- Cluster: A collection of computers that each run one instance of the Kafka broker.
- Topic: A named stream of records. The name is chosen by the user.
- Zookeeper: A coordination service that stores shared cluster metadata, such as broker membership and topic configuration.
You can also have a look at Kafka documentation.
Key Features of Apache Kafka
Apache Kafka combines messaging, storage, and stream processing to enable real-time data storage and analysis. Its key features include the following:
- Persistent messaging: To derive real value from big data, information loss cannot be tolerated. Apache Kafka is built with O(1) disk structures that deliver constant-time performance even with very high volumes of stored messages (in the TBs).
- High Throughput: Kafka was designed to work with large amounts of data and support Millions of Messages per Second.
- Distributed event streaming platform: Apache Kafka partitions messages across Kafka servers and distributes consumption over a cluster of consumer systems while ensuring per-partition ordering semantics.
- Real-time solutions: Messages created by producer threads should be instantly available to consumer threads. This characteristic is essential in event-based systems like Complex Event Processing (CEP).
What are Kafka Topics?
In Kafka servers, real-time messages are stored inside topics, and each topic is divided into partitions. A single Kafka topic can have multiple partitions, which lets Kafka spread a topic across several servers or brokers for scalability, as shown in the image below. Replicating each partition across brokers is what provides fault tolerance.
A partition in Kafka is an ordered, append-only log that is stored on disk as a sequence of segments, as shown below. New messages are always appended to the end of the partition, so records are ordered by arrival time and each record receives a sequential offset.
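The append-only behaviour described above can be sketched with a small in-memory model. This is only an illustration, not Kafka's storage engine: the class name `PartitionLog` and its methods are hypothetical, and offsets start at 0 here as they do in Kafka.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class PartitionLog:
    """Toy in-memory model of one topic partition (not Kafka's real storage)."""
    records: List[Tuple[int, Any, Any]] = field(default_factory=list)
    next_offset: int = 0

    def append(self, key, value) -> int:
        """Append a record at the end of the log and return its offset."""
        offset = self.next_offset
        self.records.append((offset, key, value))
        self.next_offset += 1
        return offset

    def read_from(self, offset: int):
        """Return every record at or after the given offset, in log order."""
        return [r for r in self.records if r[0] >= offset]


log = PartitionLog()
log.append("p3", "10$")
log.append("p5", "7$")
last = log.append("p3", "11$")
print(last)              # 2
print(log.read_from(1))  # [(1, 'p5', '7$'), (2, 'p3', '11$')]
```

Records are never modified in place in this model. A newer value for a key is simply another record with a higher offset.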
Furthermore, Kafka producers publish messages into Kafka topics as key-value pairs, so every record in a topic partition has a key and a value. The key is used for record partitioning, while the value is the actual data produced by the Kafka producer. For example, consider the following image, where the Offset parameter represents the sequential order of messages according to their arrival time. The Key and Value parameters represent the key-value pairs that Kafka producers published into the Kafka topic.
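To see how a key can decide the partition, here is a minimal sketch. The function name `choose_partition` is hypothetical, and `zlib.crc32` is only a stand-in for a stable hash (Kafka's default partitioner hashes the serialized key with murmur2). The point is the principle: the same key always maps to the same partition as long as the partition count does not change.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index with a stable hash.

    zlib.crc32 is an assumption made for this sketch. It is not the hash
    function Kafka itself uses.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key) % num_partitions


for key in (b"p3", b"p5", b"p3", b"p6"):
    print(key.decode(), "-> partition", choose_partition(key, 3))
```

Because all records with the same key land in one partition, a per-partition cleanup process can see every version of that key.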
Using the above-given partitioning mechanism, Kafka producers continuously publish real-time messages into Kafka topics inside Kafka servers. Kafka consumers fetch real-time messages from the respective Kafka servers according to their business needs or use cases. However, Kafka does not keep messages forever by default: a retention policy determines how long messages are stored inside Kafka servers.
Apache Kafka’s retention policy is of two types:
- Time-based retention
- Size-based retention.
Time-based retention specifies how long published messages are stored inside Kafka servers. It is configured on the broker or per topic (for example, with the retention.ms topic setting), not by producers. The default retention period in Kafka is seven days. You can increase or decrease it based on your preferences and use cases.
Size-based retention is the policy by which Kafka servers retain real-time messages only up to a certain storage limit per partition (the retention.bytes topic setting). Once the limit is exceeded, the oldest messages are removed from the Kafka server. Besides these two deletion-based policies, there is a third way of retaining real-time messages in the Kafka server for a more extended period of time. It is called Log Compaction, which is one of the message cleanup policies of Kafka.
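The two deletion-based retention policies can be modelled roughly as follows. This is a simplified sketch under stated assumptions: the names `Segment` and `apply_retention` are hypothetical, retention is applied to whole segments, the last segment is treated as the active one and always kept, and the size check removes an old segment only while the excess over the limit covers that segment entirely.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    base_offset: int
    size_bytes: int
    newest_timestamp_ms: int  # timestamp of the newest record in the segment


def apply_retention(segments: List[Segment], now_ms: int,
                    retention_ms: int, retention_bytes: int) -> List[Segment]:
    """Return the segments that survive time- and size-based retention.

    Segments are ordered oldest first. A negative limit disables that check.
    The last segment is kept unconditionally (a simplification of this sketch).
    """
    kept = list(segments)
    # Time-based: drop old segments whose newest record is past the limit.
    if retention_ms >= 0:
        while len(kept) > 1 and now_ms - kept[0].newest_timestamp_ms > retention_ms:
            kept.pop(0)
    # Size-based: drop the oldest segment while the excess covers it fully.
    if retention_bytes >= 0:
        excess = sum(s.size_bytes for s in kept) - retention_bytes
        while len(kept) > 1 and excess >= kept[0].size_bytes:
            excess -= kept[0].size_bytes
            kept.pop(0)
    return kept


DAY_MS = 86_400_000
now = 10 * DAY_MS
segments = [
    Segment(0, 400, 1 * DAY_MS),
    Segment(100, 400, 5 * DAY_MS),
    Segment(200, 400, 9 * DAY_MS),
]
by_time = apply_retention(segments, now, 7 * DAY_MS, -1)
by_size = apply_retention(segments, now, -1, 800)
print([s.base_offset for s in by_time])  # [100, 200]
print([s.base_offset for s in by_size])  # [100, 200]
```

In both cases, records are removed because of age or volume, regardless of their keys. That is the main difference from log compaction.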
What is the Log Compaction process in Kafka?
Unlike time-based or size-based retention, where Kafka removes messages purely because a time or size limit is reached, Log Compaction cleans up messages based on their keys. In other words, within each topic partition, log compaction selectively removes older records whose key appears again in a later record. The cleanup is carried out by the Kafka broker in the background, not by hand.
Usually, Kafka Producers publish messages into Kafka Topics in the form of key-value pairs. In some cases, Kafka producers continuously publish new messages under the same key in a topic partition log. Consequently, the same key appears multiple times in the partition with different values, which forces Kafka consumers to read through outdated values during message consumption. To avoid this and keep only the relevant data on Kafka servers, Kafka brokers run a log compaction process that selectively removes old messages once the topic partition holds a newer record under the same key.
For example, consider the above-given image. Initially, the topic partition holds records at offsets 1, 2, 3, and 4 with the keys p3, p5, p3, and p6, so there are three distinct keys and p3 appears twice. The keys p5 and p6 each have a single value, 7$ and 25$, while the p3 key has two different values, 10$ and 11$. In such cases, log compaction selectively removes the old values of repeated keys based on the offset order within a single topic partition. The log compaction process retains at least the last known or latest record of each key in a topic partition inside Kafka servers.
After log compaction runs in the background thread, the topic partition will look as shown in the above image. The log compaction process removed the old record 10$ of the p3 key at offset 1 while retaining the latest record 11$ of the same key. The remaining records keep their original offsets.
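The effect of this first compaction pass can be reproduced with a few lines of Python. This is a minimal sketch of the idea (keep the highest-offset record per key), not Kafka's log cleaner. The function name `compact` is hypothetical, and the layout of the four records follows the example as described, assuming p6 sits at offset 4.

```python
from typing import List, Tuple

Record = Tuple[int, str, str]  # (offset, key, value)


def compact(records: List[Record]) -> List[Record]:
    """Keep only the highest-offset record for each key.

    Assumes the input is already in offset order. Surviving records keep
    their original offsets.
    """
    latest_offset = {}
    for offset, key, _ in records:
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [r for r in records if latest_offset[r[1]] == r[0]]


partition = [(1, "p3", "10$"), (2, "p5", "7$"), (3, "p3", "11$"), (4, "p6", "25$")]
compacted = compact(partition)
print(compacted)
# [(2, 'p5', '7$'), (3, 'p3', '11$'), (4, 'p6', '25$')]
```

Offset 1 disappears from the result, which leaves a gap in the offset sequence. Consumers reading a compacted topic should expect such gaps.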
At this stage, assume a new set of Kafka producers appends more real-time messages to the same Kafka partition, reusing the existing keys p5 and p6. As shown in the above image, the new messages are appended to the end of the partition log, and the offset increases with each appended message. The partition now contains several records under the same key again, so it becomes eligible for another round of log compaction by the background thread.
After log compaction runs again, the topic partition resembles the image given above. Before this pass, the topic partition held three keys, p3, p5, and p6, where p5 appeared three times and p6 appeared twice. With log compaction, the older values of the repeated keys are removed while the latest message for each key is retained in the topic partition. Finally, the retained values are 11$, 12$, and 17$ for the keys p3, p6, and p5, respectively. With this log compaction process, Kafka guarantees to retain at least the last state of each key present in the Kafka partition inside Kafka servers.
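One way to check the "last state of each key" guarantee is to replay both the uncompacted and the compacted log and compare the resulting key-value state. The sketch below does that for the second round of the example. The helper `latest_state` is hypothetical, and the intermediate value 15$ and the exact order of the appended records are assumptions, since the text only gives the final retained values.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, str, str]  # (offset, key, value)


def latest_state(records: List[Record]) -> Dict[str, str]:
    """Replay records in offset order and keep the last value seen per key."""
    state: Dict[str, str] = {}
    for _, key, value in sorted(records):
        state[key] = value
    return state


# Log after the first compaction plus the newly appended records.
# The value "15$" and the order of offsets 5-7 are hypothetical.
full_log = [
    (2, "p5", "7$"), (3, "p3", "11$"), (4, "p6", "25$"),
    (5, "p5", "15$"), (6, "p6", "12$"), (7, "p5", "17$"),
]
# The records the example reports as retained after the second compaction.
compacted_log = [(3, "p3", "11$"), (6, "p6", "12$"), (7, "p5", "17$")]

print(latest_state(full_log) == latest_state(compacted_log))  # True
```

A consumer that only cares about the current value per key therefore gets the same result from the compacted log while reading fewer records.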
Creating a Kafka Compacted Topic for Log Compaction
Log compaction is enabled through topic configuration, which you can set with Kafka's CLI (Command Line Interface) tools when you create a topic. Before creating a Kafka Log Compacted Topic, you have to set up and start the Kafka environment.
- Step 1: Initially, you have to run the Kafka server and Zookeeper instances. Execute the following commands to start the Kafka environment.
- To start the Zookeeper instance
- To start the Kafka server
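- Step 2: After executing the commands given above, the Kafka server and Zookeeper instance are running. Now, open a new command prompt and run the following command to create a log-compacted Kafka topic. Note that the --zookeeper option applies to older Kafka releases. Kafka 3.0 and later removed it from kafka-topics in favor of --bootstrap-server.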
kafka-topics --create --zookeeper zookeeper:2181 --topic latest-product-price --replication-factor 1 --partitions 1 --config "cleanup.policy=compact" --config "delete.retention.ms=100" --config "segment.ms=100" --config "min.cleanable.dirty.ratio=0.01"
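- Step 3: The above command creates a new Kafka topic named latest-product-price with a single partition. The parameter cleanup.policy=compact enables log compaction for the topic, while delete.retention.ms=100 sets how long (in milliseconds) deletion markers are retained after cleanup. The settings segment.ms=100 and min.cleanable.dirty.ratio=0.01 make Kafka roll log segments very frequently and compact the log as soon as a small fraction of it is uncompacted. Such aggressive values are useful for a quick demonstration, not for production. Now, you have to produce some records inside the newly created Kafka compacted topic to check whether the log compaction process works as expected.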
- Step 4: Now, you can start a new producer and consumer console in Kafka by executing the following commands.
- To start a producer console:
kafka-console-producer --broker-list localhost:9092 --topic latest-product-price --property parse.key=true --property key.separator=:
- To start a consumer console:
kafka-console-consumer --bootstrap-server localhost:9092 --topic latest-product-price --property print.key=true --property key.separator=: --from-beginning
- Step 5: In the newly created producer console, publish messages as key-value pairs separated by a colon (key:value), as shown in the below image.
- Step 6: The published messages are then read by the consumer console. The consumed messages will resemble the below-given image.
- Step 7: On examining the above output, you can confirm that log compaction has been applied to the topic partition. Kafka removed the old values of the duplicated keys while retaining the latest value, so that keys such as p3, p6, and p5 each have a single value inside the topic partition.
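When you try the steps above, you may still see an older value for a key now and then. Two details explain this: the log cleaner only works on closed segments, not on the active segment that is still being written, and it only picks a log for cleaning once the share of uncompacted data reaches min.cleanable.dirty.ratio (0.5 by default). The following sketch models both rules in a simplified way. The function names are hypothetical, and the byte counts and segment layout are made-up illustration values.

```python
from typing import List, Tuple

Record = Tuple[int, str, str]  # (offset, key, value)


def dirty_ratio(clean_bytes: int, dirty_bytes: int) -> float:
    """Fraction of the log that has not been compacted yet."""
    total = clean_bytes + dirty_bytes
    return dirty_bytes / total if total else 0.0


def compact_closed_segments(segments: List[List[Record]]) -> List[List[Record]]:
    """Compact every segment except the last (active) one.

    Only keys seen in closed segments are deduplicated, so a record in the
    active segment never causes an older record to be removed in this model.
    """
    if not segments:
        return []
    closed, active = segments[:-1], segments[-1]
    latest = {}
    for segment in closed:
        for offset, key, _ in segment:
            latest[key] = offset
    cleaned = [[r for r in seg if latest[r[1]] == r[0]] for seg in closed]
    return [seg for seg in cleaned if seg] + [active]


# Hypothetical sizes: 900 bytes already compacted, 100 bytes not yet.
ratio = dirty_ratio(clean_bytes=900, dirty_bytes=100)
print(ratio >= 0.01)  # True: eligible with the demo setting
print(ratio >= 0.5)   # False: not yet eligible with the default setting

segments = [
    [(0, "p3", "10$"), (1, "p5", "7$")],
    [(2, "p3", "11$"), (3, "p6", "25$")],
    [(4, "p5", "17$")],  # active segment, left untouched
]
result = compact_closed_segments(segments)
print(result)
# [[(1, 'p5', '7$')], [(2, 'p3', '11$'), (3, 'p6', '25$')], [(4, 'p5', '17$')]]
```

In this model, key p5 still appears twice after cleaning because its latest value sits in the active segment. This is why the demo topic uses a very small segment.ms: segments close quickly and become eligible for compaction.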
In this article, you have learned about Kafka Compacted Topics. This article also provided information on Apache Kafka, its key features, Kafka Topics, Partitions, and the steps for creating a Kafka Compacted Topic to implement the process of Log Compaction in Kafka in detail. However, Kafka also supports deletion-based cleanup (cleanup.policy=delete) using time-based and size-based retention, and the two cleanup policies can be combined on the same topic. With a fundamental understanding of log compaction as a base, you can later explore these options.