Are you confused about what Log Compaction is all about? It is an effective recovery strategy that also keeps your data log within a threshold size. Do you want to learn about the various strategies, such as Kafka Log Compaction, that you can use to compact your data logs? If yes, then you’ve landed at the right place!

This article aims to give you in-depth knowledge of the various strategies you can use to achieve compaction and their limitations, with a strong focus on how Kafka provides a hybrid approach through its built-in Log Compaction functionality.

What is Apache Kafka?


Apache Kafka is a popular real-time data streaming platform that allows users to store, read, and analyze streaming data using its open-source framework. Being open-source, it is available free of cost to users. Leveraging its distributed nature, users can achieve high throughput, minimal latency, and ample compute capacity, and handle large volumes of data with ease.

Key features of Kafka:

  • Scalability: Kafka has exceptional scalability and can be scaled easily without downtime.
  • Data Transformation: Kafka offers Kafka Streams (KStream) and KSQL (in the case of Confluent Kafka) for on-the-fly data transformation.
  • Fault-Tolerant: Kafka uses brokers to replicate data and persists the data to make it a fault-tolerant system.
  • Security: Kafka can be combined with various security measures like Kerberos to stream data securely.
  • Performance: Kafka is distributed, partitioned, and has a very high throughput for publishing and subscribing to the messages.

For further information on Apache Kafka, you can check the official website here.

Understanding Apache Kafka Topics & Nodes

Apache Kafka logically stores your data with the help of Kafka Topics. A “Kafka Topic” is a collection of semantically similar records, analogous to how a database table works. Each Kafka Topic stores numerous data records in the form of partitions, such that each data partition is available across separate machines, ensuring data availability and parallel access at all times.

To ensure data availability and fault-tolerant operation, Apache Kafka creates multiple copies of the data and stores them across several servers/brokers known as nodes. Each node further contains numerous partitions and replicates them across other nodes.

Apache Kafka Nodes and Topics

Each “Kafka Node” ensures successful replication of new/incoming data across the cluster by using the leader & follower concept. For each partition, the leader is the first to receive the incoming data records and is responsible not only for storing them but also for replicating them to all the followers, allowing them to save their own copies with the proper offsets. Each offset is a unique, monotonically increasing integer that identifies a record within its partition. Within a data partition, all messages are stored in sorted order, based on each message’s offset.

This is how Apache Kafka stores the data using Kafka Topics & Nodes in an arrangement known as a “write-ahead log”.
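
To see this layout for yourself, you can describe a topic and inspect its leader, replicas, and in-sync replicas (ISR) per partition. Below is a minimal sketch; the broker address and the topic name parcel-locations are hypothetical, and depending on your Kafka version you may need --zookeeper zookeeper:2181 instead of --bootstrap-server:

kafka-topics --describe --bootstrap-server localhost:9092 --topic parcel-locations
# Sample output (actual values depend on your cluster):
# Topic: parcel-locations  PartitionCount: 3  ReplicationFactor: 2  Configs:
#   Topic: parcel-locations  Partition: 0  Leader: 1  Replicas: 1,2  Isr: 1,2
#   Topic: parcel-locations  Partition: 1  Leader: 2  Replicas: 2,1  Isr: 2,1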

Why do we need Log Compaction?

Organizations widely use Kafka for event streaming, distribution, and real-time processing of messages at scale. Now imagine that this scale grows by a factor of 1,000, 10,000, or even 100,000. As this data keeps pouring in, the storage will eventually run out of space for new messages. So what should we do? The answer is Kafka Log Compaction.

Log Compaction is a mechanism by which you can selectively remove older records that have since been superseded by a more recent update with the same primary key. Applying this strategy guarantees that the Kafka log will retain at least the last message for each message key within a single topic partition.

It helps in restoring the previous state of a machine or application that crashed due to some failure, and in reloading a cache after a system restart. Using Log Compaction, Apache Kafka keeps the latest version of a record and deletes all previous versions with the same key.
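
In practice, an application restores its state by replaying the compacted topic from the beginning and keeping the last value seen for each key. A minimal command-line sketch of that replay, with the hypothetical topic name parcel-locations:

kafka-console-consumer --bootstrap-server localhost:9092 --topic parcel-locations --from-beginning --property print.key=true --property key.separator=: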

What are the Guarantees Provided by Log Compaction?

Log Compaction is a mechanism for achieving finer-grained, per-record retention rather than coarser time-based retention, and it also covers cases such as system failures and restarts. A pool of background threads handles Kafka Log Compaction by recopying the log segment files, removing older records whose key reappears in the head of the log.

Log Compaction guarantees that Kafka will retain at least the last known value for each message key within the log. Simple time-based retention works well for temporal event data such as logging, where each record stands alone. But what about other classes of data streams, such as a log of changes to keyed, mutable data?

For these, Log Compaction provides a retention mechanism that guarantees to retain at least the last update for each primary key. This guarantees that the compacted log will contain a full snapshot of the final value for every key, not just the keys that changed recently.
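
If you already have a topic and want Kafka to compact it rather than purge it purely by time, you can switch its cleanup policy. A minimal sketch; the topic name parcel-locations is hypothetical, and depending on your Kafka version you may need --zookeeper zookeeper:2181 instead of --bootstrap-server:

kafka-configs --bootstrap-server localhost:9092 --entity-type topics --entity-name parcel-locations --alter --add-config cleanup.policy=compact
# Verify the change:
kafka-configs --bootstrap-server localhost:9092 --entity-type topics --entity-name parcel-locations --describe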

Prerequisites

  • Working Knowledge of Apache Kafka.
  • A general idea about data logging.

Methods to Achieve Compaction

There are multiple ways in which you can achieve compaction for your data logs:

Method 1: Using the Traditional Method of Discarding Old Data

The traditional approach to achieving compaction is to discard your old data after some time, or once you’ve reached the storage threshold. It works well when you’re working with data associated with state changes or temporal events, such as logging or storing current locations. In Kafka terms, this is plain retention-based cleanup, as shown in the sketch below.
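
A minimal sketch of this approach using Kafka’s retention settings (the topic name parcel-locations is hypothetical; retention.ms and retention.bytes cap the log by age and size, and cleanup.policy=delete simply discards old segments):

kafka-configs --bootstrap-server localhost:9092 --entity-type topics --entity-name parcel-locations --alter --add-config cleanup.policy=delete,retention.ms=604800000,retention.bytes=1073741824
# retention.ms=604800000 keeps roughly 7 days of data; retention.bytes caps the partition at ~1 GB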

For example, parcel tracking is a perfect scenario for the traditional method: the parcel remains in transit and keeps changing its location until it reaches the customer, and the customer tracking it only cares about the current location, not the previous ones.

This is how you can use the traditional method of discarding old data to achieve compaction.

Limitations of using this method

  • Followers can’t recreate the current state if they have been out of sync for a long time, especially when the data logs associated with historical events or updates are no longer available.

Method 2: Storing Old Logs in the Compressed Format

A slightly different approach to achieving compaction is to store the old data logs in the compressed format and then make the new ones available in an easy-to-read format.

Using this technique, you can fetch the historical logs in the compressed format, uncompress them, and then recreate the state with data from its very inception. It comes in handy when you want to create an additional new node that has all the data from the very start.
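
Kafka itself offers only a loose analogue of this idea: the compression.type topic setting stores messages compressed on disk, though you still have to manage retention and replay yourself. A minimal sketch, with the hypothetical topic name parcel-locations:

kafka-configs --bootstrap-server localhost:9092 --entity-type topics --entity-name parcel-locations --alter --add-config compression.type=gzip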

This is how you can use the method of storing old data logs in the compressed format to achieve compaction.

Limitations of using this method

  • One major downside of this method is that the procedures associated with log management and with compression & decompression tend to get error-prone over time, making them tedious to maintain.

Method 3: Kafka Log Compaction, A Hybrid Approach

Kafka Log Compaction is a robust hybrid approach that ensures that you always have an effective recovery strategy in place and can manage your data log up to the size threshold with ease.

With Kafka Log Compaction in place, Kafka ensures that you always have at least the last known value for each message key in your data log. It follows a smart approach and removes only those older records that have since been superseded by a newer record with the same primary key, thereby ensuring that at least the latest state is available for every message key.

For example, if you consider the parcel location dataset, Kafka Log Compaction ensures that you retain at least the latest update for every primary key (the parcel ID), where the value is the parcel’s location. For the parcel with ID 1, the log first stores its position as City 1, and then, as the parcel moves, the value for the location changes in the data log.


The data logs store the data in such a way that the head always has the latest events/updates, whereas all the historical records are present in the tail.

Kafka Log Compaction will now keep only the latest location, that is, City 4, and remove all the older city locations for that key. Downstream followers can hence restore their state of the topic with ease, without needing to retain a complete log of every change that took place from the very start of the parcel’s transportation.


Apache Kafka further houses a “cleaner thread” that helps remove records marked for deletion. Taking the example of the parcel dataset, once a parcel reaches the customer, you need to remove it from the current Kafka Topic and then append it to the Kafka Topic associated with delivered parcels. One way to do this is to send a message with the associated key and a null value, as follows:

(parcel-ID, null)

Such a record is known as a “Tombstone” record; once the “cleaner thread” executes, it removes all earlier records with that key, and the tombstone itself is retained for delete.retention.ms before being removed as well.
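
One way to produce such a tombstone from the command line is with the third-party kcat (formerly kafkacat) tool, whose -Z flag sends empty values as NULL. A minimal sketch, assuming kcat is installed; the broker address and the topic name parcel-locations are placeholders:

# Produce a tombstone: key "parcel-1", empty value sent as NULL (-Z), key/value separated by ":" (-K:)
echo "parcel-1:" | kcat -b localhost:9092 -t parcel-locations -P -K: -Z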

This is how you can use Kafka Log Compaction, a hybrid approach to achieve compaction.

Configuration Variables Associated with Kafka Log Compaction

The following are some of the variables that you can use to configure Kafka Log Compaction (a sketch of setting the topic-level options follows this list):

  • min.compaction.lag.ms: The minimum amount of time that must pass after a message arrives in the log before it can be compacted. It acts as a lower bound on how long each message remains in the uncompacted head.
  • max.compaction.lag.ms: The maximum time a message may remain ineligible for compaction after arriving in the log; once this delay is exceeded, the log becomes eligible for compaction even if other thresholds have not yet been met.
  • log.cleaner.min.compaction.lag.ms: The broker-level default for the minimum age a message must reach before it becomes eligible for Kafka Log Compaction. It effectively bounds the length of the head in which messages are not yet ready for compaction.
  • delete.retention.ms: The amount of time a record marked for deletion (a tombstone) remains present in the partition. Any consumer that is lagging by more than delete.retention.ms may end up missing the delete markers.
  • auto.leader.rebalance.enable: Enables the automatic leader rebalancing mechanism once the leader imbalance ratio exceeds the threshold (leader.imbalance.per.broker.percentage).
  • log.flush.interval.messages: The number of data messages that accumulate on a log partition before they are flushed to disk.
  • log.flush.interval.ms: The maximum amount of time (in milliseconds) a message in a topic stays in memory before it is flushed to disk.
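
For illustration, here is a minimal sketch of applying the topic-level compaction settings with kafka-configs. The topic name parcel-locations is hypothetical, and depending on your Kafka version you may need --zookeeper zookeeper:2181 instead of --bootstrap-server:

# Keep records uncompacted for at least 1 minute, force compaction within 24 hours, and retain tombstones for 24 hours
kafka-configs --bootstrap-server localhost:9092 --entity-type topics --entity-name parcel-locations --alter --add-config min.compaction.lag.ms=60000,max.compaction.lag.ms=86400000,delete.retention.ms=86400000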

What is a Log Compacted Topic?

Kafka deletes the older version of a record whenever a newer version with the same key is available in the partition log. For example, consider a partition of a log compacted topic called latest-product-price.

It has two records with the same key. Because it’s a log compacted topic, Kafka will remove the older record in a background thread. Now consider a producer sending new records to the partition with the existing keys; the background thread will again remove the older records with the same keys, keeping only the latest value for each.

The compacted log in Kafka Log Compaction consists of two parts: a tail and a head. Kafka Log Compaction ensures that the keys in the tail are unique, because the tail was scanned in the previous cycle of the cleaning process. The head, however, can still contain duplicate keys.

Creating a Log Compacted Topic

Now that you have a good understanding of what a Log Compacted Topic is, let’s learn how to create one. Follow the steps listed below:

First, create a log compacted topic using the following command:

kafka-topics --create --zookeeper zookeeper:2181 --topic latest-product-price --replication-factor 1 --partitions 1 --config "cleanup.policy=compact" --config "delete.retention.ms=100" --config "segment.ms=100" --config "min.cleanable.dirty.ratio=0.01"
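
To confirm that the topic picked up the compaction settings, you can describe it (a sketch; on newer Kafka versions, pass --bootstrap-server localhost:9092 instead of --zookeeper):

kafka-topics --describe --zookeeper zookeeper:2181 --topic latest-product-price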

Now, let’s produce some records.

kafka-console-producer --broker-list localhost:9092 --topic latest-product-price --property parse.key=true --property key.separator=:
>p3:10$
>p5:7$
>p3:11$
>p6:25$
>p6:12$
>p5:14$
>p5:17$

In the commands above, keys and values are separated by a colon (:).

Then, consume the topic using the following command:

kafka-console-consumer --bootstrap-server localhost:9092 --topic latest-product-price --property print.key=true --property key.separator=: --from-beginning
p3:11$
p6:12$
p5:14$
p5:17$

Here, the old records with duplicate keys have been removed. The exception is p5:14$, which is not removed because it is still part of the head of the log, which has not yet been cleaned.

What are Apache Kafka Partition Segments?

Apache Kafka divides partition logs into segment files. A partition log is an abstraction that lets users consume ordered messages inside the partition without worrying about the internal storage. Segments are files stored on the file system (inside a directory named after the topic and partition), with file names ending in .log.

A partition log has older segments at the beginning and an active segment at the end. The first offset of a segment is known as its base offset, and the segment’s file name is always equal to its base offset value. Only the active segment can receive newly produced messages.
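
You can see these segment files on a broker’s disk. A minimal sketch, assuming the default data directory log.dirs=/tmp/kafka-logs and partition 0 of the latest-product-price topic created above:

ls /tmp/kafka-logs/latest-product-price-0/
# Typical contents: 00000000000000000000.log (segment named after its base offset),
# plus matching .index and .timeindex files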

Conclusion

This article introduced you to the concept of Log Compaction. It provided in-depth knowledge about the various methods that you can use to achieve compaction, with a strong focus on Kafka Log Compaction. If you’re looking for an all-in-one solution that will not only help you transfer data but also transform it into an analysis-ready form, then Hevo Data is the right choice for you! It will take care of all your analytics needs in a completely automated manner, allowing you to focus on key business activities.

Tell us about your experience of learning about Kafka Log Compaction in the comments section below!

Sarthak Bhardwaj
Customer Experience Engineer, Hevo

Sarthak is a skilled professional with over 2 years of hands-on experience in JDBC, MongoDB, REST API, and AWS. His expertise has been instrumental in driving Hevo's success, where he excels in adept problem-solving and superior issue management. Sarthak's technical proficiency and strategic approach have consistently contributed to optimizing operations and ensuring seamless performance, making him a vital asset to the team.
