Kafka Log Compaction: A Comprehensive Guide

on Data Integration, Tutorials • November 11th, 2020 • Write for Hevo

Are you confused about what Log Compaction is all about? Kafka Log Compaction is an effective recovery strategy that also lets you manage your data log up to a size threshold. Do you want to learn about the strategies, including Kafka Log Compaction, that you can use to achieve compaction for your data logs? If yes, then you’ve landed at the right place! This article aims to provide in-depth knowledge about the strategies you can use to achieve compaction and their limitations, with a strong focus on how Kafka provides a hybrid approach through its Log Compaction functionality.

What is Apache Kafka

Apache Kafka is a popular real-time data streaming software that allows users to store, read and analyze streaming data using its open-source framework. Being open-source, it is available free of cost to users. Leveraging its distributed nature, users can achieve high throughput, minimal latency, computation power, etc., and handle large volumes of data with ease.

Written in Scala and Java, Apache Kafka supports bringing in data from a large variety of sources and stores it in the form of “topics” by processing the information stream. It relies on two kinds of clients: Producers, which act as an interface between the data source and Kafka Topics, and Consumers, which allow users to read and transfer the data stored in Kafka.

Key features of Kafka:

  • Scalability: Kafka has exceptional scalability and can be scaled easily without downtime.
  • Data Transformation: Kafka offers Kafka Streams (KStream) and KSQL (in the case of Confluent Kafka) for on-the-fly data transformation.
  • Fault-Tolerant: Kafka uses brokers to replicate data and persists the data to make it a fault-tolerant system.
  • Security: Kafka can be combined with various security measures like Kerberos to stream data securely.
  • Performance: Kafka is distributed, partitioned, and has a very high throughput for publishing and subscribing to the messages.

For further information on Apache Kafka, you can check the official website here.

Replicate Kafka Data Using Hevo’s Data Pipelines

Hevo Data can be your go-to tool if you’re looking for Data Replication from 100+ Data Sources (including 40+ Free Data Sources) like Kafka into Redshift, Databricks, Snowflake, and many other databases and warehouse systems. Using Hevo you can load real-time data streams to the destination. To further streamline and prepare your data for analysis, you can process and enrich data streams using Hevo’s robust & built-in Transformation Layer without writing a single line of code!

Get Started with Hevo for Free

With Hevo in place, you can reduce your Data Extraction, Cleaning, Preparation, and Enrichment time & effort manyfold! In addition, Hevo’s native integration with BI & Analytics Tools will empower you to mine your replicated data to get actionable insights. With Hevo as one of the best Kafka Replication tools, replication of data becomes easier.

Try our 14-day full access free trial today!

Understanding Apache Kafka Topics & Nodes

Apache Kafka logically stores your data with the help of Kafka Topics. A “Kafka Topic” is a collection of semantically similar records, analogous to how a database table works. Each Kafka Topic stores numerous data records in the form of partitions, such that each data partition is available across separate machines, ensuring data availability and parallel access at all times.

To ensure data availability and fault-tolerant operation, Apache Kafka creates multiple copies of the data and stores them across several servers (brokers), known as nodes. Each node contains numerous partitions and replicates them across other nodes.

Kafka Log Compaction: Apache Kafka Nodes and Topics | Hevo Data

Each “Kafka Node” ensures a successful replication of the new/incoming data across all data nodes by using the leader & follower concept. The leader is the first one to receive the incoming data records and is responsible not only for storing them but also for replicating them to all the followers, allowing them to save their copies with the proper offset. Each offset is a unique integer value that identifies a record within its partition. Within a data partition, all messages are stored in order, based on each message’s offset.
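The leader and follower idea can be pictured with a small in-memory model. The sketch below is only an illustration of offsets and replication, not Kafka's storage format or replication protocol; the `PartitionLog` class and `replicate` helper are hypothetical names introduced here.

```python
class PartitionLog:
    """A toy append-only log for one partition (illustrative only)."""

    def __init__(self):
        self.records = []  # the list index doubles as the offset

    def append(self, key, value):
        offset = len(self.records)  # the next offset is the current log length
        self.records.append((key, value))
        return offset

    def read_from(self, offset):
        """Return (offset, key, value) tuples starting at the given offset."""
        return [(i, k, v) for i, (k, v) in enumerate(self.records) if i >= offset]


def replicate(leader, follower):
    """Copy every record the follower is missing, preserving the leader's offsets."""
    for offset, key, value in leader.read_from(len(follower.records)):
        assert follower.append(key, value) == offset


leader, follower = PartitionLog(), PartitionLog()
leader.append("parcel-1", "City 1")
leader.append("parcel-2", "City 3")
replicate(leader, follower)
leader.append("parcel-1", "City 2")
replicate(leader, follower)
print(follower.read_from(0))
```

Because the follower always asks for records starting at its own log length, it ends up with the same records at the same offsets as the leader.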

This is how Apache Kafka stores the data using Kafka Topics & Nodes in an arrangement known as a “write-ahead log”.

Simplify your data analysis with Hevo’s No-code Data Pipelines

Hevo Data, a No-code Data Pipeline, helps to load data from Kafka (among 100+ sources) and load it in a data warehouse of your choice to visualize it in your desired BI tool. Hevo is fully-managed, and completely automates the process of not only loading data from your desired source but also enriching the data and transforming it into an analysis-ready form without having to write a single line of code. Its fault-tolerant architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

It provides a consistent & reliable solution to manage data in real-time and always have analysis-ready data in your desired destination. It allows you to focus on key business needs and perform insightful analysis using BI tools. 

Check out what makes Hevo amazing:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.

Simplify your data analysis with Hevo today! Sign up here for a 14-Day Free Trial!

Why do we need Log Compaction?

Organizations widely use Kafka for event streaming, distribution, and real-time processing of messages at scale. Now imagine that this scale grows by a factor of 1,000, 10,000, or even 100,000. The data keeps pumping in, and the storage will eventually run out of space for more messages. So what should we do? The answer is Kafka Log Compaction.

Log Compaction is a mechanism by which you can selectively remove records for which a more recent update with the same primary key exists. Applying this strategy guarantees that the Kafka log retains at least the last message for each message key within a single topic partition.

Kafka Log Compaction helps in restoring the previous state of a machine or application that crashed due to some failure, for example by reloading a cache on system restart. With Log Compaction, Apache Kafka keeps the latest version of a record and deletes all previous versions with the same key.

What are the Guarantees Provided by Log Compaction?

Log Compaction is a mechanism for achieving finer-grained, per-record retention rather than coarser time-based retention, and it also covers cases such as system failures and restarts. A pool of background threads handles Kafka Log Compaction by recopying the log segment files and removing records whose key appears again in the head of the log.

Log Compaction guarantees that Kafka will retain at least the last known value for each message key within the log. Simple time-based or size-based retention works well for temporal event data such as logging, where each record stands alone. But what about other classes of data streams, such as a log of changes to keyed, mutable data?

For such data, Kafka Log Compaction offers a more granular retention mechanism that guarantees to retain at least the last update for each primary key. This guarantees that the compacted log contains a full snapshot of the final value for every key, not just the keys that changed recently.
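To make the guarantee concrete, here is a minimal sketch in Python. It models a log as a list of (key, value) pairs and is not Kafka's implementation; the `compact` and `replay` helpers and the sample keys are assumptions made for illustration.

```python
def compact(log):
    """Keep only the last record for each key, preserving offset order."""
    last_offset = {key: offset for offset, (key, _) in enumerate(log)}
    return [(k, v) for offset, (k, v) in enumerate(log) if last_offset[k] == offset]


def replay(log):
    """Rebuild the key -> value state by applying records in order."""
    state = {}
    for key, value in log:
        state[key] = value
    return state


# Hypothetical change log for keyed, mutable data.
changes = [("user-1", "a@x.io"), ("user-2", "b@x.io"), ("user-1", "c@x.io")]
compacted = compact(changes)
print(compacted)
print(replay(compacted) == replay(changes))
```

Replaying the compacted log yields the same final state as replaying the full log, which is the property that lets a restarted application rebuild its state from a compacted topic.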

Prerequisites

  • Working Knowledge of Apache Kafka.
  • A general idea about data logging.

Methods to Achieve Compaction

There are multiple ways in which you can achieve compaction for your data logs:

Method 1: Using the Traditional Method of Discarding Old Data

The traditional approach of achieving compaction is by discarding your old data after some time or once you’ve reached the storage threshold. It works exceptionally well when you’re working with data associated with state changes or temporal events, such as logging, storing current locations, etc.

For example, parcel tracking is a perfect scenario to use the traditional method, as the parcel remains in transit and keeps on changing its location until it reaches the customer, who’s keeping track of their parcel, without worrying about its previous location.

This is how you can use the traditional method of discarding old data to achieve compaction.
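As a rough sketch of this approach, the Python snippet below drops records by age and by count. The `discard_old` function, its parameters, and the sample timestamps are hypothetical and only model the idea, not any Kafka API.

```python
def discard_old(log, now_ms, retention_ms=None, max_records=None):
    """Drop records older than retention_ms, then keep only the newest max_records."""
    kept = list(log)
    if retention_ms is not None:
        # Each record is (timestamp_ms, key, value); keep the ones inside the window.
        kept = [r for r in kept if now_ms - r[0] <= retention_ms]
    if max_records is not None:
        kept = kept[-max_records:] if max_records > 0 else []
    return kept


log = [
    (1_000, "parcel-1", "City 1"),
    (2_000, "parcel-1", "City 2"),
    (3_000, "parcel-2", "City 7"),
    (4_000, "parcel-1", "City 3"),
]
print(discard_old(log, now_ms=4_500, retention_ms=2_000))
print(discard_old(log, now_ms=4_500, retention_ms=1_000))
```

With the shorter window in the second call, every record for parcel-2 is gone, so a reader that starts from this log can no longer learn that parcel's location.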

Limitations of using this method

  • Followers can’t recreate the current state if they have been out of sync for a long time, especially when the data logs associated with historical events or updates are no longer available.

What makes Hevo’s Data Replication Experience Best in Class?

Replicating data can be a tiresome task without the right set of tools. Hevo’s Data Replication & Integration platform empowers you with everything you need to have a smooth Data Collection, Processing, and Replication experience. Our platform has the following in store for you!

  • Exceptional Security: A Fault-tolerant Architecture that ensures Zero Data Loss.
  • Built to Scale: Exceptional Horizontal Scalability with Minimal Latency for Modern-data Needs.
  • Built-in Connectors: Support for 100+ Data Sources, including Kafka, Databases, SaaS Platforms, Files & More. Native Webhooks & REST API Connector available for Custom Sources.
  • Data Transformations: Best-in-class & Native Support for Complex Data Transformation at fingertips. Code & No-code Flexibility ~ designed for everyone.
  • Smooth Schema Mapping: Fully-managed Automated Schema Management for incoming data with the desired destination.
  • Blazing-fast Setup: Straightforward interface for new customers to work on, with minimal setup time.

Method 2: Storing Old Logs in the Compressed Format

A slightly different approach to achieving compaction is to store the old data logs in the compressed format and then make the new ones available in an easy-to-read format.

Using this technique, you can fetch the historical logs in the compressed format, uncompress them, and then recreate the state with data from its very inception. It comes in handy when you want to create an additional node that has all the data from the very start.

This is how you can use the method of storing old data logs in the compressed format to achieve compaction.
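A minimal sketch of this idea, using Python's standard `gzip` and `json` modules, is shown below. The JSON-lines layout and the `archive` and `restore` helpers are assumptions chosen for illustration, not a format that Kafka uses.

```python
import gzip
import json


def archive(records):
    """Serialize old records as JSON lines and gzip them into a bytes blob."""
    payload = "\n".join(json.dumps(r) for r in records)
    return gzip.compress(payload.encode("utf-8"))


def restore(blob):
    """Decompress an archived blob back into a list of [key, value] records."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]


old = [["parcel-1", "City 1"], ["parcel-1", "City 2"]]
new = [["parcel-1", "City 3"]]
blob = archive(old)

# A new node rebuilds its state from the archived history plus the recent records.
state = {}
for key, value in restore(blob) + new:
    state[key] = value
print(state)
```

Even in this small form, the archive and restore steps must stay in sync on format and encoding, which hints at why such procedures become tedious to manage over time.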

Limitations of using this method

  • One major downside of using this method is that the procedures associated with log management, compression, and decompression tend to become error-prone with time and are hence tedious to manage.

Method 3: Kafka Log Compaction, A Hybrid Approach

Kafka Log Compaction is a robust hybrid approach that ensures that you always have an effective recovery strategy in place and can manage your data log up to the size threshold with ease.

With Kafka Log Compaction in place, Kafka ensures that you always have at least the last known value for each message key in your data log. It follows a smart approach and removes only those records that have been superseded by a more recent update with the same primary key. Kafka Log Compaction thereby ensures that at least the latest state is available for every message key.

For example, if you consider the parcel location dataset, Kafka Log Compaction ensures that you retain at least the latest update for every primary key (here, the parcel ID, with the location as the value). You can see from the image below that for the parcel with ID 1, the logs store its position as City 1 at first, and then, as the parcel moves, the value for the location changes in the data log.

Kafka Log Compaction | Hevo Data

The data logs store the data in such a way that the head always has the latest events/updates, whereas all the historical records are present in the tail.

The Kafka Log Compaction will now keep the latest location only, that is City 4, and remove all other city locations. Downstream followers can hence restore their state of the topic with ease, without the need to retain a complete data log of changes that took place from the very start of parcel transportation.

Kafka Log Compaction: Final Kafka Log for Parcel dataset | Hevo Data

Apache Kafka further runs a “cleaner thread” that removes the records marked for deletion. Taking the example of the parcel dataset, once the parcel reaches the customer, you need to remove it from the current Kafka Topic and then append it to the Kafka Topic associated with delivered customer parcels. One way to remove it is to send a message with the associated key and a null value, as follows:

(parcel-ID, null)

Such a record is known as a “Tombstone” record. Once the “cleaner thread” executes, it removes the older records for that key, and the tombstone itself is removed after the configured retention period (see delete.retention.ms).
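The effect of a tombstone can be sketched as follows. This is a simplified Python model in which `None` stands in for the null value; the `compact_with_tombstones` function and its `drop_tombstones` flag are hypothetical and only approximate the two stages a tombstone goes through.

```python
def compact_with_tombstones(log, drop_tombstones=False):
    """Keep the last record per key; optionally drop keys whose last value is None."""
    last_offset = {key: offset for offset, (key, _) in enumerate(log)}
    survivors = [(k, v) for offset, (k, v) in enumerate(log) if last_offset[k] == offset]
    if drop_tombstones:
        # Models a later cleaning pass, after the tombstone's retention time has passed.
        survivors = [(k, v) for k, v in survivors if v is not None]
    return survivors


log = [
    ("parcel-1", "City 1"),
    ("parcel-2", "City 5"),
    ("parcel-1", "City 4"),
    ("parcel-1", None),  # tombstone: parcel-1 has been delivered
]
first_pass = compact_with_tombstones(log)
later_pass = compact_with_tombstones(log, drop_tombstones=True)
print(first_pass)
print(later_pass)
```

The first pass still carries the tombstone so that readers can observe the deletion, while the later pass removes every trace of the key.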

This is how you can use Kafka Log Compaction, a hybrid approach to achieve compaction.

Configuration Variables Associated with Kafka Log Compaction

The following are some of the variables that you can use to configure Kafka Log Compaction:

  • min.compaction.lag.ms: It represents the minimum amount of time that must pass after a new message arrives in the log before Kafka Log Compaction can remove it. It acts as the lower bound of how long a message remains in the head.
  • max.compaction.lag.ms: It represents the maximum amount of time a message can remain ineligible for Kafka Log Compaction after it arrives in the log.
  • log.cleaner.min.compaction.lag.ms: It is the broker-level setting for the minimum age that data messages must reach before they become eligible for Kafka Log Compaction. It indicates the length of the head, within which data messages are not yet ready for Kafka Log Compaction.
  • delete.retention.ms: It represents the amount of time a record marked for deletion (a tombstone) remains in the partition. Any consumer that lags by more than “delete.retention.ms” may miss the delete markers.
  • auto.leader.rebalance.enable: It is responsible for enabling the automatic leader balancing mechanism once the leader imbalance ratio exceeds the threshold (leader.imbalance.per.broker.percentage).
  • log.flush.interval.messages: It represents the total number of data messages that accumulate on a log partition before they are flushed to disk.
  • log.flush.interval.ms: It represents the maximum amount of time (in milliseconds) that any message in a topic stays in memory before it is flushed to disk.
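The interplay between the two lag settings can be pictured with a small helper. The sketch below is a simplified model based only on a message's age; the `compaction_status` function, its return labels, and its default arguments are assumptions for illustration and do not mirror the cleaner's actual scheduling logic.

```python
def compaction_status(age_ms, min_lag_ms=0, max_lag_ms=None):
    """Classify a message by age against min.compaction.lag.ms and max.compaction.lag.ms."""
    if age_ms < min_lag_ms:
        return "protected"  # too young: stays uncompacted in the head
    if max_lag_ms is not None and age_ms >= max_lag_ms:
        return "overdue"  # older than the upper bound: compaction should not wait longer
    return "eligible"  # may be compacted whenever the cleaner next runs


# Hypothetical settings: a 1-minute minimum lag and a 10-minute maximum lag.
for age in (5_000, 120_000, 900_000):
    print(age, compaction_status(age, min_lag_ms=60_000, max_lag_ms=600_000))
```

Under these example settings, a 5-second-old message is protected, a 2-minute-old message is eligible, and a 15-minute-old message is overdue.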

What is a Log Compacted Topic?

Kafka deletes the older version of a record whenever a newer version is available with the same key in the partition log. For example, consider a partition of a log compacted topic called latest-product-price.

Suppose it has two records with the same key. Because it’s a log compacted topic, Kafka will remove the older record in a background thread. Now consider a producer sending new records to the partition with the existing keys. The background thread will again remove the older records with the same keys and keep the new records.

The compacted log in Kafka Log Compaction consists of two parts: a tail and a head. Kafka Log Compaction ensures that every key in the tail is unique because the tail was scanned in the previous cycle of the cleaning process. The head, however, can contain duplicate keys.
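One way to reason about when the head is worth cleaning is the ratio of uncleaned data to the whole log, which is what the min.cleanable.dirty.ratio topic setting controls. The sketch below is a simplification that counts records, whereas Kafka measures this in bytes; the `dirty_ratio` and `should_clean` helpers are hypothetical names.

```python
def dirty_ratio(tail_count, head_count):
    """Fraction of the log that sits in the not-yet-cleaned head."""
    total = tail_count + head_count
    return head_count / total if total else 0.0


def should_clean(tail_count, head_count, min_cleanable_dirty_ratio):
    """Clean only once the head makes up a large enough share of the log."""
    return dirty_ratio(tail_count, head_count) > min_cleanable_dirty_ratio


# Hypothetical partition: 90 records already cleaned (tail), 10 new records (head).
print(dirty_ratio(90, 10))
print(should_clean(90, 10, min_cleanable_dirty_ratio=0.01))
print(should_clean(90, 10, min_cleanable_dirty_ratio=0.5))
```

A low threshold makes the cleaner run more eagerly at the cost of more recopying, while a higher threshold lets more duplicates build up in the head before a pass is triggered.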

Creating a Log Compacted Topic

Now that you have a good understanding of the Log Compacted Topic, let’s learn how to create one. The steps are listed below:

First, create a log compacted topic using the following command:

kafka-topics --create --zookeeper zookeeper:2181 --topic latest-product-price --replication-factor 1 --partitions 1 --config "cleanup.policy=compact" --config "delete.retention.ms=100" --config "segment.ms=100" --config "min.cleanable.dirty.ratio=0.01"

Now, let’s produce some records.

kafka-console-producer --broker-list localhost:9092 --topic latest-product-price --property parse.key=true --property key.separator=:
>p3:10$
>p5:7$
>p3:11$
>p6:25$
>p6:12$
>p5:14$
>p5:17$

In the above input, each key and value is separated by a colon.

Then, consume the topic using the following command:

kafka-console-consumer --bootstrap-server localhost:9092 --topic latest-product-price --property print.key=true --property key.separator=: --from-beginning
p3:11$
p6:12$
p5:14$
p5:17$

Here, the old records with duplicate keys have been removed. The exception is p5:14$, which survives alongside p5:17$ because the cleaner has not yet processed the head of the log, where duplicate keys can still appear.
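The consumer output above can be reproduced with a small simulation. The Python sketch below assumes that only the final record was still in the active segment when the cleaner ran; that split point is a guess consistent with the output, and `compact_except_active` is a hypothetical helper rather than Kafka code.

```python
def compact_except_active(records, active_start):
    """Compact records before active_start; leave the active segment untouched."""
    cleanable, active = records[:active_start], records[active_start:]
    last_offset = {key: offset for offset, (key, _) in enumerate(cleanable)}
    kept = [(k, v) for offset, (k, v) in enumerate(cleanable) if last_offset[k] == offset]
    return kept + active


produced = [
    ("p3", "10$"), ("p5", "7$"), ("p3", "11$"), ("p6", "25$"),
    ("p6", "12$"), ("p5", "14$"), ("p5", "17$"),
]

# Assumption: the last record (offset 6) is the only one in the active segment.
result = compact_except_active(produced, active_start=6)
for key, value in result:
    print(f"{key}:{value}")
```

Under that assumption the script prints p3:11$, p6:12$, p5:14$, and p5:17$, matching the consumer output: p5:14$ is the newest p5 record in the cleaned portion, and p5:17$ has not been cleaned yet.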

What are Apache Kafka Partition Segments?

Apache Kafka divides partition logs into segment files. A partition log is an abstraction that makes it easier for users to consume ordered messages inside the partition without worrying about internal storage. Segments are files stored in the partition’s directory on the file system, with names ending in .log.

A partition log has a base segment at the beginning and an active segment at the end. The first offset in a segment is known as its base offset, and the segment’s file name is always equal to its base offset value. Only the active segment can receive newly produced messages.
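The naming rule can be sketched in a few lines of Python. Kafka zero-pads the base offset to 20 digits in the file name; the `segment_file_name` and `find_segment` helpers and the sample base offsets below are hypothetical.

```python
import bisect


def segment_file_name(base_offset):
    """Segment files are named after their base offset, zero-padded to 20 digits."""
    return f"{base_offset:020d}.log"


def find_segment(base_offsets, offset):
    """Return the base offset of the segment holding `offset` (list sorted ascending)."""
    index = bisect.bisect_right(base_offsets, offset) - 1
    if index < 0:
        raise ValueError(f"offset {offset} is before the first segment")
    return base_offsets[index]


# Hypothetical partition with three segments; the last one is the active segment.
base_offsets = [0, 1500, 3200]
print(segment_file_name(find_segment(base_offsets, 1723)))
print(segment_file_name(base_offsets[-1]))
```

Because the file names sort in offset order, locating the segment for a given offset reduces to a binary search over the base offsets.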

Conclusion

This article introduced you to the concept of Log Compaction. It provided in-depth knowledge about the various methods that you can use to achieve compaction, with a strong focus on Kafka Log Compaction. If you’re looking for an all-in-one solution that will not only help you transfer data but also transform it into analysis-ready form, then Hevo Data is the right choice for you! It will take care of all your analytics needs in a completely automated manner, allowing you to focus on key business activities.

Visit our Website to Explore Hevo

Want to take Hevo for a spin? Sign up here for a 14-Day Free Trial and experience the feature-rich Hevo suite first hand. You can also have a look at our unbeatable pricing that will help you choose the right plan for your business needs!

Tell us about your experience of learning about Kafka Log Compaction. Let us know in the comments section below!
