Apache Kafka is a well-known open-source Data Streaming Platform that enables high-throughput Data Pipelines.
Because Kafka is a critical component of the IT infrastructure, you need a dedicated tool or feature to track its operations and their efficiency.
Kafka Metrics help you monitor critical tasks while operating Kafka or Kafka Applications, not only to troubleshoot problems that have already occurred but also to discover anomalous behavior patterns and prevent problems from occurring in the first place.
When adopting Kafka, developers have access to a plethora of Kafka Metrics, but this abundance can be daunting, making it difficult to know where to begin and how to correctly use them.
Kafka provides Monitoring Tools that gather the performance metrics you need to identify bottlenecks and to see which of them require corrective action.
This blog will introduce you to some of the most important Kafka Metrics that you should use to achieve peak performance.
Hevo Data is a No-code Data Pipeline that offers a fully managed solution to set up data integration from Apache Kafka and 100+ Data Sources (including 30+ Free Data Sources) and will let you directly load data to a Data Warehouse or the destination of your choice.
It will automate your data flow in minutes without you writing a single line of code. Its fault-tolerant architecture makes sure that your data is secure and consistent.
Hevo provides you with a truly efficient and fully automated solution to manage data in real-time and always have analysis-ready data.
Kafka Metrics to Monitor for Optimal Performance
Metrics record a value about your systems at a particular moment in time, such as the number of people currently checking out their carts on a website.
To track how a system performs over time, metrics are typically collected once per second, once per minute, or at another regular interval.
Kafka Metrics quantify how effectively a component performs its function, e.g., network latency. A well-functioning Kafka Cluster can manage a large volume of data.
It is critical to monitor the health of your Kafka deployment in order to ensure that the apps that rely on it continue to run reliably.
Here are some of the key Kafka Metrics that can help users gauge the performance of the data streaming processes.
In general, the biggest barrier to Kafka’s performance is disk throughput.
If you are delivering messages across Data Centers, if your topics have a high number of consumers, or if your replicas are catching up to their leaders, network throughput can have an impact on Kafka’s performance.
Tracking network performance on your brokers provides extra information about potential bottlenecks and might help you decide whether or not to implement end-to-end message compression.
Under-Replicated Partitions
You can select a replication factor per topic as needed to ensure data durability and to keep data available even when a broker is down.
This ensures that data is duplicated across several brokers and remains available for processing even if one broker fails.
The UnderReplicatedPartitions metric reports the number of partitions that have fewer in-sync replicas than the configured replication factor, which usually means that one or more brokers are down or falling behind.
As a general rule, there should be no under-replicated partitions in a running Kafka deployment (the number should always be zero), making this a critical Kafka Monitoring Metric.
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
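To make the idea concrete, here is a minimal Python sketch of how an under-replicated partition count could be derived from partition state. The `snapshot` data, topic names, and broker IDs are made up for illustration, and the function is not Kafka's own implementation; it only mirrors the rule that a partition is under-replicated when its in-sync replica set is smaller than its assigned replica set.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshot: (topic, partition) -> (assigned replicas, in-sync replicas).
# In a real cluster this information would come from topic metadata.
PartitionState = Dict[Tuple[str, int], Tuple[List[int], List[int]]]


def under_replicated(state: PartitionState) -> List[Tuple[str, int]]:
    """Return partitions whose in-sync replica set is smaller than the assigned set."""
    return sorted(
        tp for tp, (replicas, isr) in state.items() if len(isr) < len(replicas)
    )


snapshot: PartitionState = {
    ("orders", 0): ([1, 2, 3], [1, 2, 3]),
    ("orders", 1): ([2, 3, 1], [2, 3]),      # broker 1 has fallen out of the ISR
    ("payments", 0): ([3, 1, 2], [3, 1, 2]),
}

lagging = under_replicated(snapshot)
print(f"under-replicated partitions: {len(lagging)} {lagging}")
```

An alert on this value would normally fire whenever the count stays above zero for longer than a planned broker restart would explain.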
Leader Election Rate
When a Partition Leader dies, an election is called to choose a replacement leader. A Partition Leader is considered “dead” if it fails to maintain its ZooKeeper session.
LeaderElectionRateAndTimeMs reports the rate of leader elections (per second) as well as the total time the cluster spent without a leader (in milliseconds).
It is important to note that a leader election is triggered when contact with the existing leader is lost, which may mean that a broker has gone offline.
Network Request Rate
Kafka Brokers can be sources of significant network traffic since their objective is to collect and transport data for processing.
By analyzing the number of network requests per second, you may monitor and compare the network throughput per server in each of your Data Centers and Cloud providers.
When a broker’s network bandwidth exceeds or falls below a certain threshold, it might signal the need to increase the number of brokers or that some issue is creating a delay.
It may also signal the need to create (or change) a consumer back-off protocol to better handle the data rates.
kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower}
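As a rough illustration of the threshold check described above, the following Python sketch turns two samples of a cumulative request counter into a per-second rate and tests it against a band. The `CounterSample` type, the function names, and the threshold values are all assumptions made for this example; they are not part of Kafka.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    timestamp_s: float  # when the sample was taken, in seconds
    count: int          # cumulative number of requests seen so far


def requests_per_second(prev: CounterSample, curr: CounterSample) -> float:
    """Average request rate between two samples of a cumulative counter."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    if curr.count < prev.count:
        # A shrinking counter usually means the process restarted.
        raise ValueError("counter went backwards; discard this pair of samples")
    return (curr.count - prev.count) / elapsed


def outside_band(rate: float, low: float, high: float) -> bool:
    """True when the rate falls below `low` or rises above `high`."""
    return rate < low or rate > high


rate = requests_per_second(CounterSample(0.0, 1_000), CounterSample(10.0, 6_000))
print(rate, outside_band(rate, low=100.0, high=400.0))
```

Sensible `low` and `high` values depend entirely on your own baseline traffic, so they are best chosen from historical data rather than fixed up front.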
Response Rate
The response rate shows the rate of responses that producers receive from brokers. Brokers respond to producers once the data has been received.
Depending on your settings, “received” might mean one of 3 things:
- The message was received but did not get committed (request.required.acks == 0).
- The message has been written to disk by the leader (request.required.acks == 1).
- The leader has received confirmation from all replicas that the data has been written to disk (request.required.acks == all).
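The three cases can be summarized in a small Python sketch. This is a toy model of the list above, not broker code: the function name and its boolean parameters are invented for illustration.

```python
def counts_as_received(acks: str, leader_written: bool, replicas_written: bool) -> bool:
    """Toy model: when does a message count as "received" for a given acks level?"""
    if acks == "0":
        # The producer does not wait for the write to be committed.
        return True
    if acks == "1":
        # The leader must have written the message.
        return leader_written
    if acks == "all":
        # The leader and all replicas must have written the message.
        return leader_written and replicas_written
    raise ValueError(f"unsupported acks value: {acks!r}")


for level in ("0", "1", "all"):
    print(level, counts_as_received(level, leader_written=True, replicas_written=False))
```

The stricter the setting, the later the response arrives, which is why the response rate has to be read together with the acknowledgement level in use.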
I/O Wait Time
In general, producers do one of 2 things: wait for data or send data. When producers generate more data than they can transfer, they must wait for network resources.
However, if producers aren’t rate-limited or using all of their available bandwidth, identifying the bottleneck becomes more difficult.
Checking I/O wait times on your producers is a smart place to start because disk access is typically the slowest part of any processing activity.
In this case, I/O wait is the percentage of time the CPU sat idle while waiting for I/O to complete. Excessive wait times indicate that your producers are unable to obtain the data they need in a timely fashion.
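One common way to obtain an I/O wait percentage on a host is to compare two snapshots of cumulative CPU time counters. The Python sketch below assumes a simplified set of counters (real systems such as Linux expose more fields), and the `CpuTimes` type and sample numbers are hypothetical.

```python
from typing import NamedTuple


class CpuTimes(NamedTuple):
    """Simplified cumulative CPU time counters, in arbitrary ticks."""
    user: int
    nice: int
    system: int
    idle: int
    iowait: int


def iowait_percent(prev: CpuTimes, curr: CpuTimes) -> float:
    """Share of elapsed CPU time spent idle while waiting for I/O, as a percentage."""
    total_delta = sum(curr) - sum(prev)
    if total_delta <= 0:
        raise ValueError("snapshots must be taken in increasing time order")
    return 100.0 * (curr.iowait - prev.iowait) / total_delta


before = CpuTimes(user=100, nice=0, system=50, idle=800, iowait=50)
after = CpuTimes(user=150, nice=0, system=70, idle=1500, iowait=280)
print(f"iowait: {iowait_percent(before, after):.1f}%")
```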
Records Lag
Records lag is the estimated difference between a consumer’s current log offset and a producer’s current log offset.
Records lag max is the maximum observed value of records lag. The significance of these values depends entirely on what your consumers do.
If your consumers back up old messages to long-term storage, you might expect considerable records lag.
However, if your consumers are processing real-time data, continuously high lag values might indicate that they are overloaded, in which case both adding consumers and spreading topics across more partitions could help increase throughput and reduce lag.
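The arithmetic behind records lag can be sketched in a few lines of Python. The offsets below are invented, and the function is only an illustration of the definition (latest produced offset minus the consumer's position, per partition), not the consumer client's actual implementation.

```python
from typing import Dict


def records_lag(log_end_offsets: Dict[int, int], consumer_offsets: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: latest produced offset minus the consumer's current offset."""
    return {
        partition: max(0, end - consumer_offsets.get(partition, 0))
        for partition, end in log_end_offsets.items()
    }


# Hypothetical offsets for a three-partition topic.
ends = {0: 1500, 1: 1200, 2: 900}
positions = {0: 1450, 1: 700, 2: 900}

lag = records_lag(ends, positions)
lag_max = max(lag.values(), default=0)
print(lag, lag_max)
```

Here partition 1 dominates the maximum, which is the kind of skew that adding consumers or partitions is meant to relieve.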
Log Flush Latency
Kafka stores data by appending new records to existing log files. Cache-based writes are flushed to physical storage asynchronously, based on several Kafka internal parameters, to balance performance and durability.
You can force this write by using the “fsync” system call. You can tell Kafka when to flush by using the log.flush.interval.messages and log.flush.interval.ms settings.
The Kafka documentation suggests that you should not set these and should instead rely on the operating system’s background flush capabilities, which are more efficient.
Log flush latency compares the time a flush actually took with the time it was expected to take. High latency can suggest the need for greater replication and scale or quicker storage, or it can point to a hardware issue.
kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
Offline Partition Count
Offline partitions are partitions that are unavailable to your applications, typically as a result of a server failure or restart.
In a Kafka Cluster, one of the brokers acts as the controller, maintaining the statuses of partitions and replicas as well as reassigning partitions as necessary.
If the number of partitions on a single server fluctuates because of cloud connectivity problems or other temporary network events, a rise in this count might suggest that partition replication needs to be increased.
If the offline partition count rises above zero, those partitions have no active leader and cannot serve reads or writes, so investigate immediately. It may also suggest that the number of brokers needs to be increased.
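A minimal Python sketch of the count itself, under the assumption that a partition is offline when it has no leader. The `leaders` mapping, topic names, and broker IDs are hypothetical, and `None` is used here simply to stand for "no leader".

```python
from typing import Dict, Optional, Tuple


def offline_partition_count(leaders: Dict[Tuple[str, int], Optional[int]]) -> int:
    """Count partitions that currently have no leader broker."""
    return sum(1 for leader in leaders.values() if leader is None)


# Hypothetical view of partition leadership: (topic, partition) -> leader broker ID.
leaders = {
    ("orders", 0): 1,
    ("orders", 1): None,   # no leader, so producers and consumers cannot use it
    ("payments", 0): 3,
}

count = offline_partition_count(leaders)
print("ALERT" if count > 0 else "OK", count)
```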
Total Time to Service a Request
The TotalTimeMs metric family measures the total time taken to service a request. It is reported for the following request types:
- produce: requests from producers to send data.
- fetch-consumer: requests from consumers to get new data.
- fetch-follower: requests from brokers that are the followers of a partition to get new data.
The TotalTimeMs metric is the sum of four metrics:
- queue: the amount of time spent waiting for a request to be processed.
- local: time spent being processed by the leader.
- remote: time spent waiting for a response from a follower.
- response: the time to send the response.
This number should be reasonably constant under normal conditions, with only minor fluctuations.
If you notice unusual behavior, investigate the individual queue, local, remote, and response values to determine which segment of the request is causing the problem.
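The breakdown above lends itself to a simple check: add up the four segments and see which one dominates. The Python sketch below uses made-up timings for a single request, and the function names are illustrative only.

```python
from typing import Dict


def total_time_ms(segments: Dict[str, float]) -> float:
    """Sum of the queue, local, remote, and response times for one request."""
    return sum(segments.values())


def slowest_segment(segments: Dict[str, float]) -> str:
    """Name of the segment that contributes the most time."""
    return max(segments, key=lambda name: segments[name])


# Hypothetical timings, in milliseconds.
produce_request = {"queue": 2.0, "local": 5.0, "remote": 40.0, "response": 1.0}

print(total_time_ms(produce_request), slowest_segment(produce_request))
```

In this made-up case most of the time is spent in the remote segment, so the followers would be the first place to look.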
This brings us to the end of the list of Kafka Metrics. By now, you should have a good sense of how helpful Kafka Metrics are for assessing the performance of your data streaming processes.
Frequently Asked Questions (FAQs)
What are the key Kafka Metrics to monitor?
A few Kafka Metrics that can help you monitor Kafka are listed below:
- Network handler idle time
- Request handler idle time
- Under-Replicated partitions
- Leader Elections
- CPU idle time
- Host Network in/out
- Messages in/out
How to enable Kafka Metrics Reporter?
The Metrics Reporter is not enabled by default. To enable it, edit each Kafka Broker’s server.properties file and set the metric.reporters and confluent.metrics.reporter.bootstrap.servers configuration parameters.
How to collect metrics from Kafka?
To collect Kafka Performance Metrics, you can do the following:
- Collect native Kafka Performance Metrics from Kafka and ZooKeeper using tools such as JConsole, JMX, and Burrow.
- Monitor Kafka’s page cache using cachestat.
- Collect ZooKeeper Metrics using tools such as JConsole, ZooKeeper’s “four letter words,” and the ZooKeeper AdminServer.
- Set up production-ready Kafka Performance monitoring using Datadog to collect and view metrics from your Kafka deployment.
How to test Kafka Performance?
The kafka-*-perf-test tools, kafka-producer-perf-test and kafka-consumer-perf-test, help you test Kafka performance in the following ways:
- Measuring read/write throughput.
- Stress-testing the cluster based on various parameters.
- Load testing to evaluate the impact of cluster configuration changes.
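As a starting point, the Python sketch below assembles a kafka-producer-perf-test command line without running it. The broker address, topic name, and record counts are placeholders, and the executable name may differ in your installation (some distributions ship it with a .sh suffix).

```python
import shlex
from typing import List


def producer_perf_test_command(
    bootstrap_servers: str,
    topic: str,
    num_records: int,
    record_size_bytes: int,
    throughput: int = -1,
) -> List[str]:
    """Build (but do not run) a kafka-producer-perf-test invocation."""
    return [
        "kafka-producer-perf-test",
        "--topic", topic,
        "--num-records", str(num_records),
        "--record-size", str(record_size_bytes),
        "--throughput", str(throughput),  # -1 asks the tool not to throttle
        "--producer-props", f"bootstrap.servers={bootstrap_servers}",
    ]


# Placeholder values: adjust the broker address and topic for your cluster.
command = producer_perf_test_command("localhost:9092", "perf-test", 100_000, 1024)
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so it can later be passed to a process runner without shell quoting issues.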
How to optimize Kafka Performance?
There are several ways to optimize the performance of a Kafka cluster. For example, increasing the number of partitions and brokers in a cluster increases the parallelism of message consumption and helps balance throughput and latency.
You must continually review Kafka’s status and efficiency to ensure that the applications relying on it run smoothly.
This is where Kafka Metrics come into the picture. This article covered a few Kafka Metrics and showed how monitoring your Kafka instance can help you achieve better application performance, which is a useful starting point if you are new to Kafka Monitoring.
As huge volumes of data flow from several data sources straight into Kafka via Kafka Connectors, it becomes essential to keep track of Kafka Metrics to monitor performance.
However, in businesses, extracting complex data from a diverse set of Data Sources can be a challenging task and this is where Hevo saves the day!
Take Hevo for a spin!
Sign up for a 14-day free trial today.
Also, do check out our plans & pricing for different use cases and business needs.
Share your experience of working with Kafka Monitor in the comments section below and let us know about other important Kafka Monitoring Metrics (if we missed any 😉).