Key Kafka Monitoring Metrics for Optimal Performance

on Data Integration, Data Streaming, Kafka, Kafka Producers • January 27th, 2022 • Write for Hevo


Apache Kafka is a well-known open-source Data Streaming Platform that enables high-throughput Data Pipelines. As a critical component of the IT infrastructure, it is necessary to have a dedicated tool or feature to track Kafka’s operations and their efficiency. Kafka exposes performance metrics that help you identify bottlenecks and determine which components require corrective action. This blog will introduce you to some of the most important Kafka Monitoring Metrics that you should track to achieve peak performance.

Monitoring is a critical task while operating Kafka or Kafka Applications, not only to troubleshoot problems that have already occurred but also to discover anomalous behavior patterns and prevent problems from occurring in the first place. When adopting Kafka, Developers have access to a plethora of Kafka Monitoring Metrics, but this abundance can be daunting, making it difficult to know where to begin and how to correctly use them. Before getting into Kafka Monitoring, let’s discuss this robust platform in brief.


What is Apache Kafka?


Apache Kafka is an open-source Distributed Publish-Subscribe Messaging Platform designed to handle real-time streaming data for distributed streaming, pipelining, and data feed replay for rapid, scalable operations. Kafka is a broker-based system that functions by storing data streams as records in a cluster of Computer Servers. Kafka Servers may span many Data Centers and offer data durability by storing streams of records (messages) in topics across numerous server instances. A topic stores records (messages) as a series of immutable tuples, each consisting of a key, a value, and a timestamp.


In its most basic form, a producer application writes a message to a Kafka Topic, where it is stored on a Server or Broker. An interested consumer subscribes to the required topic and begins consuming messages from the Kafka Server.

Key Features of Apache Kafka

Below are the key reasons behind Kafka’s immense popularity.

  • High Scalability: Kafka’s partitioned log architecture distributes data over several servers, enabling it to scale beyond the limits of a single server. Because it divides data streams, Kafka provides low latency and high throughput.
  • Low Latency: Apache Kafka offers exceptionally low end-to-end latency, as low as 10 milliseconds, even for large volumes of data. This means that a data record written to Kafka can be retrieved quickly by the Consumer.
  • Comprehensive Monitoring: Kafka is a popular tool for tracking operational data. This necessitates gathering data from several apps and aggregating it into consolidated feeds with analytics.
  • Fault-Tolerant: Kafka protects data from server failure and makes it fault-tolerant by spreading partitions and copying data over other servers. Kafka Clusters are very scalable and durable, i.e., if one of the servers fails, the others take over to guarantee that the system continues to operate without interruption and without data loss.
  • Multiple Integrations: Kafka can connect with a number of Data-Processing Frameworks and Services, such as Apache Spark, Apache Storm, Hadoop, and Amazon Web Services. By connecting them with various applications, you can easily integrate Kafka’s capabilities into your real-time data pipelines.

Now that you’re familiar with what Apache Kafka is, let’s dive straight into Kafka Monitoring Metrics.

Simplify Apache Kafka Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data is a No-code Data Pipeline that offers a fully managed solution to set up data integration from Apache Kafka and 100+ Data Sources (including 30+ Free Data Sources) and will let you directly load data to a Data Warehouse or the destination of your choice. It will automate your data flow in minutes without writing any line of code. Its fault-tolerant architecture makes sure that your data is secure and consistent. Hevo provides you with a truly efficient and fully automated solution to manage data in real-time and always have analysis-ready data.

Get started with Hevo for free

Let’s look at some of the salient features of Hevo:

  • Fully Managed: It requires no management and maintenance as Hevo is a fully automated platform.
  • Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to transfer.
  • Real-Time: Hevo offers real-time data migration. So, your data is always ready for analysis.
  • Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
  • Scalable Infrastructure: Hevo has in-built integrations for hundreds of sources that can help you scale your data infrastructure as required.
  • Live Monitoring: Advanced monitoring gives you a one-stop view to watch all the activities that occur within Data Pipelines.
  • Live Support: Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Sign up here for a 14-day free trial!

Kafka Monitoring Metrics


Metrics record a value about your systems at a certain moment in time, such as the number of people that are currently checking out their cart from a website. As a result, metrics are typically collected once per second, once per minute, or at another regular interval to track the performance of a system over time. Kafka Monitoring Metrics quantify how effectively a component performs its function, e.g., network latency. 

A well-functioning Kafka Cluster can manage a large volume of data. It is critical to monitor the health of your Kafka deployment in order to ensure that the apps that rely on it continue to run reliably. Here are some of the key Kafka Monitoring Metrics that can help users gauge the performance of the data streaming processes. 

Network Request Rate

Kafka Brokers can be sources of significant network traffic since their objective is to collect and transport data for processing. By analyzing the number of network requests per second, you may monitor and compare the network throughput per server in each of your Data Centers and Cloud providers. When a broker’s network bandwidth exceeds or falls below a certain threshold, it might signal the need to increase the number of brokers or that some issue is creating a delay. It may also signal the need to create (or change) a consumer back-off protocol to better handle the data rates.

kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower}
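A threshold check on this metric might look like the following minimal sketch. The MBean is typically read over JMX; here `broker_rates` is assumed to already hold sampled requests/sec per broker, and the threshold values are illustrative, not Kafka defaults:

```python
# Flag brokers whose network request rate falls outside an expected band.
# The thresholds below are illustrative placeholders; tune them to your cluster.
LOW_THRESHOLD = 100       # may indicate stalled producers/consumers or a delay
HIGH_THRESHOLD = 50_000   # may indicate the need to add brokers

def check_request_rates(broker_rates):
    """Return (broker_id, rate, reason) tuples for brokers outside the band."""
    alerts = []
    for broker_id, rate in broker_rates.items():
        if rate < LOW_THRESHOLD:
            alerts.append((broker_id, rate, "below expected rate"))
        elif rate > HIGH_THRESHOLD:
            alerts.append((broker_id, rate, "above expected rate"))
    return alerts
```

A broker flagged on the high side might justify scaling out or adding a consumer back-off protocol, as described above.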

Under-replicated Partitions

You can select a replication factor per topic as needed to ensure data durability and that brokers are always available to serve data. This ensures that data gets duplicated across many brokers and remains available for processing even if one broker fails. The Kafka UnderReplicatedPartitions metric notifies you when a partition has fewer in-sync replicas than its configured replication factor. As a general rule, there should be no under-replicated partitions in a running Kafka deployment (the number should always be zero), making this a critical Kafka Monitoring Metric.

kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
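Since the healthy value is always zero, the alerting logic is simple. A minimal sketch, assuming `fetch_metric` is a caller-supplied helper that reads the MBean value (e.g., over JMX):

```python
def check_under_replicated(fetch_metric):
    """Alert when any partitions are under-replicated.

    `fetch_metric` is a caller-supplied function returning the current value of
    kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions.
    """
    count = fetch_metric()
    if count > 0:
        return f"ALERT: {count} under-replicated partition(s)"
    return "OK: all partitions fully replicated"
```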

Response Rate

The response rate shows the rate at which producers receive responses from brokers. Brokers respond to producers once the data is received. Depending on your settings, “received” might imply one of three things:

  • The message was received but not committed (request.required.acks == 0).
  • The message has been written to the leader’s local log (request.required.acks == 1).
  • The leader has received confirmation from all replicas that the data has been written to disk (request.required.acks == all).
kafka.producer:type=producer-metrics,client-id=([-.\w]+)
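The three acknowledgement levels above can be summarized in code. The following is a small sketch of our own (the mapping simply mirrors the semantics described in the bullets; the helper name is illustrative):

```python
# Map producer acks settings to the durability guarantee a broker
# response actually implies (per the acknowledgement levels above).
ACKS_SEMANTICS = {
    "0": "message sent; no broker acknowledgement, commit not guaranteed",
    "1": "leader has written the message to its local log",
    "all": "leader plus all replicas have confirmed the write",
}

def describe_acks(acks):
    """Return the guarantee implied by a producer acks setting."""
    return ACKS_SEMANTICS.get(str(acks), "unknown acks setting")
```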

Log Flush Latency

Kafka persists data by appending new records to its log segment files. Cache-based writes are flushed to physical storage asynchronously based on numerous Kafka internal parameters to balance performance and durability. You can force this write by using the “fsync” system call. You can tell Kafka when to flush by using the log.flush.interval.messages and log.flush.interval.ms settings. The Kafka documentation suggests that you should not set these and instead rely on the operating system’s background flush capabilities, which are more efficient.

Latency is calculated by comparing the actual log flush time to the scheduled time. Rising latency can indicate the need for greater replication and scale, or faster storage, or may point to a hardware issue.

kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
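The latency computation described above is just the difference between when a flush was scheduled and when it actually ran. A minimal sketch, assuming you already have (scheduled, actual) timestamp pairs in milliseconds from your metrics collector:

```python
def flush_latency_ms(scheduled_ms, actual_ms):
    """Latency of a single log flush: how late it ran versus its schedule."""
    return actual_ms - scheduled_ms

def max_flush_latency(flushes):
    """Worst-case flush latency across (scheduled_ms, actual_ms) samples."""
    return max(flush_latency_ms(s, a) for s, a in flushes)
```

A steadily climbing maximum here is the signal that storage, replication, or hardware needs attention.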

I/O Wait Time

In general, producers do one of two things: wait for data or send data. When producers generate more data than they can transfer, they must wait for network resources. However, if producers aren’t rate-limited or using all of their available bandwidth, identifying the bottleneck becomes more difficult. Checking I/O wait times on your producers is a smart place to start because disk access is typically the slowest portion of any processing activity. In this case, I/O wait indicates the percentage of time spent doing I/O while the CPU was idle. Excessive wait times indicate that your producers are unable to obtain the data they want in a timely fashion.

kafka.producer:type=producer-metrics,client-id=([-.\w]+)
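As a sketch of the quantity being discussed: I/O wait as a fraction of a measurement window can be computed from two counters your collector already samples. The function below is illustrative, not part of the Kafka client API:

```python
def io_wait_ratio(io_wait_time_ns, window_ns):
    """Fraction of the measurement window spent waiting on I/O.

    io_wait_time_ns: nanoseconds the producer spent blocked on I/O.
    window_ns: total length of the measurement window in nanoseconds.
    """
    if window_ns <= 0:
        raise ValueError("measurement window must be positive")
    return io_wait_time_ns / window_ns
```

A ratio persistently near 1.0 would suggest producers are starved for data or storage bandwidth.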

You’re now halfway through the list of Kafka Monitoring Metrics, let’s discuss the other half in detail.

Leader Election Rate

When a Partition Leader dies, an election is called to choose a replacement leader. A Partition Leader is considered “dead” if it fails to maintain its ZooKeeper session. LeaderElectionRateAndTimeMs displays the rate of leader elections (per second) as well as the overall amount of time the cluster spent without a leader (in milliseconds). It is important to note that a leader election occurs when contact with the existing leader is lost, which might result from an offline broker.

kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs

Total Time to Service a Request

The TotalTimeMs metric family calculates the total time required to service a request such as:

  • produce: requests from producers to send data.
  • fetch-consumer: requests from consumers to get new data.
  • fetch-follower: requests from brokers that are the followers of a partition to get new data.

The TotalTimeMs metric is the sum of four metrics:

  1. queue: the time the request spends waiting in the request queue.
  2. local: time spent being processed by the leader.
  3. remote: time spent waiting for a response from a follower. 
  4. response: the time to send the response.

This number should be reasonably constant under typical settings, with just minor changes. If you notice unusual behavior, you should investigate the individual queue, local, remote, and response values to determine the particular request segment that is causing the problem.

kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}
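The four-way breakdown described above can be sketched in a few lines. This is our own illustration of the arithmetic, with the segment values assumed to come from your metrics collector:

```python
def total_time_breakdown(queue_ms, local_ms, remote_ms, response_ms):
    """Sum the four TotalTimeMs segments and name the slowest one.

    Returns (total_ms, slowest_segment) so an anomalous total can be
    traced to the particular request segment causing the problem.
    """
    segments = {
        "queue": queue_ms,       # waiting in the request queue
        "local": local_ms,       # processing by the leader
        "remote": remote_ms,     # waiting on follower responses
        "response": response_ms, # sending the response
    }
    total = sum(segments.values())
    slowest = max(segments, key=segments.get)
    return total, slowest
```

For example, a large `remote` share points at slow followers or replication, while a large `queue` share points at an overloaded broker.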

BytesInPerSec/BytesOutPerSec

In general, the biggest barrier to Kafka’s performance is disk throughput. If you are delivering messages across Data Centers, if your topics have a high number of consumers, or if your replicas are catching up to their leaders, network throughput can also have an impact on Kafka’s performance. Tracking network performance on your brokers provides extra information about potential bottlenecks and might help you decide whether or not to implement end-to-end message compression.

kafka.server:type=BrokerTopicMetrics,name={BytesInPerSec|BytesOutPerSec}
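These MBeans report rates derived from cumulative byte counters. As a minimal sketch of the same computation done by hand, assuming two successive counter readings from your collector:

```python
def throughput_bps(prev_bytes, curr_bytes, interval_s):
    """Per-second throughput from two successive readings of a byte counter.

    prev_bytes / curr_bytes: cumulative BytesInPerSec-style counter values.
    interval_s: seconds elapsed between the two readings.
    """
    if interval_s <= 0:
        raise ValueError("sampling interval must be positive")
    return (curr_bytes - prev_bytes) / interval_s
```

Comparing inbound and outbound throughput per broker is one way to spot the replication or fan-out bottlenecks mentioned above.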

Records Lag

The estimated difference between a consumer’s current log offset and a producer’s current log offset is referred to as records lag. The maximum observed value of records lag is given by records lag max. The relevance of these metrics’ values is entirely dependent on what your consumers do. If your consumers back up old messages to long-term storage, you might expect considerable records lag. However, if your consumers are processing real-time data, consistently high lag values might indicate that they are overwhelmed, in which case both adding consumers and distributing topics across more partitions could help enhance throughput and minimize latency.

kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+),partition=([-.\w]+)
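The definition above reduces to simple offset arithmetic. A minimal sketch, assuming per-partition (log end offset, committed consumer offset) pairs are already available from your collector:

```python
def records_lag(log_end_offset, consumer_offset):
    """Lag for one partition: records produced but not yet consumed."""
    return log_end_offset - consumer_offset

def records_lag_max(offsets):
    """Maximum lag across (log_end_offset, consumer_offset) pairs."""
    return max(records_lag(end, cur) for end, cur in offsets)
```

For a real-time workload, alerting when `records_lag_max` stays high over several sampling intervals is a practical starting point.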

Offline Partition Count

Offline partitions are partitions that are unavailable to your applications as a result of a server failure or restart. In this scenario, one of the brokers in a Kafka Cluster acts as the controller, maintaining the statuses of partitions and replicas as well as reassigning partitions as necessary.

If this count fluctuates owing to factors such as Cloud connectivity or other transient network events, the rise might suggest that partition replication needs to be increased. If the offline partition count stays above zero, the number of brokers may need to be increased as well, because fetches cannot keep up with the volume of incoming messages.

kafka.controller:type=KafkaController,name=OfflinePartitionsCount

This brings us to the end of the list of Kafka Monitoring Metrics. By now, you should understand how helpful Kafka Monitoring Metrics are for assessing the performance of your data streaming processes.

Conclusion

You must constantly review Kafka’s status and efficiency to guarantee that applications relying on it run smoothly. This is where Kafka Monitoring Metrics come into the picture. If you’re just getting started with Kafka Monitoring, the metrics covered in this article highlight how monitoring your Kafka instance can help you achieve better application performance. However, in businesses, extracting complex data from a diverse set of Data Sources can be a challenging task, and this is where Hevo saves the day!

Visit our website to explore Hevo

Hevo Data, with its strong integration with 100+ Data Sources such as Apache Kafka and various BI tools, allows you to not only export data from sources & load it into destinations, but also transform & enrich your data and make it analysis-ready, so that you can focus only on your key business needs and perform insightful analysis using BI tools.

Give Hevo Data a try and sign up for a 14-day free trial today. Hevo offers plans & pricing for different use cases and business needs, check them out!

Share your experience of working with Kafka Monitor in the comments section below. Also, let us know about other important Kafka Monitoring Metrics that we might have missed in this piece.
