What is Kafka Streams: A Comprehensive Guide 101

Aman Sharma • Last Modified: December 29th, 2022

Kafka Streams - Featured Image

Many Fortune 100 brands such as Twitter, LinkedIn, Airbnb, and several others have been using Apache Kafka for multiple projects and communications. This sudden credibility shift to Kafka sure makes one question the reason for this growth.

Developed in 2010, Kafka was rendered by a LinkedIn team, originally to solve latency issues for the website and its infrastructure. It proved to be a credible solution for offline systems and had an effective use for the problem at hand. In 2011, Kafka was used as an Enterprise Messaging Solution for fetching data reliably and moving it in real-time in a batch-based approach.

Since then Kafka Streams have been used increasingly, each passing day to follow a robust mechanism for data relay. Read along to know more about Apache Kafka and access a detailed guide for working with Kafka Streams.

Table of Contents

What is Apache Kafka?

Kafka Streams: Apache Kafka | Hevo Data
Image Source

Apache Kafka was developed as a publish-subscribe messaging system that now serves as a distributed event streaming platform capable of handling trillions of events in a single day. It offers persistent and scalable messaging that is reliable for fault tolerance and configurations over long periods.

What makes Kafka stand out is the mechanism of writing messages to a topic from where it can be read or derived. These messages can be retained for extended periods of time by applications that can reprocess these to deliver the details. 

Kafka can handle huge volumes of data and remains responsive, this makes Kafka the preferred platform when the volume of the data involved is big to huge. Hence, Kafka Big Data can be used for real-time analysis as well as to process real-time streams to collect Big Data. 

Kafka offers some distinct benefits over standard messaging brokers.

  • It allows de-bulking of the load as no indexes are required to be kept for the message. 
  • It enhances stream efficiency and gives a no-buffering experience to end-users.
  • All data logs are kept with a punched time without any data deletion taking place. Thus, it reduces the risk of data loss.

Due to these performance characteristics and scalability factors, Kafka has become an effective big data solution for big companies, looking to channelize their data fast and efficiently. It eventually brought in video stream services, such as Netflix, to use Kafka as a primary source of ingestion.

However, extracting data from Kafka and integrating it with data from all your sources can be a time-consuming & resource-intensive job. Alternatively, you can opt for a more economical & effortless Cloud-Based No-code ETL solution like Hevo Data that supports Kafka and 100+ other data sources to seamlessly load data in real-time into a Data Warehouse or a destination of your choice.

Limitations of Stream Processing

When it comes to real-time stream processing, some typical challenges during stream processing are as follows:

  • There are bulk tasks that are at a transient stage over different machines and need to be scheduled efficiently and uniformly.
  • Stream processors only retain just the adequate amount of data to fulfill the criteria of all the window-based queries active in the system, resulting in not-so-efficient memory management.
  • Most frameworks have to resort to code serializations and consequent transmission over a network. 
  • For a real-time or dynamic package transfer, the code has to be sent and deployed at individual machines along with the prerequisites needed to execute the same.
  • Real-time BI visualization requires data to be stored first in a table, which introduces latency and table management issues, particularly with data streams.

Replicate Kafka Data in Minutes using Hevo’s Data Pipelines

Hevo can be your go-to tool if you’re looking for Data Replication from 100+ Data Sources (including 40+ Free Data Sources) like Kafka into Redshift, Databricks, Snowflake, and many other databases and warehouse systems. To further streamline and prepare your data for analysis, you can process and enrich Raw Granular Data using Hevo’s robust & built-in Transformation Layer without writing a single line of code!

With Hevo in place, you can reduce your Data Extraction, Cleaning, Preparation, and Enrichment time & effort by many folds! In addition, Hevo’s native integration with BI & Analytics Tools will empower you to mine your replicated data to get actionable insights. With Hevo as one of the best Kafka Replication tools, replication of data becomes easier.

Try our 14-day full access free trial today!

Get Started with Hevo for Free

What is Kafka Streams?

Kafka Streams: Kafka Streams logo | Hevo Data
Image Source

Kafka Streams is a super robust world-class horizontally scalable messaging system. In other words, Kafka Streams is an easy data processing and transformation library within Kafka.

Kafka Streams offer a framework and clutter-free mechanism for building streaming services. You can club it up with your application code, and you’re good to go!

Here is what Kafka brings to the table to resolve targeted streaming issues:

  • Kafka Streams app imparts a uniform processing load, as a new iteration of the running components adds with each new instance. This instance can be recreated easily even when moved elsewhere, thus, making processing uniform and faster.
  • While a certain local state might persist on disk, any number of instances of the same can be created using Kafka to maintain a balance of processing load. 
  • Kafka Streams leave you windowing with out-of-order data using a DataFlow-like model.
  • Kafka combines the concept of streams and tables to simplify the processing mechanism further. 

Why are Kafka Streams Needed?

Ideally, Stream Processing platforms are required to provide integration with Data Storage platforms, both for stream persistence, and static table/data stream joins.

However, the fault-tolerance and scalability factors are staunchly limited in most frameworks. The deployment, configuration, and network specifics can not be controlled completely. 

You can overcome the challenges of Stream Processing by using Kafka Streams which offer more robust options to accommodate these requirements. Kafka Streams come with the below-mentioned advantages.

  • Kafka Streams comes with a fault-tolerant cluster architecture that is highly scalable, making it suitable for handling hundreds of thousands of messages every second.
  • Kafka Streams can be connected to Kafka directly and is also readily deployable on the cloud.
  • Kafka Streams can be accessed on Linux, Mac, and Windows Operating Systems, and by writing standard Java or Scala scripts.
  • Kafka Streams handles sensitive data in a very secure and trusted way as it is fully integrated with Kafka Security.

Kafka Streams Architecture

Here is the anatomy of an application that leverages the Streams API. This provides a logical view of Kafka Streams application that can contain multiple stream threads, which can in turn contain multiple stream tasks.

Kafka Streams: Kafka Streams Architecture | Hevo Data
Image Source

A Processor topology (or topology in simple terms) is used to define the Stream Processing Computational logic for your application. It refers to the way in which input data is transformed to output data.

A topology is a graph of nodes or stream processors that are connected by edges (streams) or shared state stores. In this topology, you can access the following two special processors:

  • Sink Processor: A Sink Processor is a kind of stream processor that does not include downstream processors. Sink Processors send any received records from their upstream processors to a particular Kafka topic.
  • Source Processor: A Source Processor is a kind of stream processor that does not leverage any upstream processors. Instead, it produces an input stream to its topology from one or multiple Kafka topics by extracting records from these topics and forwarding them to its downstream processors.

A Stream Processing Application can be used to define one or more such topologies, although it is generally used to define one specific topology. Developers can define topologies either through the low-level processor API or through the Kafka Streams DSL, which incrementally builds on top of the former.

Therefore, you can define processor topology as a logical abstraction for your Stream Processing code. The logical topology gets instantiated at runtime and is replicated within the application for parallel processing.

Kafka Streams: Processor Topology | Hevo Data
Image Source

Key Concepts of Kafka Streams

Kafka Streams is a popular client library used for processing and analyzing data present in Kafka. To understand it clearly, check out its following core stream processing concepts:


An important principle of stream processing is the concept of time and how it is modeled and integrated. You can classify Kafka Streams for the following time terms:

  • Event Time: This is the time when an event or record occurred and was then recorded “at the source”. For instance, you can consider the event time related to a GPS sensor of a car detecting a change in the location.
  • Processing Time: This is the time when a recorded event is processed(consumed) by a stream processing application. This time can range from milliseconds to hours to days, thereby being often slower than the original event time. For instance, the processing time of the analysis application that reads and processes geolocation data recorded by car sensors.
  • Ingest Time: This time represents the instance when an event is stored in the topic partition by the Kafka broker. The time of the event ingestion is the one when the dataset is added to the target topic by the Kafka broker & not when the dataset is created “at source”. It is also different from the processing time as an event may be stored in the topic portion but is not processed.


Providing SerDes (Serializer / Deserializer) for the data type of the record key and record value (eg java.lang.String) is essential for each Kafka Streams application to materialize the data as needed. SerDes information is important for operations such as stream (), table (), to (), through (), groupByKey (), and groupBy (). Kafka Streams allows you to deploy SerDes using any of the following methods:

  • Set the default SerDes via the StreamsConfig instance.
  • Explicitly specify SerDes when calling the corresponding API method, overriding the default.

DSL Operations

To define the Stream processing Topology, Kafka streams provides Kafka Streams DSL(Domain Specific Language) that is built on top of the Streams Processor API. Recommended for beginners, the Kafka DSL code allows you to perform all the basic stream processing operations:

  • Kafka Streams DSL supports a built-in abstraction of streams and tables in the form of KStream, KTable, Global KTable. This best-in-class support for streams and tables is extremely important, as in most practical scenarios you will require a combination of both, not just streams or databases/tables.
  • You can leverage the declarative functional programming style with stateless transformations (eg maps and filters) and stateful transformations such as aggregation (eg counting and reducing), joins (eg leftJoin), and windowing (eg session windows).

Scaling Kafka Streams

You can easily scale Kafka Streams applications by balancing load and state between instances in the same pipeline. Once the aggregated results are distributed among the nodes, Kafka Streams allows you to find out which node is hosting the key so that your application can collect data from the right node or send clients to the right node.

Interactive Queries

For applications that reside in a large number of distributed instances, each including a locally managed state store, it is useful to be able to query the application externally. To this end, Kafka Streams makes it possible to query your application with interactive queries. The Kafka Streams API enables your applications to be queryable from outside your application.

Developers can effectively query the local state store of an application instance, such as a local key-value store, a local window store, or a local user-defined state store. Alternatively, developers can add an RPC(Remote Procedure Call) layer to their application(for instance REST API), expose the application’s RPC endpoint, discover the application instance and its local state store, and query the remote state store for the entire app.

Stream Threading

In Kafka Streams, you can set the number of threads used for parallel processing of application instances. This allows threads to independently perform one or more stream jobs. Threads do not share state, so no coordination between threads is required. Kafka Streams automatically handles the distribution of Kafka topic partitions to stream threads.

Launching more stream threads or more instances of an application means replicating the topology and letting another subset of Kafka partitions process it effectively parallelizing the process.

Stream-Table Duality

Practical use cases demand both the functionalities of a stream and a table. Kafka Streams provides this feature via the Stream Table Duality. This close relationship between streams and tables can be seen in making your applications more elastic, providing fault-tolerant stateful processing, or executing Kafka Streams Interactive Queries against your application’s processing results.

  • Streams can be viewed as a history of table changes. Each record in the stream records a change in the state of the table. Therefore, a stream acts as a table that can easily be turned into a “real” table by repeating the changelog from start to finish and rebuilding the table.
  • Similarly, the table can be viewed as a snapshot of the last value of each key in the stream at a particular point in time (the record in the stream is a key/value pair). Therefore, the table can also behave as a stream that can be easily converted to a “real” stream by iterating over each key-value entry in the table.

What is Kafka Streams API?

Kafka Streams API can be used to simplify the Stream Processing procedure from various disparate topics. It can provide distributed coordination, data parallelism, scalability, and fault tolerance.

This API leverages the concepts of partitions and tasks as logical units that are strongly linked to the topic partitions and interact with the cluster.

A distinct feature of Kafka Streams API is that the applications that you build with it are just normal Java applications that can be deployed, packaged, or monitored just like any other Java application.

Kafka Streams API: Use Cases

Here are a few handy Kafka Streams examples that leverage Kafka Streams API to simplify operations:

  • Finance Industry can build applications to accumulate data sources for real-time views of potential exposures. It can also be leveraged for minimizing and detecting fraudulent transactions.
  • Travel companies can build applications with the API to help them make real-time decisions to find the best suitable pricing for individual customers. This allows them to cross-sell additional services and process reservations and bookings.
  • Retailers can leverage this API to decide in real-time on the next best offers, pricing, personalized promotions, and inventory management.
  • It can also be used by logistics companies to build applications to track their shipments reliably, quickly, and in real-time.
  • Manufacturing and automotive companies can easily build applications to ensure their production lines offer optimum performance while extracting meaningful real-time insights into their supply chains. The API can also be leveraged to monitor the telemetry data from linked cars to make a decision as to the need for a thorough inspection.

What makes Hevo’s Data Replication Experience Best in Class?

Replicating data can be a tiresome task without the right set of tools. Hevo’s Data Replication & Integration platform empowers you with everything you need to have a smooth Data Collection, Processing, and Replication experience. Our platform has the following in store for you!

  • Exceptional Security: A Fault-tolerant Architecture that ensures Zero Data Loss.
  • Built to Scale: Exceptional Horizontal Scalability with Minimal Latency for Modern-data Needs.
  • Built-in Connectors: Support for 100+ Data Sources, including Kafka, Databases, SaaS Platforms, Files & More. Native Webhooks & REST API Connector available for Custom Sources.
  • Data Transformations: Best-in-class & Native Support for Complex Data Transformation at fingertips. Code & No-code Flexibility ~ designed for everyone.
  • Smooth Schema Mapping: Fully-managed Automated Schema Management for incoming data with the desired destination.
  • Blazing-fast Setup: Straightforward interface for new customers to work on, with minimal setup time.
Sign up here for a 14-Day Free Trial!

Working With Kafka Streams API

To start working with Kafka Streams API you first need to add Kafka_2.12 package to your application. You can avail of this package in maven:


Next, you need to make a file streams.properties with the snippet as mentioned below. You need to make sure that you’ve replaced the bootstrap.servers list with the IP addresses of your chosen cluster:

# Kafka broker IP addresses to connect to

# Name of our Streams application

# Values and Keys will be Strings

# Commit at least every second instead of default 30 seconds

To leverage the Streams API with Instacluster Kafka, you also need to provide the authentication credentials. Add the following snippet to your streams.properties file while making sure that the truststore location and password are correct:

ssl.truststore.location = truststore.jks
ssl.truststore.password = instaclustr
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required 

To create the streams application, you need to load the properties mentioned earlier:

Properties props = new Properties();

try {
  props.load(new FileReader("streams.properties"));
} catch (IOException e) {

Make a new input KStream object on the wordcount-input topic:

StreamsBuilder builder = new StreamsBuilder();
KStream&lt;String, String&gt; source = builder.stream("wordcount-input");

Make the word count KStream that will calculate the number of times every word occurs:

final Pattern pattern = Pattern.compile("W+");

KStream counts &nbsp;= source.flatMapValues(value-> Arrays.asList(pattern.split(value.toLowerCase())))
      .map((key, value) -> new KeyValue<Object, Object>(value, value))
      .groupByKey() &nbsp;&nbsp;&nbsp;

You can then direct the output from the word count KStream to a topic named wordcount-output:


Lastly, you can create and start the KafkaStreams object:

KafkaStreams streams = new KafkaStreams(builder.build(), props);

How to Use Kafka Streams?

Kafka Streams gives you the ability to perform powerful data processing operations on Kafka data in real-time. Follow this detailed guide to understand the working of Kafka Streams along with some practical use for implementation.

What is a Stream? – Table and Stream Mechanism

A stream typically refers to a general arrangement or sequence of records that are transmitted over systems. In simple words, a stream is an unbounded sequence of events. You can think of this as just things happening in the world and all of these events are immutable.

Tables are an accumulated representation or collection of streams that are transmitted in a given order. You can think of a table as the current state of the world.  

A bit more technically, a table is a materialized view of that stream of events with only the latest value for each key. Typically, a table acts as an inventory where any process is triggered.

You can go with the tables for performing aggregation queries like average, mean, maximum, minimum, etc on your datasets. And, if you want to have series of events, a dashboard/analysis showing the change, then you can make use of streams.

Kafka allows you to compute values against tables with altering streams. It makes trigger computation faster, and it is capable of working with any data source. This table and stream duality mechanism can be implemented for quick and easy real-time streaming for all kinds of applications.

A basic Kafka Streams API build can easily be structured using the following syntax:

StreamsBuilder builder = new StreamsBuilder();
builder.stream("raw-movies", Consumed.with(Serdes.Long(), Serdes.String())).mapValues(Parser::parseMovie).map((key, movie) -> new KeyValue<>(movie.getMovieId(), movie)).to("movies", Produced.with(Serdes.Long(), movieSerde));

Topic Partitioning

A topic is basically a stream of data, just like how you have tables in databases, you have topics in Kafka. Topics are then split into what are called partitions. So, a partition is basically a part of the topic and the data within the partition is ordered.

Partitioning takes the single topic log and breaks it into multiple logs each of which can live on a separate node in the Kafka cluster. You can have as many partitions per topic as you want.

The benefits with Kafka are owing to topic partitioning where messages are stored in the right partition to share data evenly. It allows the data associated with the same anchor to arrive in order.

The portioning concept is utilized in the KafkaProducer class where the cluster address, along with the value, can be specified to be transmitted, as shown:

try (KafkaProducerString,<Payment> producer = new KafkaProducer<String, Payment>(props)) {
    for (long i = 0; i < 10; i++) {
        final String orderId = "id" + Long.toString(i);
        final Payment payment = new Payment(orderId, 1000.00d);
        final ProducerRecord<String, Payment> record = 
           new ProducerRecord<String, Payment>("transactions", 
} catch (final InterruptedException e) {

The same can be implemented for a KafkaConsumer to connect to multiple topics with the following code:

try (final KafkaConsumer<String, Payment> consumer = new KafkaConsumer<>(props)) {
    while (true) {
        ConsumerRecords<String, Payment> records = consumer.poll(100);
        for (ConsumerRecord<String, Payment> record : records) {
            String key = record.key();
            Payment value = record.value();
            System.out.printf("key = %s, value = %s%n", key, value);

Kafka Connect

Kafka Connect provides an ecosystem of pluggable connectors that can be implemented to balance the data load moving across external systems. In simple words, Kafka Connect is used as a tool for connecting different input and output systems to Kafka.

Using Connect, you can eliminate code with the use of JSON configurations only for the transmission.

A representation of these JSON key and value pairs can be seen as follows:

    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics"                          : "my_topic",
    "connection.url"                  : "http://elasticsearch:9200",
    "type.name"                       : "_doc",
    "key.ignore"                      : "true",
    "schema.ignore"                   : "true"

ksqlDB for Stream Processing 

The use of ksqlDB for stream processing applications enables the use of a REST interface for applications to renew stream processing jobs for faster query implementations.

These are defined in SQL and can be used across languages while building an application. For a Java alternative to implementing Kafka Stream features, ksqlDB (KSQL Kafka) can be used for stream processing applications.

The following ksqlDB code implements the Kafka Stream function:

CREATE TABLE rated_movies AS
   SELECT  title,
           sum(rating) / count(rating) AS avg_rating
   FROM ratings
   INNER JOIN movies ON ratings.movie_id = movies.movie_id
   GROUP BY title,

A robust code implementing Kafka Streams will cater to the above-discussed components for increased optimization, scalability, fault-tolerance, and large-scale deployment efficiency. It will look like the structure as shown below:

  import org.apache.kafka.common.serialization.Serdes;
                   import org.apache.kafka.common.utils.Bytes;
                   import org.apache.kafka.streams.KafkaStreams;
                   import org.apache.kafka.streams.StreamsBuilder;
                   import org.apache.kafka.streams.StreamsConfig;
                   import org.apache.kafka.streams.kstream.KStream;
                   import org.apache.kafka.streams.kstream.KTable;
                   import org.apache.kafka.streams.kstream.Materialized;
                   import org.apache.kafka.streams.kstream.Produced;
                   import org.apache.kafka.streams.state.KeyValueStore;
                   import java.util.Arrays;
                   import java.util.Properties;
                   public class WordCountApplication {
                       public static void main(final String[] args) throws Exception {
                           Properties props = new Properties();
                           props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-application");
                           props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker1:9092");
                           props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
                           props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
                           StreamsBuilder builder = new StreamsBuilder();
                           KStream<String, String> textLines = builder.stream("TextLinesTopic");
                           KTable<String, Long> wordCounts = textLines
                               .flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split("W+")))
                               .groupBy((key, word) -> word)
                               .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store"));
                           wordCounts.toStream().to("WordsWithCountsTopic", Produced.with(Serdes.String(), Serdes.Long()));
                           KafkaStreams streams = new KafkaStreams(builder.build(), props);

Code Snippets from Confluent and Kafka.

Difference between Kafka Streams and Kafka Consumer

Kafka Streams is an easy data processing and transformation library within Kafka used as a messaging service. Whereas, Kafka Consumer API allows applications to process messages from topics. Kafka Streams has a single stream to consume and produce, however, there is a separation of responsibility between consumers and producers in Kafka Consumer.

Kafka Streams is capable of performing complex processing but doesn’t support batch processing. Kafka Consumer supports only Single Processing but is capable of Batch Processing.

Kafka Streams supports stateless and stateful operations, but Kaka Consumer only supports stateless operations. Kafka Consumer offers you the capability to write in several Kafka Clusters, whereas Kafka Streams lets you interact with a single Kafka Cluster only.

Connecting Kafka Streams to Confluent Cloud

Here are the steps you can follow to connect Kafka Streams to Confluent Cloud:

  • Step 1: Create a java.util.Properties instance.
  • Step 2: Next, configure your streams application. Kafka and Kafka Streams Configuration options need to be configured in the java.util.Properties instance before you use Streams.

    For this example, you need to configure the Confluent Cloud broker endpoints StreamsConfig.BOOTSTRAP_SERVERS_CONFIG along with SASL config SASL_JAAS_CONFIG. Here is the code snippet for the same:
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.config.SaslConfigs;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// Comma-separated list of the Confluent Cloud broker endpoints. For example:
// r0.great-app.confluent.aws.prod.cloud:9092,r1.great-app.confluent.aws.prod.cloud:9093,r2.great-app.confluent.aws.prod.cloud:9094
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "<broker-endpoint1, broker-endpoint2, broker-endpoint3>");
props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
props.put(StreamsConfig.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
props.put(SaslConfigs.SASL_MECHANISM, "PLAIN");
props.put(SaslConfigs.SASL_JAAS_CONFIG, "org.apache.kafka.common.security.plain.PlainLoginModule required 
username="<api-key>" password="<api-secret>";");

// Recommended performance/resilience settings
props.put(StreamsConfig.producerPrefix(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG), 2147483647);
props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), 9223372036854775807);

// Any further settings
props.put(... , ...);


There are multi-fold features of Apache Kafka that can be harnessed in combination with just an application code. It is data secure, scalable, and cost-efficient for ready use in a variety of systems. In this article, you were introduced to Kafka Streams, a robust horizontally scalable messaging system. You were then provided with a detailed guide on how to use Kafka Streams.

Being open-source software, Kafka is a great choice for enterprises and also holds great untapped commercial potential. Kafka Streams are easy to understand and implement for developers of all capabilities and have truly revolutionized all streaming platforms and real-time processed events.

This article provided you with a detailed guide on Kafka streams, a robust and horizontally scalable messaging system. You were introduced to the important Kafka Stream concepts and were shown how to work around them.

Visit our Website to Explore Hevo

Hevo, with its strong integration with 100+ Data Sources & BI tools such as Kafka (Free Data Source), allows you to not only export data from sources & load data to the destinations, but also transform & enrich your data, & make it analysis-ready so that you can focus only on your key business needs and perform insightful analysis using BI tools.

Want to take Hevo for a ride? Sign Up for a 14-day free trial and simplify your Data Integration process. Hevo offers plans & pricing for different use cases and business needs, check them out!

Do you use Kafka? Share your experience of using Kafka Streams in the comment section below.

No-code Data Pipeline for Data Warehouse