How To Work With Confluent, Avro & Kafka Schema Registry

on Data Integration, Data Streaming, ETL Tutorials, Kafka • March 16th, 2022


Trillions of messages are streamed through Kafka servers every day. However, Kafka servers only store and organize the data received from producers; they are not responsible for evaluating the quality or appropriateness of incoming and outgoing data. This is the gap that Kafka Schema Registry fills.

Occasionally, a producer sends data in a wrong or unsupported format to a Kafka server. Consumers subscribed to the affected topic then receive inaccurate or unwanted data. To mitigate this issue, Kafka schema registry was introduced to store the schemas, or formats, of incoming and outgoing data.

With Kafka schema registry, only messages in a supported, agreed-upon data format are produced to Kafka servers, so consumers can reliably fetch correct data from the topics they subscribe to. In this article, you will learn about the Kafka schema registry and how to install, configure, and work with it.


What is Kafka?


Apache Kafka is a popular Distributed Data Streaming software that allows for the development of real-time event-driven applications. Being an open-source application, Kafka allows you to store, read, and analyze streams of data free of cost. Kafka is distributed, which means that it can run as a Cluster that spans multiple servers.

Leveraging its distributed nature, users can achieve high throughput and low latency, and can handle large volumes of data without any perceptible lag in performance.

Written in Scala and Java, Kafka accepts data from a large number of external Data Sources and organizes it into “Topics”. Kafka uses two kinds of clients, “Producers” and “Consumers”, to write, read, and process events.

Producers act as an interface between Data Sources and Topics, and Consumers allow users to read and transfer the data stored in Kafka.

The fault-tolerant architecture of Kafka is highly scalable and can handle billions of events with ease. In addition, Kafka is fast and delivers data records reliably.

Now that you’re familiar with Kafka, let’s dive straight into the Kafka schema registry.

Prerequisites

To get started with Kafka schema registry, a fundamental knowledge of Apache Kafka and real-time data streaming is a must.

Architecture

In this article, you will go through the steps to define an Avro data type for payment records and write those records to a Kafka topic. After that, you will implement a consumer that reads and collects the messages from the topic.

The collected messages are then exposed to another resource using Server-Sent Events. The payment records are serialized and deserialized via Avro. The schema that describes the payment record is stored in Confluent Schema Registry, and the Confluent Avro Serde uses it to serialize and deserialize the records.

Data Serialization Formats

Kafka Schema Registry supports multiple Data Serialization Formats. There are a few points that you should consider while choosing the right Data Serialization Format.

Name              Binary   Schema – Interface Description Language
JSON              NO       NO
XML               NO       YES
YAML              NO       NO
AVRO              YES      YES
Protocol Buffer   YES      YES
Thrift            YES      YES

What is Kafka Schema Registry?


A Kafka cluster comprises a group of distributed brokers (servers) that handle real-time streaming data. This continuously streaming data is stored and organized in Kafka topics on those brokers.

Furthermore, producers write or publish messages in Kafka topics, while consumers consume or read data from respective Kafka topics. Kafka receives and publishes messages as bytes from the input side (producer) to the output side (consumer).

No data verification or compatibility checks run in between to evaluate the nature of the incoming data.

In some rare cases, when a producer sends inappropriate or wrong data with an unsupported data format into the Kafka server, the downstream consumers will break or collapse when trying to read data from that specific topic. To eliminate such complexities, Confluent introduced the Kafka schema registry.

The Kafka schema registry runs as a separate service outside the Kafka brokers and handles the storage and distribution of schemas. It keeps a copy of every registered schema, and clients cache the schemas they have already looked up.

Before publishing messages to a Kafka server, a producer first communicates with the schema registry to check whether the schema, or data format, it intends to use is already registered.

If the schema is not registered, the producer can register it as a new schema, written in Avro's JSON-based schema format, and the schema registry assigns it a unique schema ID.

From then on, producers and consumers use the respective schema IDs to produce messages to and consume messages from the Kafka servers.
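The register-then-reference flow described above can be sketched in a few lines of Python. This is only an illustration, not Confluent's implementation: `InMemorySchemaRegistry`, `frame`, and `unframe` are hypothetical names, and the payload bytes stand in for Avro-encoded data. The framing follows the Confluent wire format, in which each message starts with a magic byte of 0 followed by the 4-byte big-endian schema ID.

```python
import json
import struct

MAGIC_BYTE = 0  # first byte of the Confluent wire format


class InMemorySchemaRegistry:
    """Toy stand-in for a schema registry: maps schemas to integer IDs."""

    def __init__(self):
        self._ids_by_schema = {}
        self._schemas_by_id = {}

    def register(self, schema: dict) -> int:
        """Return the existing ID for a schema, or assign the next free one."""
        canonical = json.dumps(schema, sort_keys=True)
        if canonical not in self._ids_by_schema:
            new_id = len(self._ids_by_schema) + 1
            self._ids_by_schema[canonical] = new_id
            self._schemas_by_id[new_id] = schema
        return self._ids_by_schema[canonical]

    def get_schema(self, schema_id: int) -> dict:
        """Look up a schema by the ID carried in a message."""
        return self._schemas_by_id[schema_id]


def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix a serialized payload with the magic byte and the schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload


def unframe(message: bytes):
    """Split a framed message into (schema_id, payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("unknown magic byte")
    return schema_id, message[5:]
```

A producer would call `register` once per schema and `frame` for every message; a consumer would call `unframe` and then `get_schema` to find out how to decode the payload. Because only the ID travels with each message, the schema itself is never repeated on the wire.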

Simplify Apache Kafka Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data is a No-code Data Pipeline that offers a fully managed solution to set up data integration from Apache Kafka and 100+ Data Sources (including 30+ Free Data Sources) and will let you directly load data to a Data Warehouse or the destination of your choice.

It will automate your data flow in minutes without writing a single line of code. Its fault-tolerant architecture makes sure that your data is secure and consistent. Hevo provides you with an efficient and fully automated solution to manage data in real-time and always have analysis-ready data.


Let’s look at some of the salient features of Hevo:

  • Fully Managed: It requires no management and maintenance as Hevo is a fully automated platform.
  • Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to transfer.
  • Real-Time: Hevo offers real-time data migration. So, your data is always ready for analysis.
  • Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
  • Scalable Infrastructure: Hevo has in-built integrations for hundreds of sources that can help you scale your data infrastructure as required.
  • Live Monitoring: Advanced monitoring gives you a one-stop view to watch all the activities that occur within a Data Pipeline.
  • Live Support: Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.

Installing and Configuring Schema Registry in Kafka

You can install the Kafka schema registry in two ways: installing and setting up a schema registry on the existing Apache Kafka Environment or installing the Confluent Platform package that includes the Kafka schema registry by default.

In the steps below, you will learn both ways of installing and configuring the Kafka schema registry.

Installing Schema Registry with Existing Kafka Setup

  • Before installing a Kafka schema registry alongside your existing Kafka environment, you need running Kafka and Zookeeper instances. Also make sure that Java 8 or later is installed on your local machine, because Kafka and Zookeeper run on the Java Virtual Machine (JVM).
  • Further, set the JAVA_HOME environment variable and add Java to your path so that your operating system can locate the Java utilities that Apache Kafka relies on.
  • Once these prerequisites are satisfied, you are ready to start the Kafka and Zookeeper instances.
  • First, start the Zookeeper instance. Open a new command prompt and execute the following command.
bin/zookeeper-server-start.sh config/zookeeper.properties
  • After executing the above command, the Zookeeper instance is started successfully. 
  • Then, open a new command prompt and execute the following command to start the Kafka server.
bin/kafka-server-start.sh config/server.properties
  • On executing the above commands, both the Kafka and Zookeeper instances are running successfully on your local machine. Ensure not to accidentally close or terminate the command prompts running the Kafka and Zookeeper instances.
  • Now, you are all set to install a schema registry with the existing Kafka setup.
  • Open a new command prompt and execute the following docker command.
docker run -p 8081:8081 -e  SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL=host.docker.internal:2181  -e SCHEMA_REGISTRY_HOST_NAME=localhost  -e SCHEMA_REGISTRY_LISTENERS=http://0.0.0.0:8081  -e SCHEMA_REGISTRY_DEBUG=true confluentinc/cp-schema-registry:5.3.2
  • The docker command runs a schema registry container that connects to your existing Kafka environment, where 8081 is the port on which the schema registry listens.
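Once the container is listening on port 8081, schemas can be registered over the schema registry's REST API by sending a POST request to `/subjects/{subject}/versions`. The sketch below only builds such a request with the Python standard library. The base URL `http://localhost:8081`, the subject name `transactions-value`, and the helper names are assumptions chosen for illustration.

```python
import json
import urllib.request

SCHEMA_REGISTRY_CONTENT_TYPE = "application/vnd.schemaregistry.v1+json"


def build_register_request(base_url: str, subject: str, schema: dict) -> urllib.request.Request:
    """Build (but do not send) a request that registers a schema under a subject."""
    # The REST API expects the schema itself as a JSON-encoded string
    # inside the JSON request body.
    body = json.dumps({"schema": json.dumps(schema)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/subjects/{subject}/versions",
        data=body,
        headers={"Content-Type": SCHEMA_REGISTRY_CONTENT_TYPE},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request to a running schema registry and decode the JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(build_register_request("http://localhost:8081", "transactions-value", schema))` against a running registry should return a small JSON object containing the ID assigned to the schema.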

Installing Schema Registry with the Confluent Platform Package 

  • Initially, you have to download and install the Confluent platform package from the official website of Confluent.
  • You can also download the Confluent platform package from the command prompt itself.
  • Open a new command prompt window, and enter the following command.
curl -O http://packages.confluent.io/archive/6.1/confluent-community-6.1.1.tar.gz
  • The above command downloads the Confluent platform archive, which contains the configuration files needed to install the Schema registry.
  • To extract the archive, enter the command given below.
tar xzvf confluent-community-6.1.1.tar.gz
  • In the following steps, you will configure the Zookeeper, Kafka, and Schema registry files. Initially, you can configure the Zookeeper instance. If you want to create a single-node configuration, the zookeeper configuration will not need any change or alterations.
  • If you need to change the file or directory path, enter the following command in a new command prompt. 
cd $CONFLUENT_HOME/etc/kafka
vi zookeeper.properties
  • Then, open the “server.properties” file from the unzipped files during installation. As shown below, make changes to the respective sections in the “server.properties” file to complete the configuration of the Kafka server. 
listeners=PLAINTEXT://<FQDN or IP of your host>:9092
listeners=PLAINTEXT://localhost:9092
zookeeper.connect=<FQDN or IP of your host>:2181
zookeeper.connect=localhost:2181
  • After making changes, save the respective file.
  • Now, you will configure the Kafka schema registry. Open the “schema-registry.properties” file and make the respective changes to the parameters as shown below.
listeners=http://<FQDN or IP of your host>:8081
listeners=http://localhost:8081
kafkastore.bootstrap.servers=PLAINTEXT://<FQDN or IP of your broker host>:9092
kafkastore.bootstrap.servers=PLAINTEXT://localhost:9092
mode.mutability=true
  • Save the respective file after making changes.
  • Now, start the Kafka server and Zookeeper instances in two terminals as you did in the previous method. 
  • After successfully starting and running the Kafka server and Zookeeper instances, you can start the Kafka schema registry. Open a new terminal and enter the following command.
bin/schema-registry-start etc/schema-registry/schema-registry.properties
  • On executing the above command, you will get a success message as “INFO Server started, listening for requests.”
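Before starting the services, it can help to confirm that the edited properties files contain the settings discussed above. The following is a minimal sketch of such a check: it parses simple `key=value` lines and reports missing entries. The `REQUIRED_KEYS` tuple and the function names are assumptions of this example, not an official list of mandatory settings.

```python
# Keys this sketch expects in schema-registry.properties (an assumption
# based on the settings edited in the steps above).
REQUIRED_KEYS = ("listeners", "kafkastore.bootstrap.servers")


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props: dict) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not props.get(key)]
```

For example, `missing_keys(parse_properties(open("etc/schema-registry/schema-registry.properties").read()))` returns an empty list when both settings are present.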

After executing the steps above, you have successfully configured and started the Kafka, Zookeeper, and Kafka schema registry instances. Now, you are all set to deploy and work with the Kafka schema registry to manage and organize the Kafka topic schemas.

Working with the Kafka Schema Registry

  • As you already started Kafka, Zookeeper, and Kafka schema registry instances separately, you can proceed with the further steps.
  • Since you already installed “Confluent Platform,” you can also start the Kafka cluster’s mandatory instances by the single command line as given below.
confluent local services start
  • The above command starts all the related Kafka services and reports the status of each one as it comes up.
  • Execute the commands given below to clone the confluentinc GitHub repository and navigate your terminal to clients/avro/, thereby getting access to work with Avro client.
git clone https://github.com/confluentinc/examples.git
cd examples/clients/avro
git checkout 7.0.1-post
  • As you pre-installed the Confluent platform, you have access to work with the Control Center web interface, which is a user-friendly and interactive UI. 
  • Navigate to the web interface by opening http://localhost:9021/.
  • In the steps below, you will use a topic named “transactions” for producing and consuming messages.
  • The home page of the Control Center web interface shows an overview of your Kafka clusters.
  • In the cluster overview panel, select the “Topics” option and click on the “Add a topic” button.
  • Enter the topic name as “transactions” and click on the “Create with defaults” button. Now, a new topic is successfully created and displayed under the “All topics” section.
  • Now, you have to define a basic schema for producing messages to and consuming messages from the Kafka topic. An Avro schema is written as JSON, that is, as a set of key-value pairs. To view the schema used in this example, run the command below.
cat src/main/resources/avro/io/confluent/examples/clients/basicavro/Payment.avsc
  • The command prints the Avro schema definition of the payment record.
  • The schema declares the message type as “record”, which is one of Avro's complex data types alongside enum, union, array, map, and fixed.
  • It also has a unique schema name that distinguishes it from the other schemas in the registry.
  • Using this schema, producers serialize each message into Avro format as it is produced, while consumers deserialize the Avro data of the messages they fetch from the Kafka server.
  • With the schema in place, you can produce messages to the topic as a producer and consume them as a consumer by subscribing to the topic, while the schema registry supplies the schema to both sides.
  • To check the process of how to write and consume messages from the schema registry, follow the link
  • Once your schema registry is actively collecting and distributing real-time messages, the actions are updated in the Control Center web interface. 
  • In the “All Topics” menu, navigate to transactions > Schema. You can find the latest schema of the schema registry for the respective Kafka topic called “transactions.”
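To make the role of the schema concrete, here is a small Python sketch that checks records against a record schema before they would be produced. The schema shown is an assumption modeled on the payment example (a string `id` and a double `amount`); compare it with the output of the `cat` command above before relying on it. The checker is deliberately minimal, handles only a few primitive types, and is not a replacement for a real Avro library.

```python
import json

# Assumed shape of the payment schema; verify against Payment.avsc.
PAYMENT_SCHEMA = json.loads("""
{
  "namespace": "io.confluent.examples.clients.basicavro",
  "type": "record",
  "name": "Payment",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
""")

# Mapping from a few Avro primitive types to Python types (illustrative only).
PYTHON_TYPES = {"string": str, "double": float, "int": int, "long": int, "boolean": bool}


def validation_errors(record: dict, schema: dict) -> list:
    """Return a list of problems found when checking a record against a schema."""
    errors = []
    for field in schema["fields"]:
        name, avro_type = field["name"], field["type"]
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], PYTHON_TYPES[avro_type]):
            errors.append(f"wrong type for {name}: expected {avro_type}")
    return errors
```

A record such as `{"id": "id-1", "amount": 1000.0}` passes with no errors, while a record whose `amount` is a string is rejected. This is the kind of mismatch the schema registry workflow is meant to catch before a malformed message reaches consumers.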

Efficient Deployment Strategies and Considerations for Schema Registry

  1. To deploy a Kafka cluster with a schema registry, you have to satisfy certain prerequisites before the cluster goes live. In particular, you must address the logistical, configuration, and post-deployment considerations before launching your cluster to production.
  2. Logistical considerations cover the hardware resources, such as memory, CPU, disks, network, and file system.

    These resources must be sized and configured correctly because the entire Kafka ecosystem relies on them to perform end-to-end data streaming.

    Logistical considerations also include the external software installations that let the Kafka ecosystem run smoothly without errors or bottlenecks.

    One such software consideration is a supported Java runtime (JVM, the Java Virtual Machine), which is an essential prerequisite for running the Kafka cluster and working with the Kafka schema registry.
  3. The configuration considerations include adjusting the basic setup, such as the Kafka server, Zookeeper instances, port numbers, hostnames, and listeners, to suit the production environment.

    If these settings are carried over unchanged from the testing phase into production, end-users might face issues and might not achieve maximum throughput via Kafka servers.
  4. The other important deployment consideration is backing up the Kafka topic that stores all the metadata about the schemas, subject/version IDs, and compatibility settings (by default, the topic named _schemas).

    If this metadata is not replicated or backed up anywhere, the schemas become inaccessible when an unexpected server failure occurs. To mitigate this issue, you can back up the metadata and restore it when the server fails.

    This lets consumers continue to read the Avro messages whose schemas are held in the schema registry.

These are some of the deployment strategies and considerations you should ensure before deploying your Kafka cluster to process and stream end-to-end real-time data.

Conclusion

In this article, you learned about Kafka schema registry and configuring Kafka schema registry with the Kafka clusters. This article mainly focused on configuring Kafka schema registry to stream real-time messages using the Control Center web interface that comes along with the Confluent platform installation.

The Kafka schema registry can also be managed entirely through the Confluent CLI or other command-line tools. In businesses, however, extracting complex data from a diverse set of Data Sources can be a challenging task, and this is where Hevo saves the day!


Hevo Data, with its strong integration with 100+ Sources (including Apache Kafka) and BI tools, allows you to not only export data from sources and load it into destinations, but also transform and enrich your data and make it analysis-ready, so that you can focus on your key business needs and perform insightful analysis using BI tools.

Give Hevo Data a try and sign up for a 14-day free trial today. Hevo offers plans and pricing for different use cases and business needs, so check them out!

Share your experience of installing and configuring the Kafka schema registry in the comments section below.
