Trillions of data records are handled and streamed through Kafka servers every day. However, Kafka servers only store and organize the data received from producers; they are not responsible for evaluating the quality and appropriateness of the incoming and outgoing data. This is where the Kafka Schema Registry comes in.

With the Kafka Schema Registry, only messages that conform to a supported, registered data format are produced into Kafka servers, allowing consumers to reliably fetch correct and well-formed data from the topics they have subscribed to. In this article, you will learn about the Kafka Schema Registry and how to install, configure, and work with it.

What is Kafka?

Kafka logo

Apache Kafka is a popular Distributed Data Streaming software that allows for the development of real-time event-driven applications. Being an open-source application, Kafka allows you to store, read, and analyze streams of data free of cost. Kafka is distributed, which means that it can run as a Cluster that spans multiple servers.

Leveraging its distributed nature, users can achieve high throughput, minimal latency, and high computation power, and can handle large volumes of data without any perceptible lag in performance.

Written in Scala and Java, Kafka ingests data from a large number of external Data Sources and organizes it into “Topics”. Kafka relies on two client roles, “Producers” and “Consumers”, to write, read, and process events.
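To make this concrete, here is a minimal sketch of creating a topic and then writing and reading events with the console tools that ship with Kafka. The topic name demo-events is just an example, and the commands assume a standard Kafka download with a single broker running on localhost:9092.

# create a topic to hold the events
bin/kafka-topics.sh --create --topic demo-events --bootstrap-server localhost:9092

# a Producer writes events typed on the console into the topic
bin/kafka-console-producer.sh --topic demo-events --bootstrap-server localhost:9092

# a Consumer reads the same events back, starting from the beginning of the topic
bin/kafka-console-consumer.sh --topic demo-events --from-beginning --bootstrap-server localhost:9092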

Prerequisites

To get started with the Kafka Schema Registry, fundamental knowledge of Apache Kafka and real-time data streaming is a must.

Simplify Apache Kafka Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data is a No-code Data Pipeline that offers a fully managed solution to set up data integration from Apache Kafka and 150+ Data Sources (including 60+ Free Data Sources) and will let you directly load data to a Data Warehouse or the destination of your choice.

Let’s look at some of the salient features of Hevo:

  • Fully Managed: It requires no management and maintenance as Hevo is a fully automated platform.
  • Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to transfer.
  • Real-Time: Hevo offers real-time data migration. So, your data is always ready for analysis.
  • Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
  • Scalable Infrastructure: Hevo has in-built integrations for hundreds of sources that can help you scale your data infrastructure as required.
  • Live Monitoring: Advanced monitoring gives you a one-stop view to watch all the activities that occur within a Data Pipeline.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Sign up here for a 14-day free trial!

Architecture

Kafka Architecture

In this article, you will go through the steps to implement a producer that takes payment records and puts them in a Kafka topic. After that, you will implement a consumer for consuming and collecting messages from the topic.

These collected messages will be exposed to another resource using Server-Sent Events. The payment records will be serialized and deserialized via Avro using Confluent’s Avro Serde, and the schema that describes the payment record will be stored in the Confluent Schema Registry.
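To illustrate where the Schema Registry fits into this architecture, here is a minimal sketch of the producer and consumer configuration when using Confluent’s Avro serializers; the property values assume a local broker on localhost:9092 and a Schema Registry on localhost:8081.

# producer side: serialize values with Avro and register/look up schemas in the Schema Registry
bootstrap.servers=localhost:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer
schema.registry.url=http://localhost:8081

# consumer side: deserialize values by fetching the writer's schema from the Schema Registry
bootstrap.servers=localhost:9092
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=io.confluent.kafka.serializers.KafkaAvroDeserializer
# return generated record classes (e.g. Payment) instead of generic Avro records
specific.avro.reader=true
schema.registry.url=http://localhost:8081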

Data Serialization Formats

Kafka Schema Registry supports multiple Data Serialization Formats. There are a few points that you should consider while choosing the right Data Serialization Format, and the table below compares common formats on two of them: whether the format is binary and whether it has a schema interface description language (IDL).

Name             | Binary | Schema Interface Description Language
JSON             | No     | No
XML              | No     | Yes
YAML             | No     | No
Avro             | Yes    | Yes
Protocol Buffer  | Yes    | Yes
Thrift           | Yes    | Yes

What is Kafka Schema Registry?

The Kafka Schema Registry resides outside the Kafka cluster itself and handles the distribution of schemas for the incoming data. It stores a copy of the schema of every incoming message in a local cache.

Kafka Schema Registry is a centrally managed registry that enforces the format, or schema, of the data produced and consumed in an Apache Kafka ecosystem. It supports Avro, JSON Schema, and Protobuf schemas, ensuring that the different producers and consumers writing to and reading from Kafka topics stay compatible with each other. By storing and versioning schemas, the Schema Registry prevents data inconsistencies and allows data formats to evolve safely over time, thus providing smooth serialization and deserialization of data.
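In practice, producers and consumers interact with the registry through its REST API. As an illustration, here are a few common calls sketched out, assuming a registry listening on http://localhost:8081 and a subject named transactions-value (the actual subject name depends on your topic and subject naming strategy):

# list all subjects that currently have registered schemas
curl http://localhost:8081/subjects

# register a new schema version under a subject (the Avro schema is passed as an escaped JSON string)
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema": "{\"type\":\"record\",\"name\":\"Payment\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"amount\",\"type\":\"double\"}]}"}' \
  http://localhost:8081/subjects/transactions-value/versions

# list the schema versions stored for that subject
curl http://localhost:8081/subjects/transactions-value/versions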

Installing and Configuring Schema Registry in Kafka

Part 1: Installing Schema Registry with Existing Kafka Setup

  • Before installing a Kafka Schema Registry alongside your existing Kafka environment, you need to start the Kafka and Zookeeper instances. You also have to make sure that Java 8 or a later version is installed and running on your local machine, because Kafka needs a recent Java environment to work properly.
  • Further, set up the JAVA_HOME environment variable and add Java to your PATH so that your operating system can locate the Java utilities, making Apache Kafka work with the JRE (Java Runtime Environment).
  • After the above-mentioned prerequisites are satisfied, you are now ready to start the Kafka and Zookeeper instances. 
  • Initially, you can start the Zookeeper instance. Open a new command prompt and execute the following command.
bin/zookeeper-server-start.sh config/zookeeper.properties
  • After executing the above command, the Zookeeper instance is started successfully. 
  • Then, open a new command prompt and execute the following command to start the Kafka server.
bin/kafka-server-start.sh config/server.properties
  • On executing the above commands, both the Kafka and Zookeeper instances are running successfully on your local machine. Ensure not to accidentally close or terminate the command prompts running the Kafka and Zookeeper instances.
  • Now, you are all set to install a schema registry with the existing Kafka setup.
  • Open a new command prompt and execute the following docker command.
docker run -p 8081:8081 \
  -e SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL=host.docker.internal:2181 \
  -e SCHEMA_REGISTRY_HOST_NAME=localhost \
  -e SCHEMA_REGISTRY_LISTENERS=http://0.0.0.0:8081 \
  -e SCHEMA_REGISTRY_DEBUG=true \
  confluentinc/cp-schema-registry:5.3.2
  • The docker command runs a Schema Registry instance against your existing Kafka environment, where 8081 is the port on which the Schema Registry listens.
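To confirm that the container started correctly, you can query its REST API; a minimal check, assuming the registry is reachable on localhost:8081, is shown below.

curl http://localhost:8081/subjects
# a fresh registry with no registered schemas typically returns an empty list: []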

Part 2: Installing Schema Registry with the Confluent Platform Package 

  • Initially, you have to download and install the Confluent platform package from the official website of Confluent.
  • You can also download the Confluent platform package from the command prompt itself.
  • Open a new command prompt window, and enter the following command.
curl -O http://packages.confluent.io/archive/6.1/confluent-community-6.1.1.tar.gz
  • The above command downloads the tarball of the Confluent platform that contains the configuration files needed to install the Schema Registry.
  • To extract the file, enter the command given below.
tar xzvf confluent-community-6.1.1.tar.gz
  • In the following steps, you will configure the Zookeeper, Kafka, and Schema Registry files. Initially, you can configure the Zookeeper instance. If you are creating a single-node setup, the Zookeeper configuration does not need any changes.
  • If you need to inspect or edit the file, enter the following commands in a new command prompt.
cd $CONFLUENT_HOME/etc/kafka
vi zookeeper.properties
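For reference, the single-node zookeeper.properties that ships with Kafka and the Confluent Platform typically contains only a few settings along these lines (the data directory may differ on your system):

# the directory where Zookeeper stores its snapshot data
dataDir=/tmp/zookeeper
# the port on which clients (the Kafka brokers) connect
clientPort=2181
# disable the per-IP connection limit for non-production setups
maxClientCnxns=0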
  • Then, open the “server.properties” file from the extracted files. As shown below, make changes to the respective sections in the “server.properties” file to complete the configuration of the Kafka server.
listeners=PLAINTEXT://<FQDN or IP of your host>:9092
listeners=PLAINTEXT://localhost:9092
zookeeper.connect=<FQDN or IP of your host>:2181
zookeeper.connect=localhost:2181
  • After making changes, save the respective file.
  • Now, you will configure the Kafka schema registry. Open the “schema-registry.properties” file and make the respective changes to the parameters as shown below.
listeners=http://<FQDN or IP of your host>:8081
listeners=http://localhost:8081
kafkastore.bootstrap.servers=PLAINTEXT://<FQDN or IP of your broker host>:9092
kafkastore.bootstrap.servers=PLAINTEXT://localhost:9092
mode.mutability=true
  • Save the respective file after making changes.
  • Now, start the Kafka server and Zookeeper instances in two terminals as you did in the previous method. 
  • After successfully starting and running the Kafka server and Zookeeper instances, you can start the Kafka schema registry. Open a new terminal and enter the following command.
bin/schema-registry-start etc/schema-registry/schema-registry.properties
  • On executing the above command, you will get a success message as “INFO Server started, listening for requests.”

After executing the steps, as mentioned above, you successfully configured and started Kafka, Zookeeper, and Kafka schema registry instances. Now, you are all set to deploy and work with Kafka schema registries to manage and organize the Kafka topic schemas.
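If you want to double-check that the registry is reachable before moving on, you can query its REST API; assuming the default listener on localhost:8081, the following call returns the global compatibility setting.

curl http://localhost:8081/config
# typically returns something like {"compatibilityLevel":"BACKWARD"}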


Working with the Kafka Schema Registry

  • As you have already started the Kafka, Zookeeper, and Kafka Schema Registry instances separately, you can proceed with the following steps.
  • Since you have already installed the Confluent Platform, you can alternatively start all of the Kafka cluster’s required services with the single command given below.
confluent local services start
  • The above command starts all the related services of the Confluent Platform, including Zookeeper, Kafka, the Schema Registry, and Control Center.
  • Execute the commands given below to clone the confluentinc GitHub repository and navigate your terminal to clients/avro/ so that you can work with the Avro client examples.
git clone https://github.com/confluentinc/examples.git
cd examples/clients/avro
git checkout 7.0.1-post
  • As you have pre-installed the Confluent platform, you also have access to the Control Center web interface, which is a user-friendly and interactive UI.
  • Navigate to the web interface at http://localhost:9021/.
  • In the below steps, you will use a topic named “transactions” for producing and consuming messages.
  • From the home page of the Control Center web interface, select your cluster to open the cluster overview panel.
  • In the cluster overview panel, select the “Topics” option and click on the “Add a topic” button.
add topic
  • Enter the topic name as “transactions” and click on the “Create with defaults” button. Now, a new topic is successfully created and displayed under the “All topics” section.
new topic
  • Now, you have to define a basic schema for producing and consuming messages to and from the Kafka topics. Usually, the schema resembles a key-value pair. You can view the example payment schema with the command below.
cat src/main/resources/avro/io/confluent/examples/clients/basicavro/Payment.avsc
  • The output is an Avro schema whose type is “record”, one of Avro’s complex data types alongside enum, union, array, map, and fixed.
  • The schema also has a unique name (and namespace) that distinguishes it from other schemas stored in the registry; a sketch of what this file typically contains is shown after this list.
  • Using this schema, producers serialize the payment records they publish as continuous real-time messages, while consumers deserialize the Avro data of the messages they fetch from the Kafka server.
  • Furthermore, you can write messages against this registered schema as a producer and consume them as a consumer by subscribing to the specific topic.
transactions
  • Once your producers and consumers are actively sending and receiving real-time messages, the activity is reflected in the Control Center web interface.
  • In the “All Topics” menu, navigate to transactions > Schema. There you can find the latest schema registered for the respective Kafka topic, “transactions”.
latest schema in transaction
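For reference, here is a minimal sketch of what the Payment.avsc schema in the Confluent examples repository typically looks like; treat the exact field list as illustrative rather than authoritative.

{
  "namespace": "io.confluent.examples.clients.basicavro",
  "type": "record",
  "name": "Payment",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}

If you prefer the command line to the Control Center, you can exercise the same produce/consume flow with the Avro console tools that ship with the Confluent Platform. The commands below are a sketch that assumes a broker on localhost:9092 and a Schema Registry on localhost:8081.

# produce Avro messages to the transactions topic, registering the schema on the fly
kafka-avro-console-producer --bootstrap-server localhost:9092 --topic transactions \
  --property schema.registry.url=http://localhost:8081 \
  --property value.schema='{"type":"record","name":"Payment","fields":[{"name":"id","type":"string"},{"name":"amount","type":"double"}]}'
# then type one JSON record per line, e.g. {"id": "pay-1", "amount": 99.95}

# consume and deserialize the same messages
kafka-avro-console-consumer --bootstrap-server localhost:9092 --topic transactions --from-beginning \
  --property schema.registry.url=http://localhost:8081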

Efficient Deployment Strategies and Considerations for Schema Registry

  • Prerequisites to Deployment: Ensure all the basic logistical, configuration, and post-deployment requirements are met before deploying the Kafka cluster with a Schema Registry.
  • Logistical Considerations: Verify that the hardware, particularly memory, CPU, disks, and network, can sustain the throughput of the Kafka ecosystem as data streams from one end to the other.
  • Logical Considerations: Make sure a supported version of Java (JVM), along with any other external applications Kafka depends on, is installed so that no errors are reported at run time.
  • Configuration Factors: Kafka servers should be configured with Zookeeper instances, ports, hostnames, and listeners in the production environment to achieve the maximum possible throughput.

Conclusion

In this article, you learned about the Kafka Schema Registry and how to configure it with Kafka clusters. The article mainly focused on configuring the Schema Registry to stream real-time messages using the Control Center web interface that comes with the Confluent Platform installation.

The Kafka Schema Registry can also be managed entirely through the Confluent CLI or other command-line tools. However, in businesses, extracting complex data from a diverse set of Data Sources can be a challenging task, and this is where Hevo saves the day!

Try a 14-day free trial and experience the feature-rich Hevo suite firsthand. Also, check out our unbeatable pricing to choose the best plan for your organization.


Ishwarya M
Technical Content Writer, Hevo Data

Ishwarya is a skilled technical writer with over 5 years of experience. With extensive experience working with B2B SaaS companies in the data industry, she channels her passion for data science into producing informative content that helps individuals understand the complexities of data integration and analysis.