Steps to Stream Data from Kafka to Redshift: A Comprehensive Guide

on Data Integration • November 19th, 2020 • Write for Hevo

Introduction

The Kafka Connect Amazon Redshift Sink connector is used to poll data from Kafka into Redshift. In this article let us go through the steps to use the connector to export data from Kafka to Redshift. Here is an outline of this article:

Hevo, A Simpler Alternative to Integrate your Data for Analysis

Hevo offers a faster way to move data from databases or SaaS applications into your data warehouse to be visualized in a BI tool. Hevo is fully automated and hence does not require you to code.

Check out some of the cool features of Hevo:

  • Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
  • Real-time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
  • 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
  • Scalable Infrastructure: Hevo has in-built integrations for 100+ sources that can help you scale your data infrastructure as required.
  • 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Security: Hevo is SOC II, GDPR, and HIPPA compliant. Hevo also enables top-grade security with end-to-end encryption, two-factor authentication, and more.
  • Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.

You can try Hevo for free by signing up for a 14-day free trial.

Apache Kafka

Apache Kafka is an event streaming platform by Apache Software Foundation (that was originally developed by LinkedIn as Kafka). It handles unified, high throughput, low latency real-time data feeds. Kafka is based on a distributed commit log and from there you can subscribe and publish data to multiple real-time applications. You can write, read, store, and process events using Kafka. 

Kafka provides high-level functions for transformations, aggregations, joins, windowing, etc., to process event streams. Event streaming is used in a wide variety of industries that require real-time access to data, ranging from banking and stock exchange to hospitals and factories.

Kafka Connect

Kafka Connect is an open-source component of Kafka, that is used to connect Kafka with external systems. 

The Kafka Connect Amazon Redshift Sink connector lets you export/stream data from Kafka topics to Redshift by polling data from Kafka and writing it to the Redshift database. It supports Avro, JSON schema, or Protobuf input data formats.

kafka to redshift
Image via AWS

Prerequisites

  • We will use the Kafka Connect Amazon Redshift Sink connector to export data from Kafka topics to the Redshift database. To run Kafka  Connect Amazon Redshift Sink connector, you need:
  • Confluent platform 3.3.0 or above, or Kafka 0.11.0 or above
  • Java 1.8
  • INSERT access privilege (at least)

Steps to Stream Data from Kafka to Redshift

Here are the steps to stream data from Kafka to Redshift.

Step 1: Install the Redshift Connector

  • You can install the latest version of the Redshift connector by running the following command on the Confluent Platform installation directory. 
confluent-hub install confluentinc/kafka-connect-aws-redshift:latest

Step 2: Install the Redshift JDBC Driver

  • The connector needs a JDBC driver to connect to Redshift. The JDBC 4.0 driver JAR file that comes with AWS SDK needs to be placed on each Connect worker node. Download the Redshift JDBC Driver here and place the JAR file into this directory:
share/confluent-hub-components/confluentinc-kafka-connect-aws-redshift/lib
  • Now restart all of the Connect worker nodes.

Step 3: Create an Amazon Redshift Instance

Make sure you have the required permissions to create and administer Redshift instances.

  • Log into AWS Management Console and navigate to Redshift.
  • Go to Clusters and select “Quick Launch Cluster”.
  • Set the “Master User Password” and Click “Launch Cluster”

In a few minutes, your cluster will be available.

Step 4: Load the Connector

  • Create a properties file for your Amazon Redshift Sink Connector.
name=redshift-sink
confluent.topic.bootstrap.servers=localhost:9092
confluent.topic.replication.factor=1
connector.class=io.confluent.connect.aws.redshift.RedshiftSinkConnector
tasks.max=1
topics=orders
aws.redshift.domain=< Required Configuration >
aws.redshift.port=< Required Configuration >
aws.redshift.database=< Required Configuration >
aws.redshift.user=< Required Configuration >
aws.redshift.password=< Required Configuration >
pk.mode=kafka
auto.create=true
  • Enter your cluster configuration parameters as they are in your Cluster Details.
  • Load the redshift-sink connector.
confluent local services connect connector load redshift-sink --config redshift-sink.properties
  • If you don’t use CLI, use the following command:
 <path-to-confluent>/bin/connect-standalone 
<path-to-confluent>/etc/schema-registry/connect-avro-standalone.properties 
redshift-sink.properties
  • Your output will look like this:
{
  "name": "redshift-sink",
  "config": {
    "confluent.topic.bootstrap.servers": "localhost:9092",
    "connector.class": "io.confluent.connect.aws.redshift.RedshiftSinkConnector",
    "tasks.max": "1",
    "topics": "orders",
    "aws.redshift.domain": "cluster-name.cluster-id.region.redshift.amazonaws.com",
    "aws.redshift.port": "5439",
    "aws.redshift.database": "dev",
    "aws.redshift.user": "awsuser",
    "aws.redshift.password": "your-password",
    "auto.create": "true",
    "pk.mode": "kafka",
    "name": "redshift-sink"
  },
  "tasks": [],
  "type": "sink"
}

Step 5: Produce a Record in Kafka

  • Produce a record in the orders topic:
./bin/kafka-avro-console-producer 
--broker-list localhost:9092 --topic orders 
--property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"id","type":"int"},{"name":"product", "type": "string"}, {"name":"quantity", "type": "int"}, {"name":"price", "type": "float"}]}'
  • Enter the following record in the terminal and click Enter:
{"id": 999, "product": "foo", "quantity": 100, "price": 50}
SELECT * from orders;

Conclusion

You have seen how you can use the Kafka Connect Redshift Sink Connector to stream data from Kafka to Redshift. But if you are looking for an automatic tool that will free you from all the hassle, try Hevo Data.

Hevo is a No-Code Data Pipeline as a service. You can instantly start moving your data in real-time from any source to Redshift or other data destinations. Even if your data changes in the future in the form of new tables, columns, or data types, Hevo manages the schema. It is easy to set-up, requires minimal maintenance, and is extremely secure. 

Take Hevo for a spin to see if you will like it! You can try Hevo for free here.

No-code Data Pipeline for your Data Warehouse