Real-Time Redshift Analytics: 5 Simple Steps to Get Started

• March 24th, 2022

Redshift Analytics- Featured Image

Amazon Redshift, a massively popular Big Data Storage Solution, helps businesses accelerate their business operations and analytics through simple SQL operations. AWS Redshift offers a high-performance, cost-effective solution that has garnered the likes of Pfizer, McDonald’s, FOX Corporation, Amazon.com, Intuit, Yelp, Western Digital, and many more. Moreover, with Redshift features like Redshift Analytics, making sense of your data is now practically possible and easily achievable for everyone.

This guide explores two broad concepts- Real-time Data Streaming with Redshift and Redshift Analytics. For Data Streaming, we’ll take the help of Apache Kafka, since Redshift currently doesn’t offer such capabilities for its users. Later, we’ll explore some effective and highly useful AWS Services like AWS Glue, AWS QuickSight, and AWS SageMaker. 

Table of Contents

What Is Amazon Redshift?

Amazon Redshift Logo: Redshift Analytics
Image Source: Nightingale HQ

Amazon Redshift or AWS Redshift is a Cloud-based Serverless Data Warehouse provided by Amazon as a part of Amazon Web Services (AWS). It is a fully managed and cost-effective Data Warehousing Solution preferred by many businesses. AWS Redshift is designed to store petabytes of data and perform Real-time Redshift Analytics to generate actionable insights and help businesses flourish.

AWS Redshift is a column-oriented database that stores user data in a columnar format, compared to traditional databases that store in a row format. Amazon Redshift has its own compute engine to perform computing and generate critical insights. 

To learn more about AWS Redshift, do check out our comprehensive guide here – AWS Redshift Database in 2021: A Comprehensive Guide. We also have another helpful blog featuring best practices while using Amazon Redshift – AWS Redshift Best Practices for 2022.

AWS Redshift Architecture

AWS Redshift has a straightforward architecture. It contains a Leader Node and a cluster of Compute Nodes that perform Redshift analytics on user data. The below snapshot depicts the schematics of AWS Redshift architecture operating Redshift analytics. 

Amazon Redshift Architecture: Redshift Analytics
Image Source: Amazon Docs

AWS Redshift offers JDBC connectors to interact with client applications using major programming languages like Python, Scala, Java, Ruby, etc. Connections to such programming languages have been covered separately in the following guides:

Key Features of Amazon Redshift

  1. Redshift allows users to write queries and export their data back to a Data Lake.
  2. Redshift can seamlessly query the files like CSV, Avro, Parquet, JSON, ORC directly with the help of ANSI SQL.
  3. AWS Redshift offers exceptional support for Machine Learning, and developers can create, train and deploy Amazon SageMaker models using SQL.
  4. Redshift has an Advanced Query Accelerator (AQUA) which performs user queries ten times faster than what is offered by other Cloud Data Warehouses.
  5. Redshift’s Materialistic view allows you to achieve faster query performance for ETL, Batch Job Processing, and Dashboarding.
  6. AWS Redshift has a petabyte scalable architecture, and it can scale quickly as per user need.
  7. Redshift enables secure sharing of the data across Redshift Clusters.
  8. Amazon Redshift provides consistent fast performance, even with thousands of concurrent queries.
Simplify Amazon Redshift ETL with Hevo’s No-code Data Pipeline

A fully managed No-code Data Pipeline platform like Hevo Data helps you integrate data from 150+ Data Sources (including 40+ Free Data Sources) to a destination of your choice such as Amazon Redshift in real-time in an effortless manner.

Hevo with its minimal learning curve can be set up in just a few minutes allowing the users to load data without having to compromise performance. Its strong integration with umpteenth sources provides users with the flexibility to bring in data of different kinds, in a smooth fashion without having to code a single line. 

Get Started with Hevo for Free

Check out some of the cool features of Hevo:

  • Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
  • Real-Time Data Transfer: Hevo provides real-time Data Migration, so you can have analysis-ready data always.
  • 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
  • Scalable Infrastructure: Hevo has in-built integrations for 100+ sources that can help you scale your data infrastructure as required.
  • 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

What Is Streaming Data?

Data Streaming: Redshift Analytics
Image Source: Confluent

Streaming data is the data generated by Data Sources such as IoT Devices, Sensors, Telecasts, etc., and are small in size. By small, we mean that each Data Stream is pretty small but can be of heavy volume. For example, a Sensor can generate 1000s of records or log files (of kilobytes) per second.

Streaming data includes a variety of sources such as Mobile Devices, Logs generated from Jobs, Web Applications, Gaming Activities, Social Networks, Geospatial Data, etc. The data generated by these applications are very small (typically in kilobytes) and need to be processed sequentially and incrementally per record or on defined windows. These real-time data can be useful to perform a variety of analytics like aggregations, filtering, and sampling.

Data Analytics derived from such analysis provides information about server activity, website clicks, and geolocation of these devices to the companies, which they can use to grow their brand.

Real-time Streaming With Redshift

AWS Redshift does not come up with its Streaming Services, and hence Redshift needs to connect with an External Streaming Provider such as Kafka, Kinesis to get the real-time data that can be processed later in Redshift.

This blog post will provide information on streaming the data by using Confluent Kafka and then storing it in Redshift.

Prerequisites for Real-time Redshift Analytics and Streaming

This Real-time Redshift Analytics and Streaming tutorial will need the following from users: 

  • An active AWS account. AWS offers free tier services for new users for up to 12 months. You can get more details about the AWS free tier services here. If you’re new to AWS, then sign up here.
  • Confluent Kafka up and running. If you are new to Kafka, refer to this installation guide
  • A well-configured Kafka Topic on which streaming data can be produced and later consumed via Redshift.

Now that we have Redshift and Confluent Kafka up and running, let us look at the detailed step-by-step procedure to stream the data from Kafka to Redshift in this Real-time Redshift Analytics and Streaming guide.

Step 1: Create Redshift Instance

  1. Log into your AWS Management Console.
  2. From the search box at the top, search for Redshift and click to open the Redshift console. 
  3. Navigate to Clusters and click “Quick Launch Cluster”.
  4. Set the “Master User Password.” It would be best to write this master password somewhere for a later step.
  5. Click “Launch Cluster” to complete the setup.
  6. Wait for your cluster to be in the “available” state (typical wait time will be around 5 minutes).
  7. Create a database to store the data from Kafka.
  8. Copy Redshift Domain, Port, user_name, and password for later use in configuring Redshift connector.

Step 2: Install Redshift Connector in Kafka

Confluent Kafka offers a Redshift Sink Connector that can dump the data from Kafka Topics to Redshift tables. Installing a connector in Confluent Kafka is a straightforward process.

  1. On the Confluent Kafka UI, head out to the Connectors section and find the Amazon Redshift Sink connector.
Amazon Redshift Sink Connector: Redshift Analytics
Image Source: Amazon AWS
  1. Click on the Amazon Redshift Sink connector and fill out the requested details as below.
Add Amazon Redshift Sink Connector: Redshift Analytics
Image Source: Amazon AWS
  1. Configure the sink value with the following parameters:
Topicskafka_redshift_topic
NameRedshiftConnector
Kafka API key<kafka_key>
Kafka API secret<kafka_secret>
Input Message FormatAvro
Amazon Redshift Domain<redshift host name>
Amazon Redshift Port5439
Connection User<aws_user>
Connection Password<aws_user_password>
Database Namestreaming-data
Database Timezone<select your timezone>
Auto Create TableTrue
Auto Add ColumnsTrue
Tasks1
Source: AWS Blogs
  1. Once the details are filled in, click on Continue to proceed to the next step. Verify the details once again and hit Launch to deploy your Redshift Analytics and Streaming connector.
Launch Amazon Redshift Sink Connector: Redshift Analytics
Image Source: Amazon AWS
  1. It will take a few minutes to deploy the Redshift Sink (Redshift Analytics and Streaming) connector to Confluent Kafka.

Step 3: Produce a Record in Kafka

We now have a Redshift connector installed in Confluent Kafka and a Redshift instance in AWS. Let us try to produce a record in Kafka that will be streamed to Redshift.

To produce a record in Kafka, we will be using Kafka Avro Console Producer. Below are the steps to produce a record in Kafka Topics.

  1. Open the terminal and navigate to the Confluent installation directory.
  2. Run the below command to produce a record in Kafka topic. 
./bin/kafka-avro-console-producer 
--broker-list localhost:9092 --topic kafka_redshift_topic 
--property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"emp_id","type":"int"},{"name":"emp_name", "type": "string"}, {"name":"emp_dept", "type": "int"}, {"name":"emp_salary", "type": "float"}]}'
  1. Once you execute the above command, the console will wait for input. Provide the following input to the console.
{"emp_id": 101, "emp_name": "Harry", "emp_dept": 10, "emp_salary": 500}
  1. Once you hit enter, the Kafka Avro Console Producer will stream the data to Kafka Topic, and then the Redshift connector will consume the data and dump it into Redshift.
  2. Navigate to Redshift and find the table created with the name kafka_redshift_topic. Perform a query to view the data.

Redshift Analytics

Now that we have the streaming data in the Redshift, we can now perform Redshift analytics on the data by using other AWS services. There are various services in AWS which can read the data and enable Redshift analytics. Let us discuss some of these services.

AWS Glue

AWS Glue offers on-the-fly ETL capabilities, which means it can connect with your Data Sources and provide in-built ETL functions to transform and perform Redshift analytics on the data on the go.

AWS Glue offers seamless connections with the Redshift database with the help of the JDBC connector. Let us see how you can connect to the database in Redshift from AWS Glue.

Step 1: Log in to the AWS account and from the Services tab, search for AWS Glue.

Step 2: If you see Getting Started, click on it to launch the Glue console.

Step 3: From the left-hand pane on the Glue page, click on Database > Connections and then click on the Add connection button.

Add AWS Glue Connection: Redshift Analytics
Image Source: AWS Workshop

Step 4: Add the JDBC URL, username, password, and other details to set up the connection with Redshift.

Step 5: Once the connection is successful, you can now import the data from Redshift to AWS Glue and start analyzing them by running the Python scripts to perform Data Exploration and Redshift Analytics.

AWS QuickSight

AWS QuickSight is the BI Visualization Tool from AWS Stack. It is a powerful and easy-to-use visualization tool that offers various Data Sources to connect and fetch data from and create outstanding visuals.

Let us see how to connect AWS QuickSight for Data Visualization and Redshift Analytics:

Step 1: Log in to your AWS account and from the Services, search for AWS QuickSight.

Step 2: Click on Manage Data > New Dataset  > Redshift (Auto Discovered). The Redshift auto-discover feature will automatically discover the schemas and tables from Redshift connections.

Redshift QuickSight Connection: Redshift Analytics
Image Source: Redshift Analytics

Step 3: On the next screen, provide the requested information like username, password, data source name, database name, and then click on Create Data Source.

New Redshift Data Source QuickSight Connection: Redshift Analytics
Image Source: Redshift Analytics

Step 4: Once you create the Data Source, all the tables associated with the Redshift database will appear on the next screen. Choose the required table and click Select.

Step 5: Click on Save and Visualize to visualize the data in QuickSight.

Now you can use the various visualization tools and functions to create analytical visualization from the data.

AWS SageMaker

Amazon SageMaker is another service from AWS for Redshift analytics, which is fully managed by AWS. AWS SageMaker enables data scientists and developers to quickly and easily build, train, and deploy machine learning models at any scale. Once you have your data in Redshift, you can create AWS SageMaker notebooks to read data from Redshift to perform Data Exploration, Redshift Analytics, and training and building models out of them.

Let us see how to instantiate an AWS SageMaker Notebook to perform ML Redshift analytics on the Redshift data.

Step 1: Log in to your AWS account and from the Services, search for AWS SageMaker.

Step 2: On the Amazon SageMaker Console, select Notebook > Notebook Instances > Create Notebook Instances.

Create Amazon SageMaker Notebook Instance: Redshift Analytics
Image Source: Redshift Analytics

Step 3: Select Notebook Instance Type (typically ml.t2.large) and IAM role on the Create Notebook Instance page. Also, provide the Network type and click on Create Notebook Instance.

Specify Amazon SageMaker Notebook Instance Fields: Redshift Analytics
Image Source: Redshift Analytics

Step 4: Once the notebook is ready, Click Open Jupyter and create a new notebook.

Jupyter Notebook Instance: Redshift Analytics
Image Source: Redshift Analytics

Step 5: Once the notebook is created, open that directly by double-clicking on it and importing libraries to connect with Redshift.

Now that the connection is established from Redshift, you can perform Exploratory Redshift Analytics on the data and build ML models on it.

This brings us to the end of the Redshift Analytics and Streaming guide. We hope we were able to clarify AWS Redshift concepts and help you achieve your goals. Here are some more blogs to increase your Redshift knowledge:

Recommended Blogs

Conclusion

This blog post discussed methods you can use to stream the data from Kafka to Redshift and utilize Redshift to store the streaming data. With that data, you can easily perform operations like Data Transformations, Data Aggregation, Redshift Analytics and make it ready for downstream application use.

While using AWS Redshift Services are insightful, it is a hectic task to set up and manage the proper environment on a regular basis. Extracting and integrating several heterogeneous sources into your Data Warehouse like Amazon Redshift is also a big task. To make things easier, Hevo comes to your rescue. Hevo Data is a No-code Data Pipeline and has awesome 100+ pre-built integrations that you can choose from.

Visit our Website to Explore Hevo

Hevo can help you integrate your data from 150+ sources and load them into destinations like Amazon Redshift to analyze real-time data with a BI tool. It will make your life easier and Data Migration hassle-free. It is user-friendly, reliable, and a secure solution for your business needs. 

Want to take Hevo for a spin? Sign Up for a 14-day free trial and see the difference! You can also have a look at the unbeatable pricing that will help you decide the right plan for your business needs.

Share your experience of learning about Redshift Analytics and Data Streaming in the comments section below.

No Code Data Pipeline For Your Amazon Redshift Data Warehouse