Using Pandas Redshift: Analyzing and Visualizing Data Simplified 101

• March 29th, 2022


Redshift is Amazon Web Services' (AWS) Cloud Data Warehouse. It provides an efficient way for organizations to collect, store, and analyze their data using various Business Intelligence tools. Python modules offer a rich ecosystem for getting started quickly and integrating your systems effectively.

This article will demonstrate an easy way to analyze and visualize Redshift data, using Pandas for the analysis and Plotly’s public platform for displaying beautiful, interactive graphs in Python.


What is Pandas?


Pandas is an open-source Python Data Analysis and Manipulation library. It uses a data structure known as a DataFrame to analyze and alter two-dimensional data. In addition to Data Analysis and Manipulation, Pandas also includes Data Cleansing, Data Exploration, and Visualization tools.
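To make this concrete, here is a minimal, self-contained sketch of the DataFrame-centric workflow Pandas encourages (the column names and values are invented for illustration):

```python
import pandas as pd

# A small two-dimensional dataset held in a DataFrame
df = pd.DataFrame({
    "state": ["CA", "NY", "CA"],
    "sales": [100, 250, 75],
})

# Group-and-aggregate, the pattern used throughout this tutorial
totals = df.groupby("state")["sales"].sum()
print(totals)
```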

What is Amazon Redshift?


Amazon Redshift, also known as Amazon Web Services Redshift, is a fully managed, Cloud-based Data Warehouse offered by Amazon as part of Amazon Web Services (AWS). Many firms prefer it because it is a managed and cost-effective Data Warehousing Solution. AWS Redshift is designed to store petabytes of data and perform real-time analytics to provide actionable insights and help businesses grow.

In contrast to typical databases, which store data in rows, AWS Redshift is a column-oriented database that saves user data in a columnar format. To perform computation and generate vital insights, Amazon Redshift uses its own compute engine.
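The row-versus-column distinction can be sketched in plain Python. The layouts below are illustrative only, but they show why columnar storage suits analytics: aggregating one column touches a single contiguous sequence of values instead of scanning every record.

```python
# Row-oriented layout: each record's fields are stored together
rows = [
    {"userid": 1, "state": "CA", "likesports": True},
    {"userid": 2, "state": "NY", "likesports": False},
]

# Column-oriented layout: each column's values are stored together
columns = {
    "userid": [1, 2],
    "state": ["CA", "NY"],
    "likesports": [True, False],
}

# An analytical aggregate reads just one list in the columnar layout
likes = sum(columns["likesports"])
print(likes)  # 1
```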

Check out our entire guide, AWS Redshift Database in 2021: A Comprehensive Guide, for more information on AWS Redshift. AWS Redshift Best Practices for 2022 is another useful blog that highlights best practices for utilizing Amazon Redshift.

Simplify Data Analysis with Hevo’s No-code Data Pipeline!

Hevo Data is a No-code Data Pipeline that offers a fully managed solution to set up Data Integration for 100+ Data Sources (including 40+ Free sources) and will let you directly load data from sources to a Data Warehouse or the Destination of your choice like Redshift. It will automate your data flow in minutes without writing any line of code. Its fault-tolerant architecture makes sure that your data is secure and consistent. Hevo provides you with a truly efficient and fully automated solution to manage data in real-time and always have analysis-ready data. 

Get Started with Hevo for Free

Let’s look at some of the salient features of Hevo:

  • Fully Managed: It requires no management and maintenance as Hevo is a fully automated platform.
  • Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to transfer. 
  • Real-Time: Hevo offers real-time data migration. So, your data is always ready for analysis.
  • Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
  • Connectors: Hevo supports 100+ Integrations to SaaS platforms, FTP/SFTP, Files, Databases, BI tools, and Native REST API & Webhooks Connectors. It supports various destinations, including Data Warehouses such as Google BigQuery, Amazon Redshift, Snowflake, Firebolt, and Databricks; Amazon S3 Data Lakes; and Databases such as MySQL, SQL Server, TokuDB, MongoDB, and PostgreSQL, to name a few.
  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Live Monitoring: Advanced monitoring gives you a one-stop view to watch all the activities that occur within Data Pipelines.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Sign up here for a 14-Day Free Trial!

Importing Libraries

In this tutorial, the data you will use is Amazon’s sample Redshift data. Although you will not be using JDBC/ODBC, you will be using the psycopg2 package together with SQLAlchemy and Pandas to simplify querying and analyzing the data.

from __future__ import print_function 

import plotly.plotly as py  # note: in newer Plotly releases this module has moved to the chart_studio package
import plotly.graph_objs as go
import plotly.tools as tls
import pandas as pd
import os
import requests

Connecting Pandas Redshift

To access your Redshift instance, you need the Redshift Endpoint URL. An endpoint will look something like this: 

datawarehouse.some_chars_here.region_name.redshift.amazonaws.com. 

Once you’ve configured your cluster, connecting to Redshift is straightforward. Include the Username, Password, Port, Host, and Database name in this configuration.

redshift_endpoint = os.getenv("REDSHIFT_ENDPOINT")
redshift_user = os.getenv("REDSHIFT_USER")
redshift_pass = os.getenv("REDSHIFT_PASS")
port = 5439
dbname = 'dev'

Redshift Databases can be accessed in several ways, and two are listed below: the SQLAlchemy package, and the psycopg2 package for more direct access.

You can execute SQL queries and fetch results with either, but SQLAlchemy is a bit more convenient for returning data directly as a Pandas DataFrame. Additionally, Plotly integrates tightly with Pandas, making it easy to create interactive graphs and share them with your team.
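As a quick sketch of that convenience, pd.read_sql_query accepts any SQLAlchemy engine. Here an in-memory SQLite engine stands in for the Redshift engine built below, so the example can run without a cluster; with Redshift you would pass the engine created from the postgresql+psycopg2 connection string instead.

```python
import pandas as pd
from sqlalchemy import create_engine

# SQLite stands in for Redshift here purely so the snippet is runnable
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.exec_driver_sql("create table users (userid int, state text)")
    conn.exec_driver_sql("insert into users values (1, 'CA'), (2, 'NY')")

# One line: run SQL, get a DataFrame back
df = pd.read_sql_query(
    "select state, count(*) as n from users group by state order by state",
    engine)
print(df)
```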

SQLAlchemy

from sqlalchemy import create_engine
engine_string = "postgresql+psycopg2://%s:%s@%s:%d/%s" \
    % (redshift_user, redshift_pass, redshift_endpoint, port, dbname)
engine = create_engine(engine_string)

Psycopg2

import psycopg2
conn = psycopg2.connect(
    host=redshift_endpoint,
    user=redshift_user,
    port=port,
    password=redshift_pass,
    dbname=dbname)
cur = conn.cursor() # create a cursor for executing queries

1) Loading the Data

Here is how to load data from Amazon’s sample database for the Pandas Redshift connection. You can skip this section if you are working with your own data. 

Start loading the data

cur.execute("""drop table if exists users;
drop table if exists venue;
drop table if exists category;
drop table if exists date;
drop table if exists event;
drop table if exists listing;
drop table if exists sales;""")
conn.commit()
 
aws_key = os.getenv("AWS_ACCESS_KEY_ID") # Required for accessing S3 samples
aws_secret = os.getenv("AWS_SECRET_ACCESS_KEY")

base_copy_string = """copy %s from 's3://awssampledbuswest2/tickit/%s.txt' 
credentials 'aws_access_key_id=%s;aws_secret_access_key=%s' 
delimiter '%s';""" # This will be our base COPY string

# Tables and the corresponding S3 data files we will COPY into Redshift
tables = ["users", "venue", "category", "date", "event", "listing"]
data_files = ["allusers_pipe", "venue_pipe", "category_pipe", "date2008_pipe", "allevents_pipe", "listings_pipe"]
delimiters = ["|", "|", "|", "|", "|", "|"]

#these are COPY statements we'll be using for loading the data;
copy_statements = []
for tab, f, delim in zip(tables, data_files, delimiters):
    copy_statements.append(base_copy_string % (tab, f, aws_key, aws_secret, delim))

# add in Sales data, which is tab-delimited
copy_statements.append("""copy sales from 's3://awssampledbuswest2/tickit/sales_tab.txt' 
credentials 'aws_access_key_id=%s;aws_secret_access_key=%s' 
delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS';""" % (aws_key, aws_secret))
 
# Creating Table Statements
cur.execute("""
create table users(
	userid integer not null distkey sortkey,
	username char(8),
	firstname varchar(30),
	lastname varchar(30),
	city varchar(30),
	state char(2),
	email varchar(100),
	phone char(14),
	likesports boolean,
	liketheatre boolean,
	likeconcerts boolean,
	likejazz boolean,
	likeclassical boolean,
	likeopera boolean,
	likerock boolean,
	likevegas boolean,
	likebroadway boolean,
	likemusicals boolean);

create table venue(
	venueid smallint not null distkey sortkey,
	venuename varchar(100),
	venuecity varchar(30),
	venuestate char(2),
	venueseats integer);

create table category(
	catid smallint not null distkey sortkey,
	catgroup varchar(10),
	catname varchar(10),
	catdesc varchar(50));

create table date(
	dateid smallint not null distkey sortkey,
	caldate date not null,
	day character(3) not null,
	week smallint not null,
	month character(5) not null,
	qtr character(5) not null,
	year smallint not null,
	holiday boolean default('N'));

create table event(
	eventid integer not null distkey,
	venueid smallint not null,
	catid smallint not null,
	dateid smallint not null sortkey,
	eventname varchar(200),
	starttime timestamp);

create table listing(
	listid integer not null distkey,
	sellerid integer not null,
	eventid integer not null,
	dateid smallint not null sortkey,
	numtickets smallint not null,
	priceperticket decimal(8,2),
	totalprice decimal(8,2),
	listtime timestamp);

create table sales(
	salesid integer not null,
	listid integer not null distkey,
	sellerid integer not null,
	buyerid integer not null,
	eventid integer not null,
	dateid smallint not null sortkey,
	qtysold smallint not null,
	pricepaid decimal(8,2),
	commission decimal(8,2),
	saletime timestamp);""")
 
for copy_statement in copy_statements: # execute each COPY statement
    cur.execute(copy_statement)
conn.commit()
 
for table in tables + ["sales"]:
    cur.execute("select count(*) from %s;" % (table,))
    print(cur.fetchone())
conn.commit() # Ensure that the data has been processed and commit our statements.

Pandas Redshift connection is complete. You can now start running queries on the data you loaded into Redshift.

2) Analysis and Visualization

The next step of the Pandas Redshift process is to look for insights. As a first step, you’ll look at some of the users’ preferences and habits. With Pandas, it is easy to query the database and get the results back as a DataFrame. In this case, you will retrieve user preferences: what kinds of events are users interested in?

df = pd.read_sql_query("""
SELECT sum(likesports::int) as sports, sum(liketheatre::int) as theatre,  
sum(likeconcerts::int) as concerts, sum(likejazz::int) as jazz, 
sum(likeclassical::int) as classical, sum(likeopera::int) as opera,  
sum(likerock::int) as rock, sum(likevegas::int) as vegas,  
sum(likebroadway::int) as broadway, sum(likemusicals::int) as musical, 
state
FROM users 
GROUP BY state
ORDER BY state asc;
""", engine)

A) Plotting Heatmap Graph

Now that the DataFrame has been returned, let’s make a quick heatmap using Plotly.

data = [
        go.Heatmap(
            z = df.drop('state', axis=1).values,
            x = df.drop('state', axis=1).columns,
            y = df.state,
            colorscale = 'Hot'
        )
    ]
layout = go.Layout(title="State and Music Tastes", yaxis=dict(autotick=False, dtick=1))
py.iplot(go.Figure(data=data, layout=layout), filename='redshift/state and music taste heatmap', height=1000)

Producing this graph is straightforward, and exploring it is even simpler. From this graph you can easily see that sports events don’t seem very popular with the users, and that certain states show stronger preferences (and possibly a larger number of users) than others.
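The same reading can be checked numerically. This toy DataFrame only mimics the shape of the preference query’s result (the numbers are invented), but the pattern of dropping the state column, summing each preference, and ranking is how you would confirm what the heatmap shows:

```python
import pandas as pd

# Invented numbers mimicking the preference query's output shape
df = pd.DataFrame({
    "sports": [3, 1],
    "rock": [10, 12],
    "opera": [5, 4],
    "state": ["CA", "NY"],
})

# Total declared likes per category, most popular first
totals = df.drop("state", axis=1).sum().sort_values(ascending=False)
print(totals.index[0])  # rock tops this toy sample
```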

B) Plotting Box Plots

Next, make some box plots based on these user preferences.

layout = go.Layout(title="Declared User Preference Box Plots",
                yaxis=dict())

data = []
for pref in df.drop('state', axis=1).columns:
    # for every preference type, make a box plot
    data.append(go.Box(y=df[pref], name=pref))

py.iplot(go.Figure(data=data, layout=layout), filename='redshift/user preference box plots')

Sports seem to be a little bit compressed compared to other activities. Maybe there are fewer sports enthusiasts, or our company doesn’t host many sporting events.

Following this quick exploration of the customers, you have found a sports anomaly. Should there be fewer sports events listed? Do users simply not like sports events? Is there a difference in the number of sales between event types?

C) Plotting Simple Bar Chart

To understand the situation a bit better, visualize the information in a simple bar chart.

df = pd.read_sql_query("""
SELECT sum(event.catid) as category_sum, catname as category_name
FROM event, category
where event.catid = category.catid
GROUP BY category.catname
""", engine)
 
layout = go.Layout(title="Event Categories Sum", yaxis=dict(title="Sum"))
data = [go.Bar(x=df.category_name, y=df.category_sum)]
py.iplot(go.Figure(data=data, layout=layout))

As soon as you started researching this data, you found a discrepancy between the users’ preferences and the types of events you’re hosting! You need to report this to management! Let’s take a closer look at the events you are listing. Are there any event types trending upward?

df = pd.read_sql_query("""
SELECT sum(sales.qtysold) as quantity_sold, date.caldate  
FROM sales, date
WHERE sales.dateid = date.dateid 
GROUP BY date.caldate 
ORDER BY date.caldate asc;
""", engine)
 
layout = go.Layout(title="Event Sales Per Day", yaxis=dict(title="Sales Quantity"))
data = [go.Scatter(x=df.caldate, y=df.quantity_sold)]
py.iplot(go.Figure(data=data, layout=layout))

The results seem to be inconclusive, except for the seasonality of our events. Unfortunately, the aggregate graph does not show much here, so it would probably be worthwhile to explore each category a bit more.

df = pd.read_sql_query("""
SELECT sum(sales.qtysold) as quantity_sold, date.caldate, category.catname as category_name  
FROM sales, date, event, category
WHERE sales.dateid = date.dateid 
AND sales.eventid = event.eventid
AND event.catid = category.catid
GROUP BY date.caldate, category_name
ORDER BY date.caldate asc;
""", engine)

You always need to figure out what kind of graph will communicate your message most effectively. Subplots sometimes work well, but sometimes it’s best to plot them all together. Both are easy with Plotly!

data = []
for count, (name, g) in enumerate(df.groupby("category_name")):
    data.append(go.Scatter(
            name=name,
            x=g.caldate,
            y=g.quantity_sold,
            xaxis='x' + str(count + 1),
            yaxis='y' + str(count + 1)
        ))

fig = tls.make_subplots(rows=2,cols=2)
fig['layout'].update(title="Event Sales Per Day By Category")
fig['data'] += data
py.iplot(fig)

Your plot grid is formatted as follows:

[ (1,1) x1,y1 ] [ (1,2) x2,y2 ]
[ (2,1) x3,y3 ] [ (2,2) x4,y4 ]

While the above subplots seem to tell a compelling story, it should be noted that axes aren’t always aligned with subplots. It would be useful to plot them all together, with lines indicating the categories.

data = []
for name, g in df.groupby("category_name"):
    data.append(go.Scatter(
            name=name,
            x=g.caldate,
            y=g.quantity_sold
        ))

fig = go.Figure()
fig['layout'].update(title="Event Sales Per Day By Category")
fig['data'] += data
py.iplot(fig, filename='redshift/Event Sales Per Day by Category')

This communicates the scenario much more clearly, and looks better too. You seem to have quite a few events throughout the year, with the exception of a flurry of musicals around March. 

Conclusion 

Python’s rich ecosystem of modules allows you to integrate your systems more efficiently and get up to speed quickly. For example, you can easily create beautiful visualizations and analyze data using Pandas Redshift, plotly, and the SQLAlchemy toolkit. In case you want to export data from various sources into your desired Database/destination like Redshift, then Hevo Data is the right choice for you! 

Visit our Website to Explore Hevo

Hevo Data, a No-code Data Pipeline provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of Desired Destinations, with a few clicks. Hevo Data with its strong integration with 100+ sources (including 40+ free sources) allows you to not only export data from your desired data sources & load it to the destination of your choice such as Amazon Redshift, but also transform & enrich your data to make it analysis-ready so that you can focus on your key business needs and perform insightful analysis using BI tools.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Share your experience of learning about Pandas Redshift! Let us know in the comments section below!
