Are you trying to access and query your Amazon Redshift data using Python? Well, you have landed on the right article. Amazon Redshift has become one of the most popular Data Warehousing and Analytics platforms for many businesses. After you have consolidated your data from all your sources into Redshift, you’ll need to analyze it to gain important business insights. You can easily access your data by setting up a Python Redshift connection.

Python is a popular open-source programming language with libraries for advanced statistical operations and data analysis. By setting up the Python Redshift connection, you can query your data and visualize it by generating graphs & charts using Python libraries.

This article will walk you through the process of setting up a Python Redshift connection using 3 popular methods. It will also provide a brief overview of Python and Amazon Redshift in the sections that follow.

Prerequisites

You will have a much easier time understanding the methods for setting up the Python Redshift connection if you are familiar with the following:

  • An active AWS account.
  • Working knowledge of Python and SQL.
  • Working knowledge of Databases and Data Warehouses.
  • Clear idea regarding the type of data to be transferred.

What is Amazon Redshift?


Amazon Redshift is a fast, fully managed cloud Data Warehouse that makes it simple & cost-effective to analyze all of the data using standard SQL and existing Business Intelligence (BI) tools. The Redshift architecture is made up of a number of computing resources known as Nodes, which are then grouped into Clusters.

The major benefit of Redshift is its great scalability and quick query processing, which has made it one of the most popular Data Warehouses even today. By adding Nodes in just a few clicks, you can easily scale up your storage and processing performance needs using the AWS Console or Cluster APIs. It’s that simple.

Download the Cheatsheet on How to Set Up High-performance ETL to Redshift
Learn the best practices and considerations for setting up high-performance ETL to Redshift

Key Features of Amazon Redshift

Developers at Amazon have continuously improved Redshift over the years. Here are some of the eye-catching features that make Amazon Redshift a popular choice:

  • Scalability: Giving you an edge over traditional Data Warehouses, Redshift allows On-demand horizontal petabyte scalability. With just a few clicks on Amazon Web Services Console or via the Cluster API, you can scale up or down according to your business requirements. You can leverage the Amazon S3 data lake to store unlimited data in open data formats. 
  • Top-notch Performance: Performance-intensive workloads can be handled seamlessly by RA3 instances, which deliver up to 3 times better price performance than the alternatives. At no extra cost, RA3 instances include the Advanced Query Accelerator (AQUA) cache to return faster query results for large datasets. For repeated queries, Amazon delivers the result immediately from the saved cache.
  • Flexible Pricing: If you are just starting out, you can opt for hourly plans and scale to long-term plans later on. The scaling costs are also minimal for its customers. For scaling, you can use the Concurrency Scaling Credits that you earn every day from the clusters. You also get a choice between Dense Storage Nodes and Dense Compute Nodes. This allows you to optimally select the resources for your Workload and get the best price performance.  
  • Fully Managed: It is a completely managed service with all the mundane administrative tasks automated. Features such as Automatic Vacuum Delete, Automatic Table Sort, and Automatic Analyze reduce manual effort, thereby providing high-class query performance with no delays. The built-in machine learning capabilities apply complex algorithms to improve query speed dynamically.
  • Reliability: Redshift is extremely fault-tolerant. Cluster health is continuously monitored and automatic data replication from defective storage units is carried out to prevent any data loss. Data is automatically backed up in Amazon S3. Within minutes you can recover your data in a few clicks on AWS Management Console or by using the Redshift APIs. 
  • Secure: Amazon Redshift is compliant with SOC1, SOC2, SOC3, and PCI DSS Level 1 requirements. Your data is guarded at all times with AES-256 encryption at rest and SSL encryption in transit. You can also set access permissions on columns and rows for individuals and groups. You get more control as Redshift allows you to configure firewall settings and isolate your Data Warehouse cluster using Amazon Virtual Private Cloud (VPC).

As your number of data sources grows, extracting data from them into Redshift can become a tedious and resource-intensive job. You have to spend a portion of your Engineering bandwidth to build, monitor, and maintain all your data connectors for a seamless, zero-data-loss process.

A more economical and effortless option is to employ a Cloud-based ETL tool like Hevo Data that allows you to load data into an Amazon Redshift Data Warehouse. The ingested data is first staged in Hevo’s S3 bucket before it is batched and loaded to Amazon Redshift. You can check out the Redshift documentation to know how Hevo simplifies the whole process.

What is Python?


Python is one of the most widely used programming languages today, popular across Web Development, Data Engineering, Data Science, & Machine Learning. It provides a ton of libraries that can be applied in a range of fields, including Data Engineering, Data Science, Artificial Intelligence, and many more. Pandas, NumPy, and SciPy are some examples of prominent Python packages.

Python allows you to work more quickly and integrate systems more efficiently. It also offers interactive debugging and testing of code snippets, as well as access to all major commercial databases. Additionally, it has a large and growing worldwide community, with major tech giants including Google, Facebook, Netflix, and IBM relying on it.

Key Features of Python

Since its inception, Python has become a popular choice for several tasks due to the following eye-catching features:

  • Interpreted Language: Instead of compiling the whole program into machine instructions ahead of time, Python code is read and executed by an interpreter, which runs and displays the output of one statement at a time.
  • Dynamically Typed: Unlike statically-typed languages like Java, Python doesn’t require you to declare the data type of a variable in advance. The interpreter automatically determines the data type at runtime.
  • Graphical User Interface(GUI) Support: You can easily create GUIs using Python. Modules in Python such as Tkinter, PyQt, wxPython, or Pyside can be used to achieve this. You also get to enjoy a huge number of GUI frameworks and various other cross-platform solutions.
  • Object-Oriented Programming Language: Providing you with a platform to solve real-world problems using the Object-Oriented Approach, Python allows you to implement the concepts of Encapsulation, Inheritance, Polymorphism, etc.
  • Flexible: Python integrates easily with C and C++; you can write performance-critical parts in C/C++ and call them from Python. Python is also completely compatible with Windows, Mac, and Linux, so code written on Windows generally doesn’t need to change for other platforms.

To know more about Python, visit this link.

Why Connect Python to Redshift?

Setting up the Python Redshift connection can assist you in simplifying tasks such as:

  • Easily access your Amazon Redshift data using Python and execute queries for your business analysis.
  • After querying the data via the Python Redshift connection, employ the powerful statistical techniques available in Python to gain important insights from your Amazon Redshift Data.
  • Use Python libraries such as Numpy to perform advanced numerical computing.
  • With visually stunning & informative graphs, you can visualize your data using Python libraries such as pandas and matplotlib.

Simplify ETL to Redshift Using Hevo’s No-code Data Pipeline

Hevo Data, an Automated No Code Data Pipeline can help you ETL your data swiftly from 150+ other sources (including 40+ free sources) to Redshift in real-time without any assistance from your side, independently or on the go, without writing a single line of code.

You can entrust us with your data transfer process and enjoy a hassle-free experience. This way, you can focus more on Data Analysis, instead of data consolidation.

ETL your data to Redshift for Free

Methods to Set Up Python Redshift Connection

Here, you will be looking at how to connect to Redshift using Python. The main purpose of this article is to implement the following functionalities:

  • Connecting to the Redshift Data Warehouse instance & querying the data.
  • Loading data into NumPy & Pandas (two popular Python data analysis libraries).

Amazon Redshift is based on PostgreSQL, hence most of the features between Redshift & PostgreSQL are similar. Despite similarities, there are significant differences when it comes to query execution & data storage schema. One main difference lies in the architecture of these databases as PostgreSQL is a traditional row-oriented relational database while Redshift is a columnar database more suited for data analytics & hence more appropriate for Data Warehousing.

These differences exist keeping in mind the basic purpose for which Amazon Redshift has been designed i.e. Online Analytical Processing (OLAP) & Business Intelligence (BI) Applications which require complex queries against large datasets. To know more about the architecture of the Redshift Data Warehouse, you can see the official documentation.

In this article, you will get to know the 3 popular methods of setting up a Python Redshift connection:

Method 1: Python Redshift Connection using Python psycopg Driver

Psycopg is the most popular PostgreSQL database adapter for the Python programming language. Its main features are the complete implementation of the Python DB API 2.0 specification and thread safety (several threads can share the same connection). It was designed for heavily multi-threaded applications that create and destroy lots of cursors and make a large number of concurrent “INSERT”s or “UPDATE”s.

Follow these easy steps given below to get started with the Python Redshift connection:

  • Step 1: You can install the psycopg driver using the following command:
pip install psycopg2

Further details on psycopg can be found at the official link. Note that in many environments it is simpler to install the precompiled wheel with pip install psycopg2-binary, since building psycopg2 from source requires the PostgreSQL client libraries.

  • Step 2: The next step in setting up a Python Redshift connection is to define the standard database access parameters, as shown in the configuration dictionary below.
import psycopg2

config = {'dbname': 'database_name',
          'host': 'redshift_endpoint',
          'port': 'redshift_port',
          'user': 'user_name',
          'password': 'user_password'}
  • Step 3: You can now wrap the above configuration into a function so that any connection errors are handled properly.
def create_conn(*args, **kwargs):
    config = kwargs['config']
    try:
        conn = psycopg2.connect(dbname=config['dbname'],
                                host=config['host'],
                                port=config['port'],
                                user=config['user'],
                                password=config['password'])
    except Exception as err:
        print(err)
        return None
    return conn

You can read about the *args (non-keyword arguments) & **kwargs (keyword arguments) parameters in more detail at the following link. In short, they are used when the number of input arguments is not fixed in advance.
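
For a quick illustration, here is a minimal sketch (not part of the connection code) of how *args and **kwargs collect positional and keyword arguments:

def show_args(*args, **kwargs):
    # args is a tuple of positional arguments, kwargs a dict of keyword arguments
    print(args)    # e.g. (1, 2)
    print(kwargs)  # e.g. {'config': {'dbname': 'database_name'}}

show_args(1, 2, config={'dbname': 'database_name'})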

  • Step 4: Finally, you need to finish the Python Redshift connection part & retrieve all the tables in the search path. This snippet completes the Python Redshift connection process & queries the Redshift database.
def select(*args, **kwargs):
    # needs a connection with dbname=<username>_db
    cur = kwargs['cur']
    try:
        # retrieving all tables in the search_path
        cur.execute("""SELECT DISTINCT tablename FROM pg_table_def;""")
    except Exception as err:
        print(err)
        return
    rows = cur.fetchall()
    for row in rows:
        print(row)

print('start')
conn = create_conn(config=config)
cursor = conn.cursor()
print('start select')
select(cur=cursor)
print('finish')

Executing Queries on Redshift Data using psycopg

Once the Python Redshift connection has been set up, you can start querying your Redshift data by following the simple steps given below:

  • Step 1: You will have to use the cursor that was obtained in the last step of the Python Redshift connection by writing the following command:
cursor = conn.cursor()
  • Step 2: After getting the cursor in the cursor variable, you can use it to execute SQL queries to fetch data.
cursor.execute("SELECT * FROM table_name;")
  • Step 3: After executing the query, you can use psycopg to fetch the results from the database. To fetch the complete result set, you can simply run this command.
rows = cursor.fetchall()

You can also try out the following Psycopg Update Query. Keep in mind that psycopg2 does not autocommit by default, so call conn.commit() after an UPDATE or INSERT for the change to persist.

cursor.execute("UPDATE table_name SET price=%s WHERE id=%s", (uPrice, uId))
conn.commit()
  • Step 4: Finally, the cursor & the Python Redshift connection can be closed by using the following commands.
cursor.close() 
conn.close()
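
Putting the pieces together, here is a minimal end-to-end sketch of Method 1. The host, database, and table names are placeholders you would replace with your own, and it uses psycopg2’s context-manager support so the transaction is committed (or rolled back) and the cursor is closed automatically.

import psycopg2

config = {'dbname': 'database_name', 'host': 'redshift_endpoint',
          'port': 5439, 'user': 'user_name', 'password': 'user_password'}

conn = psycopg2.connect(**config)
try:
    # the connection context manager wraps the block in a transaction
    with conn:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM table_name LIMIT 10;")
            for row in cur.fetchall():
                print(row)
finally:
    conn.close()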

Method 2: Python Redshift Connection using Python ODBC Driver

This is another way of setting up a Python Redshift connection using ODBC Driver. Follow the simple steps given below to get started with the Python Redshift connection:

  • Step 1: The ODBC Driver needs to be installed and configured first. Once you have the ODBC Driver configured, you can use Python commands to set up the Python Redshift connection.
import pyodbc

cnxn = pyodbc.connect('DRIVER={ODBC Driver for Amazon Redshift};'
                      'Server=myserver;Database=mydatabase;'
                      'User_ID=myuser_id;Password=my_password;'
                      'Port=myport;String Types=Unicode')
  • Step 2: After this, you can run a simple query to test your Python Redshift connection. Below is an example of an insert query that adds a record to a table (remember to commit, since pyodbc’s autocommit is off by default).
cursor = cnxn.cursor()
cursor.execute("INSERT INTO CITY (CNAME, CPLACE, CPOSTNO) VALUES ('Karachi', 'Port Sea', 'JCX12')")
cnxn.commit()

Similarly, you can execute all the other queries mentioned above to further verify the Python Redshift connection; a simple SELECT example is sketched below.
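
For instance, a minimal read query through the same pyodbc connection might look like the following (the CITY table and its columns come from the insert above; everything else is a placeholder):

# fetch and print the rows inserted earlier
cursor = cnxn.cursor()
cursor.execute("SELECT CNAME, CPLACE, CPOSTNO FROM CITY")
for row in cursor.fetchall():
    print(row.CNAME, row.CPLACE, row.CPOSTNO)
cursor.close()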

When talking about data analysis in Python, NumPy & Pandas are the most popular libraries.

  • NumPy: a Python library used mainly for numerical computing.
  • Pandas: a widely used data analysis library that provides a high-performance data structure known as the DataFrame.

Data Loading into NumPy

It’s a very simple process. Use the following commands. 

import numpy as np

# cursor is the psycopg2/pyodbc cursor on which a SELECT has already been executed
data = np.array(cursor.fetchall())
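
Once the result set is in a NumPy array, you can apply NumPy’s numerical routines directly. A small hypothetical example, assuming the query returned a numeric price column in the second position:

# convert the (hypothetical) price column to floats and compute summary statistics
prices = data[:, 1].astype(float)
print(prices.mean(), prices.min(), prices.max())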

Data Loading into Pandas

Similarly, you can use Pandas for further data analysis, but the data loading process is a bit different compared to NumPy.

from sqlalchemy import create_engine
import pandas as pd

# user_name:user_password@redshift_host:<port_no>/mydatabase are placeholders for your cluster details
engine = create_engine('postgresql://user_name:user_password@redshift_host:<port_no>/mydatabase')
data_frame = pd.read_sql('SELECT * FROM table_name;', engine)

Once the data is loaded, it can be analyzed using further Python libraries & methods. With this, you have successfully set up a Python Redshift connection.
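
Since visualization was one of the reasons for connecting Python to Redshift in the first place, here is a small sketch of charting the loaded DataFrame with matplotlib. The column names (order_date, revenue) are hypothetical placeholders for whatever your query returns.

import matplotlib.pyplot as plt

# plot a hypothetical revenue column against a hypothetical order_date column
data_frame.plot(x='order_date', y='revenue', kind='line')
plt.title('Revenue over time')
plt.xlabel('Order date')
plt.ylabel('Revenue')
plt.show()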

Method 3: Python Redshift Connector by AWS

Another way to set up the Python Redshift connection is by using the Redshift Connector for Python provided by Amazon Web Services. Along with seamless integration with pandas and NumPy, this connector works with numerous Amazon Redshift features, including the ones listed below (an IAM-based connection is sketched right after the list):

  • IAM authentication
  • Identity provider (IdP) authentication
  • Redshift-specific data types
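
As a rough illustration of the IAM authentication support, a connection using temporary, IAM-issued database credentials might look like the sketch below. The cluster identifier, database, user, and region are placeholders, and it assumes your AWS credentials are already available to the driver (for example, via the default credential chain).

import redshift_connector

# connect with IAM-issued temporary credentials instead of a database password
conn = redshift_connector.connect(
    iam=True,
    cluster_identifier='examplecluster',
    database='mydatabase',
    db_user='AmazonUser',
    region='us-west-1'
)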

To use the Redshift connector for setting up the Python Redshift connection, follow the simple steps given below:

  • Step 1: To get started with the Python Redshift connection, you first need to install the redshift_connector package. You can install it directly from PyPI with pip install redshift_connector, or build it from source by cloning this repository.
$ git clone https://github.com/aws/amazon-redshift-python-driver.git
$ cd amazon-redshift-python-driver
$ pip install .
  • Step 2: Following the DB-API specification, autocommit is off by default. You can toggle it on using the autocommit property of the connection. While performing this operation, ensure that you are not inside an open transaction.
con.rollback()
con.autocommit = True
con.run("VACUUM")
con.autocommit = False
  • Step 3: You can configure the parameter style of the cursor using cursor.paramstyle. It is set to format by default; qmark, numeric, named, format, and pyformat styles are supported.
# qmark
redshift_connector.paramstyle = 'qmark'
sql = 'insert into foo(bar, jar) VALUES(?, ?)'
cursor.execute(sql, (1, "hello world"))

# numeric
redshift_connector.paramstyle = 'numeric'
sql = 'insert into foo(bar, jar) VALUES(:1, :2)'
cursor.execute(sql, (1, "hello world"))

# named
redshift_connector.paramstyle = 'named'
sql = 'insert into foo(bar, jar) VALUES(:p1, :p2)'
cursor.execute(sql, {"p1":1, "p2":"hello world"})

# format
redshift_connector.paramstyle = 'format'
sql = 'insert into foo(bar, jar) VALUES(%s, %s)'
cursor.execute(sql, (1, "hello world"))

# pyformat
redshift_connector.paramstyle = 'pyformat'
sql = 'insert into foo(bar, jar) VALUES(%(bar)s, %(jar)s)'
cursor.execute(sql, {"bar": 1, "jar": "hello world"})

You can consider the following basic example for setting up the Python Redshift connection using the Redshift connector.

import redshift_connector

# Connects to Redshift cluster using AWS credentials
conn = redshift_connector.connect(
    host='examplecluster.abc123xyz789.us-west-1.redshift.amazonaws.com',
    database='mydatabase',
    user='AmazonUser',
    password='new_password'
 )

cursor: redshift_connector.Cursor = conn.cursor()
cursor.execute("create Temp table employee(firstname varchar,lastname varchar)")
cursor.executemany("insert into employee (firstname, lastname) values (%s, %s)",
                    [
                        ('John', 'Smith'),
                        ('Mike', 'Tatum')
                    ]
                  )
cursor.execute("select * from employee")

result: tuple = cursor.fetchall()
print(result)
>> (['John', 'Smith'], ['Mike', 'Tatum'])
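
Because the connector integrates with pandas and NumPy, you can also pull query results straight into those structures. Below is a short sketch, assuming the same cursor and employee table as above and using the fetch_dataframe and fetch_numpy_array helpers described in the connector’s documentation.

# re-run the query and fetch the results as a pandas DataFrame
cursor.execute("select * from employee")
df = cursor.fetch_dataframe()
print(df)

# re-run the query and fetch the results as a NumPy array
cursor.execute("select * from employee")
arr = cursor.fetch_numpy_array()
print(arr)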

Conclusion

This article provides an introduction to Python and Amazon Redshift. Moreover, you learned about 3 methods that you can use to connect to Redshift using Python. Setting up Python Redshift Integration can help you to access and query your Amazon Redshift data with ease. However, loading data from any source to Redshift manually is a tough nut to crack. You will have to write a complex custom script from scratch and invest a lot of time and resources. Furthermore, such a method for Python Redshift connection will require high maintenance and regular debugging.

Visit our Website to Explore Hevo

Hevo Data provides an Automated No-code Data Pipeline that empowers you to overcome the above-mentioned limitations. You can leverage Hevo to seamlessly transfer data from various sources to Redshift in real-time without writing a single line of code. Hevo’s Data Pipeline enriches your data and manages the transfer process in a fully automated and secure manner.

Hevo caters to 150+ data sources (including 40+ free sources) and can directly transfer data to Data Warehouses or any other destination of your choice in a hassle-free manner. It will make your life easier and data migration seamless.

Want to take Hevo for a ride? Sign Up for a 14-day free trial and simplify your Data Integration process. Do check out the pricing details to understand which plan fulfills all your business needs.

Share your experience of setting up a Python Redshift connection in the comments section below!

Muhammad Faraz
Technical Content Writer, Hevo Data

Muhammad Faraz is an AI/ML and MLOps expert with extensive experience in cloud platforms and new technologies. With a Master's degree in Data Science, he excels in data science, machine learning, DevOps, and tech management. As an AI/ML and tech project manager, he leads projects in machine learning and IoT, contributing extensively researched technical content to solve complex problems.
