Working with Boto3 Redshift SDK: Made Easy

By: Published: December 30, 2021

Boto3 Redshift- Featured Image

Big Data storage solutions like Amazon Redshift can be used for a variety of purposes. Amazon Redshift can help businesses accelerate their business operations and analytics through simple SQL operations. Redshift is a high-performance and cost-effective solution used by prominent companies like Pfizer, McDonald’s, FOX Corporation, Amazon.com, Intuit, Yelp, Western Digital, etc. Moreover, Gartner, Forrester, IDC, and many more have named Amazon Redshift a market leader in Data Warehousing Solutions.

While Amazon Redshift is built around industry-standard SQL, by no means it is confined to just SQL. Users can connect, access, and query their data in Redshift using Python, one such method being the boto3 Redshift client support. Boto3 Redshift SDK is a Python SDK that makes it easy to use the Python programming language to manipulate AWS resources and automation infrastructure. Read on to find out more about the boto3 Redshift Example, Query and Client. 

Table of Contents

What is the Boto3 Redshift SDK?

Boto3 is the name of AWS SDK for Python. It enables you to link your Python application or script or library with AWS Services. It provides native support in Python 2.7+ and 3.4+. Boto3 Redshift SDK provides two levels of APIs: 

  • Client (low-level) APIs: Client APIs map the underlying HTTP API operations one to one. 
  • Resource APIs: Resource APIs provide resource objects and collections to access attributes and perform actions. Boto3 Redshift also provides a feature called ‘waiters’ that keeps polling an AWS resource for a pre-defined status change. For example, after you start an EC2 instance, a waiter can wait until it reaches the ‘running’ state. You can read more about waiters for boto3 Redshift here.

Note: You can find more information on Redshift Data APIs here- Working with Redshift Data APIs Simplified 101.

Another offering of boto3 Redshift SDK is paginators. Some AWS operations return partial/incomplete results and require subsequent requests to get the entire result. An example is a list_objects operation on S3. It can return a maximum of 1000 objects and subsequent requests are required to get the complete list. You can read more about paginators here.

With the introduction to both AWS Redshift and boto3 Redshift SDK done, let’s see how to use boto3 for performing operations on Redshift.

Prerequisites for Boto3 Redshift SDK Setup

  • Proficiency with Python is a must. 
  • An AWS account with a provisioned Redshift cluster is necessary. You are eligible for a 2 month free trial of Redshift if you have never created a cluster in the past. You can find more details here.
  • Familiarity with Redshift will help.
  • In this article, we will run the Python queries on a lambda function. You are free to run it elsewhere (EC2 for example). You should know the setup required for the platform on which you plan to run your queries (for example you should know how to create/deploy a lambda function if you wish to run the query on lambda).
  • It is expected that you are familiar with the creation of IAM policies and the addition of those policies to IAM roles.
Simplify Amazon Redshift ETL with Hevo’s No-code Data Pipeline

A fully managed No-code Data Pipeline platform like Hevo Data helps you integrate data from 100+ Data Sources (including 40+ Free Data Sources) to a destination of your choice such as Amazon Redshift in real-time in an effortless manner. Hevo with its minimal learning curve can be set up in just a few minutes allowing the users to load data without having to compromise performance. Its strong integration with umpteenth sources provides users with the flexibility to bring in data of different kinds, in a smooth fashion without having to code a single line. 

GET STARTED WITH HEVO FOR FREE

Check out some of the cool features of Hevo:

  • Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
  • Real-Time Data Transfer: Hevo provides real-time Data Migration, so you can have analysis-ready data always.
  • 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
  • Scalable Infrastructure: Hevo has in-built integrations for 100+ sources that can help you scale your data infrastructure as required.
  • 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.
SIGN UP HERE FOR A 14-DAY FREE TRIAL!

Setting Up Boto3 Redshift SDK

Boto3 has an extensive API reference for Redshift which you can find here. As you can see from the article, the API reference contains the low-level Client API, Paginators, and Waiters. We will not go over all of the possible Client Methods, Paginators, and Waiters in this tutorial. But we’ll cover just enough to get you started with the boto3 API reference for Redshift and make you feel at ease.

Getting Started with Boto3 Redshift SDK

It all starts with creating a client. It is quite simple. See the code snippet below:

import boto3
client = boto3.client('redshift')

While a statement like the above will work on lambda, you may need to provide additional arguments on other platforms.

client = boto3.client('redshift', region_name = 'us-east-1', aws_access_key_id = KEY, aws_secret_access_key = SECRET)

Next, you need to make sure that you have the ID of the cluster of your interest. If you are unsure, go to the Redshift Console, click on ‘Clusters’ from the left menu, and scroll down to find the list of clusters. The name of the cluster is the cluster-ID.

Redshift Clusters: Boto3 Redshift
Image Source: Self

Create the Policy and Attach it to the Lambda Role

In this tutorial, we will make the boto3 Redshift API calls using a lambda function. You can modify/skip this part if you plan to make the calls through some other platform. 

Go to the IAM console, and create a policy. Now, to get started very quickly, you can skip this step and simply use the AWS-managed AmazonRedshiftFullAccess policy. However, you may want to give some restricted Redshift access to your lambda, in which case, you should create your own policy. You can use the visual editor or edit the JSON directly. For this tutorial, to keep things simple, we will provide lambda the read and list permissions to all Redshift resources. 

Read and List Permissions: Boto3 Redshift
Image Source: Self

The JSON version of the policy looks like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeHsmConfigurations",
                "redshift:ListDatabases",
                "redshift:DescribeClusterTracks",
                "redshift:DescribeEvents",
                "redshift:DescribeUsageLimits",
                "redshift:DescribeDefaultClusterParameters",
                "redshift:DescribeEventCategories",
                "redshift:DescribeClusterSubnetGroups",
                "redshift:DescribeQuery",
                "redshift:DescribeReservedNodeOfferings",
                "redshift:DescribeSnapshotSchedules",
                "redshift:DescribeDataSharesForProducer",
                "redshift:DescribeClusterVersions",
                "redshift:DescribeClusterSnapshots",
                "redshift:DescribeSavedQueries",
                "redshift:DescribeLoggingStatus",
                "redshift:DescribeStorage",
                "redshift:DescribeTableRestoreStatus",
                "redshift:DescribeClusterParameters",
                "redshift:DescribeClusterSecurityGroups",
                "redshift:DescribeEventSubscriptions",
                "redshift:DescribeOrderableClusterOptions",
                "redshift:DescribeHsmClientCertificates",
                "redshift:DescribeAccountAttributes",
                "redshift:ViewQueriesInConsole",
                "redshift:DescribeScheduledActions",
                "redshift:DescribeTags",
                "redshift:DescribeClusterParameterGroups",
                "redshift:DescribeClusterDbRevisions",
                "redshift:FetchResults",
                "redshift:ViewQueriesFromConsole",
                "redshift:ListTables",
                "redshift:ListSchemas",
                "redshift:DescribeSnapshotCopyGrants",
                "redshift:DescribeDataSharesForConsumer",
                "redshift:DescribeAuthenticationProfiles",
                "redshift:DescribeReservedNodes",
                "redshift:DescribeNodeConfigurationOptions",
                "redshift:ListSavedQueries",
                "redshift:DescribeClusters",
                "redshift:DescribeDataShares",
                "redshift:DescribeResize",
                "redshift:GetReservedNodeExchangeOfferings",
                "redshift:DescribeTable"
            ],
            "Resource": "*"
        }
    ]
}

You can restrict the policy to specific resources (recommended), and also provide write permissions if required. 

Once the IAM policy is created, attach it to the IAM role of your lambda function. Even if you are using any AWS-managed policy instead of creating your own, don’t forget to attach it to the IAM role of your lambda function.

Make your First Boto3 Redshift API Call

Let’s call the describe_clusters() operation. It will, as the name suggests, describe the clusters for us.

Replace the lambda code with the following:

import json
import boto3

def lambda_handler(event, context):
    # TODO implement
    client = boto3.client('redshift')
    response = client.describe_clusters(
        MaxRecords=30
    )
    print(response)
    return json.loads(json.dumps(response, indent=4, sort_keys=False, default=str))

Add the name of your cluster-ID in place of ‘your-cluster-id’. Deploy it and trigger the lambda. You should see a response like the one below:

{
  "Clusters": [
    {
      "ClusterIdentifier": "redshift-cluster-1",
      "NodeType": "dc2.large",
      "ClusterStatus": "available",
      "ClusterAvailabilityStatus": "Available",
      "MasterUsername": "awsadminuser",
      "DBName": "dev",
      "Endpoint": {
        "Address": "redshift-cluster-1.abcdefgh.your_region.redshift.amazonaws.com",
        "Port": 5439
      },
      "ClusterCreateTime": "2021-12-08 16:46:06.078000+00:00",
      "AutomatedSnapshotRetentionPeriod": 1,
      "ManualSnapshotRetentionPeriod": -1,
      "ClusterSecurityGroups": [],
      "VpcSecurityGroups": [
        {
          "VpcSecurityGroupId": "sg-0dbfe08092eccxyzw",
          "Status": "active"
        }
      ],
      "ClusterParameterGroups": [
        {
          "ParameterGroupName": "default.redshift-1.0",
          "ParameterApplyStatus": "in-sync"
        }
      ],
      "ClusterSubnetGroupName": "default",
      "VpcId": "vpc-02ed9c6d864dwwwwww",
      "AvailabilityZone": "us-east-2c",
      "PreferredMaintenanceWindow": "thu:05:30-thu:06:00",
      "PendingModifiedValues": {},
      "ClusterVersion": "1.0",
      "AllowVersionUpgrade": true,
      "NumberOfNodes": 1,
      "PubliclyAccessible": false,
      "Encrypted": false,
      "ClusterPublicKey": "ssh-rsa public_key Amazon-Redshiftn",
      "ClusterNodes": [
        {
          "NodeRole": "SHARED",
          "PrivateIPAddress": "xxx.xx.xx.xxx",
          "PublicIPAddress": "yy.yyy.yyy.yy"
        }
      ],
      "ClusterRevisionNumber": "33904",
      "Tags": [],
      "EnhancedVpcRouting": false,
      "IamRoles": [],
      "MaintenanceTrackName": "current",
      "DeferredMaintenanceWindows": [],
      "NextMaintenanceWindowStartTime": "2021-12-16 05:30:00+00:00",
      "AvailabilityZoneRelocationStatus": "disabled",
      "ClusterNamespaceArn": "arn:aws:redshift:your_region:your_account:namespace:aae9c14b-e6f5-427f-9b03-d76d1wwwwwwww",
      "TotalStorageCapacityInMegaBytes": 400000,
      "AquaConfiguration": {
        "AquaStatus": "disabled",
        "AquaConfigurationStatus": "auto"
      }
    }
  ],
  "ResponseMetadata": {
    "RequestId": "e44461b7-46e1-400e-ad3c-bfa7ba410377",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amzn-requestid": "e44461b7-46e1-400e-ad3c-bfa7ba410377",
      "content-type": "text/xml",
      "content-length": "3803",
      "vary": "accept-encoding",
      "date": "Tue, 14 Dec 2021 17:05:55 GMT"
    },
    "RetryAttempts": 0
  }
}

As you can see, all the details pertaining to your clusters (parameter group, namespace, endpoint, etc.) are returned. I had just one cluster in Redshift, so the ‘Clusters’ array had only one item. 

In the API reference, the output can be restricted using either the ClusterIdentifier argument (in which case only the description of that specific cluster is returned) or using the TagKeys or TagValues argument (in which case only the clusters having matching keys or values will be returned).

Boto3 Redshift API Call Example with Pagination

We’ve asked for a maximum of 30 records in the above API call. However, if there are more clusters to be returned, you can use pagination. The corresponding paginator for this method can be found here.

A paginated version of the above query would look like this:

import json
import boto3

def lambda_handler(event, context):
    # TODO implement
    responses = []
    client = boto3.client('redshift')
    paginator = client.get_paginator('describe_clusters')
    response = paginator.paginate(
            PaginationConfig={
                'MaxItems': 20,
                'PageSize': 20,
            }
        ).build_full_result()
    responses.append(response)
        
    while ('NextToken' in response):
        response = paginator.paginate(
            PaginationConfig={
                'MaxItems': 20,
                'PageSize': 20,
                'StartingToken': response['NextToken']
            }
        ).build_full_result()
        responses.append(response)
    return json.loads(json.dumps(responses, indent=4, sort_keys=False, default=str))

As you can see, the paginator is run initially without a starting token. We then check if the response contains the ‘NextToken’ key. If it does, that means that there are more results yet to be fetched. Therefore, we run the paginator again, till there is no ‘NextToken’ key in the response.

If you just have a single cluster, and yet want to see the pagination in action, you can add more than 20 tags to that cluster, and then run pagination on describe_tags

Boto3 Redshift API Call Example with Waiter

As described in the introduction, a waiter polls an API till a successful response is received. For Redshift, four waiters are available currently in boto3: 

  • ClusterAvailable, 
  • ClusterDeleted, 
  • ClusterRestored, and 
  • SnapshotAvailable. 

Let us consider the ClusterAvailable waiter. This polls the describe_clusters() method every 60 seconds until it receives a success. It returns an error after 30 failed checks. This is useful for waiting till the cluster becomes available and then performing further actions. The lambda code for the waiter will look like this:

import json
import boto3

def lambda_handler(event, context):
    # TODO implement
    client = boto3.client('redshift')
    waiter = client.get_waiter('cluster_available')
    
    waiter.wait(
        ClusterIdentifier='redshift-cluster-1',
        MaxRecords=20,
        WaiterConfig={
            'Delay': 30,
            'MaxAttempts': 5
        }
    )
        
    response = client.describe_clusters()

    #perform other actions
    
    return json.loads(json.dumps(response, indent=4, sort_keys=False, default=str))

As you can see, the waiter doesn’t return anything. It just introduces a delay till the required condition is satisfied (the availability of the ‘redshift-cluster-1’ cluster in this case). We can configure the waiter by specifying the delay between successive attempts, and the maximum number of attempts in the WaiterConfig argument.

In case the waiter doesn’t get a successful response even after maximum attempts are exceeded, the lambda function returns a runtime error like the one below:

{
  "errorMessage": "Waiter ClusterAvailable failed: Max attempts exceeded",
  "errorType": "WaiterError",
  "stackTrace": [
    [
      "/var/task/lambda_function.py",
      15,
      "lambda_handler",
      "'MaxAttempts': 2"
    ],
    [
      "/var/runtime/botocore/waiter.py",
      53,
      "wait",
      "Waiter.wait(self, **kwargs)"
    ],
    [
      "/var/runtime/botocore/waiter.py",
      365,
      "wait",
      "last_response=response,"
    ]
  ]
}

Querying Data Using Boto3 Redshift SDK

In case you wish to query data from Redshift, you can refer to the boto3 Redshift API reference for Redshift data. You can know more about using the Amazon Redshift Data API to interact with Amazon Redshift clusters here

Another non-boto3 option is to get the database credentials, and then run the query using the traditional psycopg2.

Conclusion

In this article, we introduced you to the AWS boto3 Redshift SDK. We saw how to make method calls on the boto3 Redshift client, how to use Paginators, and also Waiters. We also saw the resources to help get started on SQL queries execution using the boto3 Redshift Data Client. 

While using AWS Redshift Services are insightful, it is a hectic task to set up and manage the proper environment on a regular basis. Extracting and integrating several heterogeneous sources into your Data Warehouse like Amazon Redshift is also a big task. To make things easier, Hevo comes to your rescue. Hevo Data is a No-code Data Pipeline and has awesome 100+ pre-built integrations that you can choose from.

Visit our Website to Explore Hevo

Hevo can help you integrate your data from 100+ sources and load them into destinations like Amazon Redshift to analyze real-time data with a BI tool. It will make your life easier and Data Migration hassle-free. It is user-friendly, reliable, and a secure solution for your business needs. 

Want to take Hevo for a spin? Sign Up for a 14-day free trial and see the difference! You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Share your experience of learning about the boto3 Redshift SDK in the comments section below.

Yash Sanghvi
Freelance Technical Content Writer, Hevo Data

Yash has helped companies succeed online and is specialized in the data industry. He provides comprehensive and informative content for those delving deeper into the complexities of data integration and analysis.

No-code Data Pipeline For your Amazon Redshift