Working with Boto3 Redshift SDK: Made Easy

Q: 2. How many queries can Redshift handle?

By default, Redshift can handle 50 concurrent queries , with the ability to scale this using Workload Management (WLM) and Concurrency Scaling to manage query load during spikes.

Q: 3. How much data can Redshift handle?

Redshift can scale to handle petabytes of data with RA3 nodes for independent storage scaling, and Redshift Spectrum enables querying unlimited data stored in Amazon S3 .

Big Data storage solutions like Amazon Redshift can be used for a variety of purposes. Amazon Redshift can help businesses accelerate their business operations and analytics through simple SQL operations. Redshift is a high-performance and cost-effective solution used by prominent companies like Pfizer, McDonald’s, FOX Corporation, Amazon.com, Intuit, Yelp, Western Digital, etc. Moreover, Gartner, Forrester, IDC, and many more have named Amazon Redshift a market leader in Data Warehousing Solutions.

While Amazon Redshift is built around industry-standard SQL, by no means it is confined to just SQL. Users can connect, access, and query their data in Redshift using Python, one such method being the boto3 Redshift client support. Boto3 Redshift SDK is a Python SDK that makes it easy to use the Python programming language to manipulate AWS resources and automation infrastructure. Read on to find out more about the boto3 Redshift Example, Query and Client.

Hevo’s fully managed solution not only streamlines data transfer into Amazon Redshift but also ensures your data is analysis-ready. Its fault-tolerant architecture guarantees secure and consistent data handling with zero data loss, so you can focus on deriving insights from your data.

Why Choose Hevo for Amazon Redshift?

Secure Data Handling: Hevo’s fault-tolerant architecture ensures that your data is securely processed with no risk of data loss.
Seamless Schema Management: Hevo automatically detects the schema of incoming data and maps it to Redshift, simplifying schema management.
User-Friendly Interface: With its intuitive UI, Hevo is easy to use, even for beginners, enabling quick setup and smooth data operations.

Track your data flow into Amazon Redshift and monitor its status at any time with Hevo

Get Started with Hevo for Free

Table of Contents

What is the Boto3 Redshift SDK?

Boto3 is the name of AWS SDK for Python. It enables you to link your Python application or script or library with AWS Services. It provides native support in Python 2.7+ and 3.4+. Boto3 Redshift SDK provides two levels of APIs:

Client (low-level) APIs: Client APIs map the underlying HTTP API operations one to one.
Resource APIs: Resource APIs provide resource objects and collections to access attributes and perform actions. Boto3 Redshift also provides a feature called ‘waiters’ that keeps polling an AWS resource for a pre-defined status change. For example, after you start an EC2 instance, a waiter can wait until it reaches the ‘running’ state.

Another offering of boto3 Redshift SDK is paginators. Some AWS operations return partial/incomplete results and require subsequent requests to get the entire result. An example is a list_objects operation on S3. It can return a maximum of 1000 objects and subsequent requests are required to get the complete list.

With the introduction to both AWS Redshift and boto3 Redshift SDK done, let’s see how to use boto3 for performing operations on Redshift.

Prerequisites for Boto3 Redshift SDK Setup

Proficiency with Python is a must.
An AWS account with a provisioned Redshift cluster is necessary. You are eligible for a 2 month free trial of Redshift if you have never created a cluster in the past.
Familiarity with Redshift will help.
In this article, we will run the Python queries on a lambda function. You are free to run it elsewhere (EC2 for example). You should know the setup required for the platform on which you plan to run your queries (for example you should know how to create/deploy a lambda function if you wish to run the query on lambda).
It is expected that you are familiar with the creation of IAM policies and the addition of those policies to IAM roles.

Setting Up Boto3 Redshift SDK

Boto3 has an extensive API reference for Redshift. As you can see from the article, the API reference contains the low-level Client API, Paginators, and Waiters. We will not go over all of the possible Client Methods, Paginators, and Waiters in this tutorial. But we’ll cover just enough to get you started with the boto3 API reference for Redshift and make you feel at ease.

Getting Started with Boto3 Redshift SDK

It all starts with creating a client. It is quite simple. See the code snippet below:

import boto3
client = boto3.client('redshift')

While a statement like the above will work on lambda, you may need to provide additional arguments on other platforms.

client = boto3.client('redshift', region_name = 'us-east-1', aws_access_key_id = KEY, aws_secret_access_key = SECRET)

Next, you need to make sure that you have the ID of the cluster of your interest. If you are unsure, go to the Redshift Console, click on ‘Clusters’ from the left menu, and scroll down to find the list of clusters. The name of the cluster is the cluster-ID.

Create the Policy and Attach it to the Lambda Role

In this tutorial, we will make the boto3 Redshift API calls using a lambda function. You can modify/skip this part if you plan to make the calls through some other platform.

Go to the IAM console, and create a policy. Now, to get started very quickly, you can skip this step and simply use the AWS-managed AmazonRedshiftFullAccess policy. However, you may want to give some restricted Redshift access to your lambda, in which case, you should create your own policy. You can use the visual editor or edit the JSON directly. For this tutorial, to keep things simple, we will provide lambda the read and list permissions to all Redshift resources.

The JSON version of the policy looks like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeHsmConfigurations",
                "redshift:ListDatabases",
                "redshift:DescribeClusterTracks",
                "redshift:DescribeEvents",
                "redshift:DescribeUsageLimits",
                "redshift:DescribeDefaultClusterParameters",
                "redshift:DescribeEventCategories",
                "redshift:DescribeClusterSubnetGroups",
                "redshift:DescribeQuery",
                "redshift:DescribeReservedNodeOfferings",
                "redshift:DescribeSnapshotSchedules",
                "redshift:DescribeDataSharesForProducer",
                "redshift:DescribeClusterVersions",
                "redshift:DescribeClusterSnapshots",
                "redshift:DescribeSavedQueries",
                "redshift:DescribeLoggingStatus",
                "redshift:DescribeStorage",
                "redshift:DescribeTableRestoreStatus",
                "redshift:DescribeClusterParameters",
                "redshift:DescribeClusterSecurityGroups",
                "redshift:DescribeEventSubscriptions",
                "redshift:DescribeOrderableClusterOptions",
                "redshift:DescribeHsmClientCertificates",
                "redshift:DescribeAccountAttributes",
                "redshift:ViewQueriesInConsole",
                "redshift:DescribeScheduledActions",
                "redshift:DescribeTags",
                "redshift:DescribeClusterParameterGroups",
                "redshift:DescribeClusterDbRevisions",
                "redshift:FetchResults",
                "redshift:ViewQueriesFromConsole",
                "redshift:ListTables",
                "redshift:ListSchemas",
                "redshift:DescribeSnapshotCopyGrants",
                "redshift:DescribeDataSharesForConsumer",
                "redshift:DescribeAuthenticationProfiles",
                "redshift:DescribeReservedNodes",
                "redshift:DescribeNodeConfigurationOptions",
                "redshift:ListSavedQueries",
                "redshift:DescribeClusters",
                "redshift:DescribeDataShares",
                "redshift:DescribeResize",
                "redshift:GetReservedNodeExchangeOfferings",
                "redshift:DescribeTable"
            ],
            "Resource": "*"
        }
    ]
}

You can restrict the policy to specific resources (recommended), and also provide write permissions if required.

Once the IAM policy is created, attach it to the IAM role of your lambda function. Even if you are using any AWS-managed policy instead of creating your own, don’t forget to attach it to the IAM role of your lambda function.

Make your First Boto3 Redshift API Call

Let’s call the describe_clusters() operation. It will, as the name suggests, describe the clusters for us.

Replace the lambda code with the following:

import json
import boto3

def lambda_handler(event, context):
    # TODO implement
    client = boto3.client('redshift')
    response = client.describe_clusters(
        MaxRecords=30
    )
    print(response)
    return json.loads(json.dumps(response, indent=4, sort_keys=False, default=str))

Add the name of your cluster-ID in place of ‘your-cluster-id’. Deploy it and trigger the lambda. You should see a response like the one below:

{
  "Clusters": [
    {
      "ClusterIdentifier": "redshift-cluster-1",
      "NodeType": "dc2.large",
      "ClusterStatus": "available",
      "ClusterAvailabilityStatus": "Available",
      "MasterUsername": "awsadminuser",
      "DBName": "dev",
      "Endpoint": {
        "Address": "redshift-cluster-1.abcdefgh.your_region.redshift.amazonaws.com",
        "Port": 5439
      },
      "ClusterCreateTime": "2021-12-08 16:46:06.078000+00:00",
      "AutomatedSnapshotRetentionPeriod": 1,
      "ManualSnapshotRetentionPeriod": -1,
      "ClusterSecurityGroups": [],
      "VpcSecurityGroups": [
        {
          "VpcSecurityGroupId": "sg-0dbfe08092eccxyzw",
          "Status": "active"
        }
      ],
      "ClusterParameterGroups": [
        {
          "ParameterGroupName": "default.redshift-1.0",
          "ParameterApplyStatus": "in-sync"
        }
      ],
      "ClusterSubnetGroupName": "default",
      "VpcId": "vpc-02ed9c6d864dwwwwww",
      "AvailabilityZone": "us-east-2c",
      "PreferredMaintenanceWindow": "thu:05:30-thu:06:00",
      "PendingModifiedValues": {},
      "ClusterVersion": "1.0",
      "AllowVersionUpgrade": true,
      "NumberOfNodes": 1,
      "PubliclyAccessible": false,
      "Encrypted": false,
      "ClusterPublicKey": "ssh-rsa public_key Amazon-Redshiftn",
      "ClusterNodes": [
        {
          "NodeRole": "SHARED",
          "PrivateIPAddress": "xxx.xx.xx.xxx",
          "PublicIPAddress": "yy.yyy.yyy.yy"
        }
      ],
      "ClusterRevisionNumber": "33904",
      "Tags": [],
      "EnhancedVpcRouting": false,
      "IamRoles": [],
      "MaintenanceTrackName": "current",
      "DeferredMaintenanceWindows": [],
      "NextMaintenanceWindowStartTime": "2021-12-16 05:30:00+00:00",
      "AvailabilityZoneRelocationStatus": "disabled",
      "ClusterNamespaceArn": "arn:aws:redshift:your_region:your_account:namespace:aae9c14b-e6f5-427f-9b03-d76d1wwwwwwww",
      "TotalStorageCapacityInMegaBytes": 400000,
      "AquaConfiguration": {
        "AquaStatus": "disabled",
        "AquaConfigurationStatus": "auto"
      }
    }
  ],
  "ResponseMetadata": {
    "RequestId": "e44461b7-46e1-400e-ad3c-bfa7ba410377",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amzn-requestid": "e44461b7-46e1-400e-ad3c-bfa7ba410377",
      "content-type": "text/xml",
      "content-length": "3803",
      "vary": "accept-encoding",
      "date": "Tue, 14 Dec 2021 17:05:55 GMT"
    },
    "RetryAttempts": 0
  }
}

As you can see, all the details pertaining to your clusters (parameter group, namespace, endpoint, etc.) are returned. I had just one cluster in Redshift, so the ‘Clusters’ array had only one item.

In the API reference, the output can be restricted using either the ClusterIdentifier argument (in which case only the description of that specific cluster is returned) or using the TagKeys or TagValues argument (in which case only the clusters having matching keys or values will be returned).

Boto3 Redshift API Call Example with Pagination

We’ve asked for a maximum of 30 records in the above API call. However, if there are more clusters to be returned, you can use pagination. The corresponding paginator for this method can be found here.

A paginated version of the above query would look like this:

import json
import boto3

def lambda_handler(event, context):
    # TODO implement
    responses = []
    client = boto3.client('redshift')
    paginator = client.get_paginator('describe_clusters')
    response = paginator.paginate(
            PaginationConfig={
                'MaxItems': 20,
                'PageSize': 20,
            }
        ).build_full_result()
    responses.append(response)
        
    while ('NextToken' in response):
        response = paginator.paginate(
            PaginationConfig={
                'MaxItems': 20,
                'PageSize': 20,
                'StartingToken': response['NextToken']
            }
        ).build_full_result()
        responses.append(response)
    return json.loads(json.dumps(responses, indent=4, sort_keys=False, default=str))

As you can see, the paginator is run initially without a starting token. We then check if the response contains the ‘NextToken’ key. If it does, that means that there are more results yet to be fetched. Therefore, we run the paginator again, till there is no ‘NextToken’ key in the response.

If you just have a single cluster, and yet want to see the pagination in action, you can add more than 20 tags to that cluster, and then run pagination on describe_tags.

Boto3 Redshift API Call Example with Waiter

As described in the introduction, a waiter polls an API till a successful response is received. For Redshift, four waiters are available currently in boto3:

ClusterAvailable,
ClusterDeleted,
ClusterRestored, and
SnapshotAvailable.

Let us consider the ClusterAvailable w aiter. This polls the describe_clusters() method every 60 seconds until it receives a success. It returns an error after 30 failed checks. This is useful for waiting till the cluster becomes available and then performing further actions. The lambda code for the waiter will look like this:

import json
import boto3

def lambda_handler(event, context):
    # TODO implement
    client = boto3.client('redshift')
    waiter = client.get_waiter('cluster_available')
    
    waiter.wait(
        ClusterIdentifier='redshift-cluster-1',
        MaxRecords=20,
        WaiterConfig={
            'Delay': 30,
            'MaxAttempts': 5
        }
    )
        
    response = client.describe_clusters()

    #perform other actions
    
    return json.loads(json.dumps(response, indent=4, sort_keys=False, default=str))

As you can see, the waiter doesn’t return anything. It just introduces a delay till the required condition is satisfied (the availability of the ‘redshift-cluster-1’ cluster in this case). We can configure the waiter by specifying the delay between successive attempts, and the maximum number of attempts in the WaiterConfig argument.

In case the waiter doesn’t get a successful response even after maximum attempts are exceeded, the lambda function returns a runtime error like the one below:

{
  "errorMessage": "Waiter ClusterAvailable failed: Max attempts exceeded",
  "errorType": "WaiterError",
  "stackTrace": [
    [
      "/var/task/lambda_function.py",
      15,
      "lambda_handler",
      "'MaxAttempts': 2"
    ],
    [
      "/var/runtime/botocore/waiter.py",
      53,
      "wait",
      "Waiter.wait(self, **kwargs)"
    ],
    [
      "/var/runtime/botocore/waiter.py",
      365,
      "wait",
      "last_response=response,"
    ]
  ]
}

Querying Data Using Boto3 Redshift SDK

In case you wish to query data from Redshift, you can refer to the boto3 Redshift API reference for Redshift data. You can know more about using the Amazon Redshift Data API to interact with Amazon Redshift clusters here.

Another non-boto3 option is to get the database credentials, and then run the query using the traditional psycopg2.

Conclusion

In this article, we introduced you to the AWS boto3 Redshift SDK. We saw how to make method calls on the boto3 Redshift client, how to use Paginators, and also Waiters. We also saw the resources to help get started on SQL queries execution using the boto3 Redshift Data Client.

While using AWS Redshift Services are insightful, it is a hectic task to set up and manage the proper environment on a regular basis. Extracting and integrating several heterogeneous sources into your Data Warehouse like Amazon Redshift is also a big task. To make things easier, Hevo comes to your rescue. Hevo Data is a No-code Data Pipeline and has awesome 150+ pre-built integrations that you can choose from.

Hevo can help you integrate your data from 150+ sources and load them into destinations like Amazon Redshift to analyze real-time data with a BI tool. It will make your life easier and Data Migration hassle-free. It is user-friendly, reliable, and a secure solution for your business needs.

Want to take Hevo for a spin? Sign Up for Hevo’s 14-day free trial and see the difference! You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Share your experience of learning about the boto3 Redshift SDK in the comments section below.

FAQs

1. What is the limitation of Redshift API:

The Redshift API has rate limits on request frequency, does not support executing SQL queries, and can experience delays for long-running operations like resizing clusters or restoring snapshots.

2. How many queries can Redshift handle?

By default, Redshift can handle 50 concurrent queries, with the ability to scale this using Workload Management (WLM) and Concurrency Scaling to manage query load during spikes.

3. How much data can Redshift handle?

Redshift can scale to handle petabytes of data with RA3 nodes for independent storage scaling, and Redshift Spectrum enables querying unlimited data stored in Amazon S3.

Yash Sanghvi Technical Content Writer, Hevo Data

Yash is a trusted expert in the data industry, recognized for his role in driving online success for companies through data integration and analysis. With a Dual Degree (B. Tech + M. Tech) in Mechanical Engineering from IIT Bombay, Yash brings strategic vision and analytical rigor to every project. He is proficient in Python, R, TensorFlow, and Hadoop, using these tools to develop predictive models and optimize data workflows.