Amazon Redshift IN Condition: Delivered With 6 Easy Examples
Companies use Data Warehouses to run their business operations efficiently and implement data-driven strategies. Amazon Redshift is a widely used Data Warehousing solution that helps companies store and analyze data. Its the job of Data Analysts, Data Scientists, and Business Analyst to get the data out of the large datasets and analyze them separately with the help of Business Intelligence tools.
Table of Contents
Amazon Redshift offers high query performance and allows users to easily extract data from the Data Warehouse. There is a continuous need to extract small chunks of data from large datasets for various business activities, analysis, etc. Amazon Redshift supports SQL for querying data from the Data Warehouse that making the job easier for SQL users. Amazon Redshift IN Condition is one of the widely used patterns matching conditions that Analysts use while querying data.
Amazon Redshift IN Condition tests a value for membership in a set of values or in a subquery. Many times Amazon Redshift IN Condition is used for optimizing the query. In this article, you will be touching upon one specific aspect of the Amazon Redshift IN Condition. Also, you go through a few examples of Amazon Redshift IN Condition for better understanding.
Table of Contents
- Introduction to Amazon Redshift
- Understanding Amazon Redshift IN Condition
- Examples of Amazon Redshift IN Condition
Introduction to Amazon Redshift
Amazon Redshift is the Data Warehousing solution offered by Amazon. It can scale to petabytes of data. Like most other Data Warehouses, Amazon Redshift organizes data in columns instead of rows, to allow parallel processing of data. However, as end-users, you need not worry about the internal implementation of the parallel data query. You can use the good old SQL to work with data in Amazon Redshift.
Amazon Redshift uses Columnar Data Storage and Massive Parallel Processing (MPP) to generate query responses quickly. There is another feature called Concurrent Scaling that facilitates quick query response and offers extra clusters for querying data that is larger than the main cluster’s capacity.
Key Features of Amazon Redshift
Amazon Redshift help companies store and analyze data to make data-driven decisions. Most companies use Amazon Redshift due to its plethora of features. A few features of Amazon Redshift are listed below:
- Intelligent Optimization: There are several ways to query a large dataset and you can get some information. Amazon Redshift offers tools to improve the queries and optimize them.
- Massively Parallel Processing: Amazon Redshift achieves high performance by parallel running queries in different clusters and distributes workload on other processors which makes the process faster.
- Redshift ML: Amazon Redshift comes with a pre-built environment that leverages Machine Learning that helps Data Analysts and Data Scientists to create, train and deploy Amazon SageMaker models using SQL commands.
- Fault-Tolerant: AWS continuously monitors the health of the Amazon Redshift clusters and automatically replicates data regularly to avoid any data loss at the time of disaster.
To learn more about Amazon Redshift, click here.
Simplify Data Analysis with Hevo’s No-code Data Pipeline
Hevo Data, a No-code Data Pipeline helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDK,s, and Streaming Services and simplifies the ETL process. It supports 100+ data sources and is a 3-step process by just selecting the data source, providing valid credentials, and choosing the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.Get Started with Hevo for Free
Its completely automated pipeline offers data to be delivered in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.
Check out why Hevo is the Best:
- Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
- Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
- Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
- Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
- Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
- Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
- Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Familiarity with SQL will help. If you wish to implement the examples shown in this article, then you’ll need an AWS account with an Amazon Redshift cluster provisioned. If you have a new account then, you are eligible for the 2 months free trial for accessing DC2.Large node. For details click here.
Understanding Amazon Redshift IN Condition
The Amazon Redshift IN Condition is very straightforward and you should feel at home with it if you have used it in the past with other databases like MySQL, PostgreSQL, etc. The syntax is variable [NOT] IN (list/ subquery)
Over here, the variable can be of one of the following types:
- Character (varchar, text, etc.)
Of course, the values in the list/subquery datatype should be compatible with the data type of the variable.
The list of values can be comma-separated and bounded by parentheses. Alternatively, you can have a subquery (SELECT ….) in parentheses.
As you would expect, this condition is generally used in the WHERE clause and is used to filter out rows in the final output, which lie IN (or NOT IN) a list or the output of a subquery.
Special Amazon Redshift Considerations
If the list includes more than 10 values, it is internally evaluated as a scalar array, whereas for fewer than 10 values, the list is evaluated as a series of OR conditions. This can be better understood with an example.
You will consider the user’s table from the public dataset made available to you when you first create an Amazon Redshift cluster. You will use EXPLAIN to see how Amazon Redshift plans to execute the query. Let us filter users by the firstname.
Less than 10 values in the list
EXPLAIN SELECT * from users WHERE firstname IN ('Rafael', 'Lars', 'Victor', 'Colton')
The output is:
As you can see, it has considered the Amazon Redshift IN Condition as a series of four = conditions, joined by OR.
Greater than 10 items in the list
Now let’s try with more than 10 items on the list
EXPLAIN SELECT * from users WHERE firstname IN ('Rafael', 'Lars', 'Victor', 'Colton','Barry','Reagan','Naida','Scarlett','Amal','Hermione','Helen')
The output is
As you can see, now the list is evaluated as a scalar array.
Amazon Redshift does this for internal performance optimization. While you generally need not worry about how Amazon Redshift handles this internally, if you ever want to make Amazon Redshift forcibly use OR series, instead of scalar array, you can try some approaches specified here.
Examples of Amazon Redshift IN Condition
Now let’s consider some examples of using the Amazon Redshift IN Condition. For the initial few examples, we’ll consider the date table that is available as a sample table provided at the time of creating the cluster.
Let’s just inspect the table first.
SELECT * from date LIMIT 10
The output is shown below.
As you can see, it has one date column (caldate), a couple of text columns (day, month), one boolean column (holiday), and other numeric columns.
Let’s see what some example queries on this table would look like.
1) Get all rows falling on a weekday
SELECT * from date WHERE day IN ('MO','TU','WE','TH','FR')
SELECT * from date WHERE day NOT IN ('SA','SU')
2) Get rows falling in weeks 3 to 5
SELECT * from date WHERE week IN (3,4,5)
3) Get rows falling in January to March months
SELECT * from date WHERE month IN ('JAN','FEB','MAR')
Alternatively, you can also extract the month field from the date column and apply Amazon Redshift IN conditions to it.
SELECT * from date WHERE datepart(month,caldate) IN (1,2,3)
4) Get all rows except those falling in solstices of 2008
SELECT * from date WHERE caldate NOT IN ('2008-12-21','2008-06-21')
Now let’s consider some examples with a subquery in the parentheses instead of a list of values. You will consider the sales and listing tables available in the sample dataset.
5) Get all listed items that were also sold
SELECT * from listing WHERE listid IN (SELECT listid from sales)
As you can see, here you are using a subquery with the Amazon Redshift IN Condition. You are fetching all the listids from the sales table, and seeing their listings for the same.
6) Get all items sold, which were listed in between 06:00 and 08:00 on 1st January 2008
SELECT * from sales WHERE listid IN (SELECT listid from listing WHERE listtime >= '2008-01-01 06:00:00' and listtime <= '2008-01-01 08:00:00')
In this article, you learnt how to use the Amazon Redshift IN Condition, through several examples, covering different data types. You also saw examples of using the Amazon Redshift IN Condition with subqueries. You understood how Amazon Redshift internally optimizes the query plan depending on the number of values in the list.Visit our Website to Explore Hevo
Amazon Redshift stores data from multiple sources and every source follows a different schema. Instead of manually writing scripts for every source to perform the ETL (Extract Transform Load) process, one can automate the whole process. Hevo Data is a No-code Data pipeline solution that can help you transfer data from 100+ sources to Amazon Redshift or other Data Warehouse of your choice. Its fault-tolerant and user-friendly architecture fully automates the process of loading and transforming data to destination without writing a single line of code.
Want to take Hevo for a spin? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand.
Share your experience of learning about the Amazon Redshift IN Condition in the comments section below!