Understanding Amazon Redshift Regex: Simplified 101

on Amazon Redshift, Data Cleaning, data management, Data Processing, redshift architecture • September 17th, 2021 • Write for Hevo

Amazon Redshift Regex - Feature Image

Companies stores terabytes of data from multiple data sources into Data Warehouses and Data Lakes. The data exists in different formats and is not ready for analysis. Data Cleaning becomes the first step to make your data more useful. No matter how good a Business Intelligence (BI) tool you have or any powerful Machine Learning model, the raw and unclean data can never deliver you good results. 

Amazon Redshift Regex is a perfect solution to clean data with fewer efforts. It uses regular expressions to extract strings from the data. Amazon Redshift Regex matches the data with a specified regular expression and returns the clean data as output. Data Cleaning is the most time-consuming task to analyze data or preparing it for the Machine Learning model. 

The quality data is directly proportional to the accuracy of any Machine Learning model. Amazon Redshift Regex offers great flexibility to Data Analysts, Data Scientists, and developers to clean the streaming data to Amazon Redshift and Amazon S3. In this article, you will learn about Amazon Redshift Regex, how to use the Regular Expressions in Amazon Redshift to clean data. You will also read about few Amazon Redshift Regex functions.

Table of Contents

Introduction to Amazon Redshift

Redshift Logo
Image Source

Amazon Redshift is a cloud-based serverless Data Warehouse that is a part of AWS (Amazon Web Services). It is a fully managed and cost-effective Data Warehouse solution that can store petabytes of data and perform real-time analysis to generate insights.

Amazon Redshift is a column-oriented Database that stores the data in a columnar format. With its compute engine, Amazon Redshift delivers fast query processing and with fewer resources. 

Amazon Redshift contains a leader node and cluster of compute nodes that perform analytics on data. The below snap depicts the schematics of Amazon Redshift architecture: 

Amazon Redshift offers JDBC (Java Database Connectivity) connectors to interact with client applications using major programming languages like Python, Scala, Java, Ruby, etc.

Key Features of Amazon Redshift

Amazon Redshift allows companies to store their data from multiple sources to a large storage pool. It offers many features to users. A few of them are listed below:

  • Amazon Redshift allows users to write queries and export the data back to Data Lake.
  • Amazon Redshift can seamlessly query the files like CSV, Avro, Parquet, JSON, ORC directly with the help of ANSI SQL.
  • Amazon Redshift has exceptional support for Machine Learning and developers can create, train and deploy Amazon SageMaker models using SQL.
  • Amazon Redshift has an Advanced Query Accelerator (AQUA) which performs the query 10x faster than other Cloud Data Warehouses.
  • Amazon Redshift’s Materialistic view allows you to achieve faster query performance for ETL, batch job processing, and dashboarding.
  • Amazon Redshift has a petabyte scalable architecture and it scales quickly as per need.
  • Amazon Redshift enables secure sharing of the data across Amazon Redshift clusters.
  • Amazon Redshift provides consistently fast performance, even with thousands of concurrent queries.

To learn more about Amazon Redshift, click here.

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDK,s, and Streaming Services and simplifies the ETL process. It supports 100+ data sources and is a 3-step process by just selecting the data source, providing valid credentials, and choosing the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.

Get Started with Hevo for Free

Its completely automated pipeline offers data to be delivered in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensure that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

Introduction to Regular Expressions

Amazon Redshift Regex (Regular Expressions) Cover Image
Image Source

Regular Expressions are the patterns that are bundled together to identify/match the characters in a set of strings. A Regular Expression is commonly called Regex, and it is a pattern that represents a string of characters. It can be used to match the string, not to validate them. 

Download the Cheatsheet on How to Set Up High-performance ETL to Redshift
Download the Cheatsheet on How to Set Up High-performance ETL to Redshift
Download the Cheatsheet on How to Set Up High-performance ETL to Redshift
Learn the best practices and considerations for setting up high-performance ETL to Redshift

A typical example of Regular Expression can be seen as finding out the E-Mail addresses from a long list of documents by applying a set of patterns that matches [text]@[text].[text]. Regular Expression or Regex are platform and language agnostic, and the same Regex pattern can be used in any Operating System and any coding language. Regex table that contains the detail about expressions and what it matches is given below:

Sub-expressionMatches
^It matches the beginning of the line.
$It matches the end of the line.
.It matches any single character, it can be a digit, or alphabet, or symbols
[…]It matches any single character in brackets.
[^…]It matches any single character, not in brackets
ABeginning of entire string
zEnd of the entire string
ZIt matches the end of the string just before the newline.
re*It matches 0 or more occurrences of the preceding expression.
re+It matches 1 or more of the previous expression
re?Matches 0 or 1 occurrence of the preceding expression.
re{ n}It matches exactly n number of occurrences of the preceding expression.
re{ n,}It matches n or more occurrences of the preceding expression.
re{ n, m}It matches at least n and at most m occurrences of the preceding expression.
a|bIt matches either a or b.
(re)It groups regular expressions and remembers matched text.
(?: re)It groups regular expressions without remembering matched text.
(?> re)It matches independent patterns without backtracking.
wIt matches word characters.
WIt matches non-word characters.
sIt matches whitespace. Equivalent to [tnrf].
SIt matches non-whitespace.
dIt matches digits. Equivalent to [0-9].
DIt matches non-digits.
n, t, etc.It matches newlines, carriage returns, tabs, etc.
QEscape (quote) all characters up to E

4 Key Amazon Redshift Regex Functions

Amazon Redshift Regex Functions
Image Source

Now that we have a good understanding of Regular Expressions. Let’s have a look at the popular Amazon Redshift Regex functions. The 4 key Regex functions are listed below:

1) Amazon Redshift Regex: REGEXP_Count

Regexp_count is an Amazon Redshift Regex function that searches for a string corresponding to the Regex pattern and returns the number of times the pattern occurs in the string. A count of 0 indicates no matching string found in the given text.
Syntax:

REGEXP_COUNT(source_string, pattern [, position [, parameters]])

Where, 

  • source_string: A string on which the patterns need to be matched.
  • pattern – A Regex pattern.
  • position [Optional]: Specifies the position in source_string to start searching. Default is the start of the string.
  • parameters [Optional]: An optional parameter that indicates how to match the pattern. The possible values are listed below:
    • c: It performs case-sensitive matching.
    • i: It performs case-insensitive matching.
    • p: For Perl compatible Regular Expression.

Some examples for Regexp_count are given below:

  • The following example counts the occurrences of ‘Fox’ in the string case-insensitive way.
SELECT regexp_count(‘Quick Brown Fox Jumps over Lazy Dog’, ‘FOX’, ‘i’);

regexp_count
-------------
           1
  • The following example searches and provide the number of occurrences where the domain is either ‘org‘ or ‘edu‘.
SELECT email, regexp_count(email,'.*@.*.(edu|org)')FROM users;

 email                    	        | regexp_count
--------------------------------------------------------
 alex.wrag@uog.edu 			|            1
 kristie.doug@ufs.org    	        |            1
 amy.marshall@uoc.ca    		|            0
 sed@uoc.ca             		|            0

2) Amazon Redshift Regex: REGEXP_Replace

Regexp_replace is another Amazon Redshift Regex function that searches for a pattern in a string and replaces every occurrence with the specified replacement string provided.
Syntax:

REGEXP_REPLACE(source_string,pattern[,replace_string[, position[,parameter]]

Where,

  • source_string: A string on which the patterns need to be matched.
  • pattern: A Regex pattern.
  • replace_string: A string or column name that will replace each occurrence of pattern.
  • position:  Specifies the position in source_string to start searching. Default is the start of the string.
  • parameters[Optional]: An optional parameter that indicates how to match the pattern. The possible values are:
    • c: It performs case-sensitive matching.
    • i: It performs case-insensitive matching.
    • p: For Perl compatible Regular Expression.

An example for Regexp_replace is given below:

The following example replaces all occurrences of the string “FOX” with the value “quick brown fox“, using case-insensitive matching.

SELECT regexp_replace('the fox', 'FOX', 'quick brown fox', 1, 'i');

regexp_replace
---------------------
the quick brown fox

3) Amazon Redshift Regex: REGEXP_SUBSTR

Regexp_substr is another Amazon Redshift Regex function that searches for the regex pattern in the provided string/column and extracts the matched characters.
Syntax:

REGEXP_SUBSTR ( source_string, pattern [, position [, occurrence [, parameters]]]

Where, 

  • source_string: A string on which the patterns need to be matched.
  • pattern: A Regex pattern.
  • position: It specifies the position in source_string to start searching. Default is the start of the string.
  • occurrence: It specifies how many occurrences to match within the string. Default is 1.
  • parameters[Optional]: An optional parameter that indicates how to match the pattern. The possible values are:
    • c: It performs case-sensitive matching.
    • i: It performs case-insensitive matching.
    • e: It allows sub-expression. 
    • p: For Perl compatible Regular Expression.

An example for Regexp_substr is given below:

The following example returns the portion of an E-Mail address between the ‘@‘ character and the domain extension.

SELECT email, regexp_substr(email,'@[^.]*') FROM users;

 email                     	|  regexp_substr
--------------------------------------------
 alex.wrag@uog.edu 		|  @uog
 kristie.doug@ufs.org    	|  @ufs 
 amy.marshall@uoc.ca    	|  @uoc
 sed@uoc.ca             	|  @uoc

4) Amazon Redshift Regex: REGEXP_INSTR

Regexp_instr is another function of Amazon Redshift Regex and it searches for regex patterns and returns the starting or ending position of the character that matches the pattern.
Syntax:

REGEXP_INSTR ( source_string, pattern [, position [, occurrence] [, option [, parameters ] ] ] ] )

Where, 

  • source_string: A string on which the patterns need to be matched.
  • pattern: A Regex pattern.
  • position:  Specifies the position in source_string to start searching. Default is the start of the string.
  • occurrence – specifies how many occurrences to match within the string. Default is 1.
  • option: Option to either return the starting position(0) of character or ending position(1). The default value is 0. 
  • parameters[Optional] – An optional parameter that indicates how to match the pattern. The possible values are:
    • c: It performs case-sensitive matching.
    • i: It performs case-insensitive matching.
    • e: allows sub-expression.
    • p: For Perl compatible Regular Expression.

An example for Regexp_instr is given below:

The following example searches for the ‘@‘ character that begins a domain name and returns the starting position of the first match.

SELECT email, regexp_instr(email, '@[^.]*') FROM users;

email                     	|  regexp_instr
--------------------------------------------
 alex.wrag@uog.edu 		|  10
 kristie.doug@ufs.org    	|  13 
 amy.marshall@uoc.ca    	|  13
 sed@uoc.ca             	|  4

Pattern Matching Operators in Amazon Redshift Regex

You have read how to use Regular Expressions in the Amazon Redshift Regex. Apart from the Amazon Redshift Regex function, there are several pattern matching operators are available in Amazon Redshift Regex that performs the wildcard matches and returns the result that matches with the pattern. In this section, let’s discuss the key pattern matching operator provided by Amazon Redshift Regex. The 3 Regex operators are listed below:

1) LIKE

The LIKE operator in Amazon Redshift Regex is used to match the string with the wildcard patterns such as %(percent) and _(Underscore).  LIKE is case-sensitive and ILIKE is case-insensitive.

Syntax :
expression [ NOT ] LIKE | ILIKE pattern [ ESCAPE ‘escape_char’]
Where

  • expression: Often a column name.
  • LIKE | ILIKE: A pattern matching operator.
  • pattern: A pattern to be matched.
  • escape [Optional]: An optional escape_char in case the wildcard is a part of a string to be matched.

LIKE supports the following pattern-matching metacharacters given below:

OperatorDescription
%It matches any sequence of zero or more characters.
_It matches any single character.

An example for the LIKE operator is given below:

select distinct city from emp where city like 'E%' order by city;

city
---------------
East Glasgow
Eastern South City
Easthampton

2) SIMILAR TO

The SIMILAR TO operator in Amazon Redshift Regex matches a string expression often column name with the regex pattern. The list of metacharacters supported by the SIMILAR TO operation are listed in the below table: 

OperatorDescription
%It matches any sequence of zero or more characters.
_It matches any single character.
|It is used for alternative matches, either this or that
*It repeats the previous item zero or more times.
+It repeats the previous item one or more times.
?It repeats the previous item zero or one time.
{m}It repeats the previous item exactly m times.
{m,}It repeats the previous item m or more times.
{m,n}It repeats the previous item at least m and not more than n times.
()Parentheses group items into a single logical item.

Syntax:
expression [ NOT ] SIMILAR TO pattern [ ESCAPE ‘escape_char’ ]
Where

  • expression: Often a column name.
  • SIMILAR TO: A pattern matching operator.
  • pattern: A pattern to be matched.
  • escape [Optional]: An optional escape_char in case the wildcard is a part of a string to be matched.

An example for the SIMILAR TO operator is given below:

The following example finds cities whose names contain “E” or “H”:

SELECT DISTINCT city FROM emp
WHERE city SIMILAR TO '%E%|%H%';

 city
-----------------
 Malabar Hills
 Houston
 Easthampton
 Beverly Hills
 Eastern Heights

3) POSIX

The POSIX is another pattern matching Amazon Redshift Regex operator and is more powerful than LIKE and SIMILAR TO operators. POSIX operator can be used to match the pattern anywhere in the string. POSIX is a computationally expensive operation and results in slow results when processing a huge number of rows.

Syntax: 
expression [ ! ] ~ pattern
Where

  • Expression: Often a column name.
  • ! : Negation operator.
  • ~ : Perform a case-sensitive match for any substring of expression.
  • pattern: A string literal that represents a SQL standard Regular Expression pattern.

In addition to the Amazon Redshift Regex pattern listed in the earlier section, the POSIX operator on Amazon Redshift supports the following character classes given below:

Character ClassDescription
[[:alnum:]]All ASCII alphanumeric characters.
[[:alpha:]]All ASCII alphabetic characters.
[[:blank:]]All blank space characters.
[[:cntrl:]]All control characters (nonprinting).
[[:digit:]]All numeric digits.
[[:lower:]]All lowercase ASCII alphabetic characters.
[[:punct:]]All punctuation characters.
[[:space:]]All space characters (nonprinting).
[[:upper:]]All uppercase ASCII alphabetic characters.
[[:xdigit:]]All valid hexadecimal characters.

The following example finds cities whose names contain E or H:

SELECT DISTINCT city FROM emp
WHERE city ~ '.*E.*|.*H.*' ORDER BY city LIMIT 5;

city
-----------------
 Malabar Hills
 Houston
 Easthampton
 Beverly Hills
 Eastern Heights

Conclusion

In this article, you learned about Amazon Redshift Regex in detail and how to apply those Regular Expressions to clean data. Also, you read about various Amazon Redshift Regex functions and operators that come in handy to perform Regular Expressions on strings. Data Analysts and Data Scientists can save a lot of time by cleaning data using Amazon Redshift Regex. 

Visit our Website to Explore Hevo

Companies use multiple platforms for their daily tasks and transfer data from these platforms to Amazon Redshift to run analysis and get insights from it. Hevo Data is a No-code Data pipeline solution that can help you transfer data from 100+ sources to Amazon Redshift or other Data Warehouse of your choice. Its fault-tolerant and user-friendly architecture fully automates the process of loading and transforming data to destination without writing a single line of code. 

Want to take Hevo for a spin?

Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Share your experience of learning about Amazon Redshift Regex in the comments section below!

No-code Data Pipeline For your Amazon Redshift