BigQuery Regex and Pattern Matching: 6 Major Things

on Data Warehouse, Google BigQuery, SQL • January 20th, 2022

BigQuery Regex

Companies latch new businesses by analyzing their rising data, but traditional data warehouses scramble to deliver consistent performance. However, Google rolled out an enterprise serverless data warehouse, BigQuery that enables Data Analysis at scale to suit businesses of any size. As BigQuery supports structured and semi-structured data, it allows analysts to use standard Structured Query Language (SQL) and BigQuery Regex to effectively work with Big Data.

This article gives an overview of the BigQuery Regex and pattern matching. It introduces you to BigQuery and the fundamentals of regular expression in SQL. Moreover, it also describes key functions, syntax, and example queries of Regex functions.

Table of Contents

Prerequisite

  • Understanding of different data types, including string and floating-point

What is BigQuery

BigQuery Regex - BigQuery logo
Image Source

BigQuery is Google Cloud’s fully managed, serverless, petabyte-scale data warehouse solution. Since its inception in 2012, it has provided cost-effective and real-time analytics on a vast amount of data that helps businesses to gain agility. With BigQuery, analysts get the support of ANSI-SQL at the backend, which helps users to focus more on finding meaningful insights swiftly.

In addition, BigQuery facilitates built-in features like Machine Learning, Geospatial Analysis, and Business Intelligence for organizations to manage and analyze data effectively. For further information on Google BigQuery, check out the official website here.

Introduction to Regex

BigQuery Regex - Regex logo
Image Source

Originated in 1951, Regex, abbreviated as the regular expression, is a sequence of characters used for pattern matching with strings or string matching. As regular expressions are used in every programming language like Java, Python, and SQL, it is helpful to match patterns with a sequence of characters. To get a broader view, the importance and understanding of Regex are as discussed below.

What is BigQuery Regex?

BigQuery is one of the most popular Cloud Data Warehouses used by millions of people all over the world for storage and analysis. In BigQuery Analysis, there are a lot of cases where one would want to extract certain parts of a string, validate the format of a string, and even replace or remove certain characters. This is where BigQuery Regex comes into the picture. As discussed, regular expressions are a pattern or a sequence of characters used for pattern matching. It helps you perform various string matching operations in BigQuery.

Importance of Regex in Programming

Many programming languages provide Regex either built-in or via libraries that have wide applications in Search Engines, URL Matching, and Lexical Analysis. Below are a few essential aspects of Regex:

  • Pattern Matching: Regular expressions are useful in matching partial or full patterns on a string. These patterns are either enclosed between single quotes or embedded in a constructor function. 
  • Ease of Usage: A regex pattern typically consists of simple characters or a combination of simple and special characters. As literals (character) form the most basic building block, most characters in Regex match a string effortlessly.
  • User-defined: As Regex supports a number of metacharacters with no specific hierarchy, it is user-friendly.

Understanding Regex and Pattern Matching

A regular expression is composed of various elements that become the building blocks for constructing patterns. To understand Regex and Pattern Matching, consider a few components mentioned below:

  • Simple Patterns: It consists of patterns that help find a direct match of characters. These patterns may also include combinations of strings or literals but should be in an exact sequence.
  • Special Patterns: If a search requires more than a direct match, it accompanies a combination of many symbols, characters, and numbers arranged to achieve desired results. Below are a few elements:
  • Repeaters: consists of symbols that direct a program to repeat a preceding character more than once. It comprises of  ‘*,’ ‘+,’ {}‘ to repeat (up to infinite times) a particular character in a sequence.
  • Wildcard: consists of a dot symbol that can take the place of any symbol, and hence called wildcard character. It can be used to match any single character (letter, digits, and white space).
  • Symbols: Regex utilizes many symbols to compute a pattern. Following are a few symbols used:
    • Optional character (?) tells the computer that the preceding character may or may not be present in the matching string.
    • Caret symbol (^) helps to set the position for matching at the beginning of a string or line.
    • Dollar symbol ($) assists the computer in setting the position for matching at the end of a string or line.
  • Character class: matches a set of characters to match the most basic elements of language. Below are a few character classes:
/smatches any whitespace characters like space and tab
/S :matches any non-whitespace characters
/d :matches any digit character
/Dmatches any non-digit characters
/wmatches any word character (alpha-numeric)
/W :matches any non-word character
  • Escape: is used to search any special character as a literal. You can escape using a backslash ‘’ in front of strings. This informs the computer to treat the following character as a search character and consider it for pattern matching.
  • Grouping: is done to combine a set of characters to behave as a single unit. Regex allows the grouping of elements by wrapping a parenthesis ‘()’ around the expression. Users can also match one or more elements by a vertical bar ‘|.’

Regex in SQL

SQL supports Pattern Matching operations based on the RegexP operator. It helps implement pattern search using a query in a database. Below are the patterns used in SQL:

PatternWhat pattern matches
[abc]Any character listed between square brackets
*Repetition of preceding string from zero to infinite times
+Repetition of preceding string from one to infinite times
.A single character
?Match a zero-or-one preceding string
$Matches end of a string
^Matches beginning of a string
[^abc]Any character not listed between square brackets
[a-z]Matches any lowercase letter
[ABC]Matches any uppercase letter
[0-9]Matches any digit from 0 to 9
[[:<:]]Matches the beginning of words
[[:>:]]Matches the end of words
a1 | a2 | a3Matches any of the patterns a1, a2, or a3

Simplify your Data Analysis using Hevo’s No-code Data Pipeline

A fully managed No-code Data Pipeline platform like Hevo Data helps you integrate data from 100+ Data Sources (including 40+ Free Data Sources) and will let you directly load data to a Data Warehouse or the destination of your choice like BigQuery. It will automate your data flow in minutes without writing any line of code. Its fault-tolerant architecture makes sure that your data is secure and consistent. Hevo provides you with a truly efficient and fully automated solution to manage data in real-time and always have analysis-ready data.

Get Started with Hevo for free

Let’s look at some of the salient features of Hevo:

  • Fully Managed: It requires no management and maintenance as Hevo is a fully automated platform.
  • Data Transformation: It provides a simple interface to perfect, modify, and enrich the data you want to transfer.
  • Real-Time: Hevo offers real-time data migration. So, your data is always ready for analysis.
  • Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
  • Scalable Infrastructure: Hevo has in-built integrations for 100’s of sources that can help you scale your data infrastructure as required.
  • Live Monitoring: Advanced monitoring gives you a one-stop view to watch all the activities that occur within Data Pipelines.
  • Live Support: Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Sign up here for a 14-day Free Trial!

6 BigQuery Regex Functions

As BigQuery uses SQL at the backend, it allows users to match, search and replace or validate a string input. Analysis often requires you to extract only certain parts of a string (validate whether the string has a specific format) in order to replace or remove certain characters. Below are a few default Regex functions that can be used in BigQuery:

1) BigQuery Regex: RegexP_CONTAINS

RegexP_CONTAINS comprises two inputs and returns TRUE (BOOL type) even if the value is a partial match for the regular expression. If the regular expression argument is invalid, the function returns an error. To search for a full match, you can use ^ (caret symbol) at the beginning of a text and $ (dollar symbol) at the end of a text.

2) BigQuery Regex: RegexP_EXTRACT

RegexP_EXTRACT considers two inputs that return a substring matching a regular expression. If the regular expression contains a capturing group (matching pattern), the function will return a substring that matches the capturing group. However, if the expression does not contain a capturing group, the function would return the entire matching string. If the expression doesn’t find any match, it returns NULL. RegexP_EXTRACT returns an error if:

  • The regular expression is invalid.
  • The regular expression has more than one capturing group.
  • The position is not a positive integer.
  • The occurrence is not a positive integer.

3) BigQuery Regex: RegexP_EXTRACT_ALL

RegexP_EXTRACT_ALL returns an array of all substrings (non-overlapping) of value that matches regular expressions.

4) BigQuery Regex: RegexP_INSTR

RegexP_INSTR considers source_value and regular expression, which returns the location (1-based index) of regular expression in a string. Both the inputs must be of the same type (STRING or BYTES) or else it would return an error if the Regex is invalid or has more than one capturing group. It produces zero if no value is found or the regular expression is empty.

5) BigQuery Regex: RegexP_REPLACE

RegexP_REPLACE takes three inputs, value, regular expression, and the replacement argument. It returns a string where all substrings that match regular expressions are replaced with replacement arguments.

You can also implement a Regex pattern, having escaped digits (1 to 9) within the replacement argument to insert text matching a corresponding parenthesis group. A RegexP_REPLACE function only replaces non-overlapping matches, and if the argument is not a valid expression, the function will return an error.

6) BigQuery Regex: RegexP_SUBSTR

RegexP_SUBSTR is a synonym of RegexP_EXTRACT, and it returns STRING or BYTES data type.

BigQuery Regex Syntax and Example Queries

Regex can be applied to a string by using the letter ‘r’ followed by single quotes having the entire regular expression pattern. To have a better understanding of calling Regex functions, follow the below syntax and example queries:

1) RegexP_CONTAINS

To apply RegexP_CONTAINS, follow the below syntax and example queries:

Syntax

BigQuery Regex - S1
Image Source

CASE I: For instance, if the task is to find a valid email address, you can find it using the below query:

BigQuery Regex - O11
Image Source

CASE II: Form the above example, if the task is to perform a full match, you can use caret (^) and dollar ($) symbols at the beginning and end of the string as shown in the below query:

BigQuery Regex - O12
Image Source

2) RegexP_EXTRACT

The REGEXP_EXTRACT regex function also considers position and occurrence  of string to specify the search start (default starts from beginning) and return specific occurrences in value, respectively. The position must be a positive integer (0) and returns NULL if it is greater than the length of the value. The occurrence parameter either returns specific occurrences of Regex in values, or it returns the first match. Below is the syntax and example query to gain a better understanding.

Syntax

BigQuery Regex - S2
Image Source

CASE I: If the task is to find user_name from an email address that involves extracting the initial string before ‘@’ symbol in an email, you can use ‘^’ as shown in the below query:

BigQuery Regex - O21
Image Source

CASE II: For the same problem, if the task was to extract a domain name from email, you can use ‘$’ at the end of a string, as shown in the below query:

BigQuery Regex - O22
Image Source

CASE III: If the task is to get a detailed list of positions and occurrences of desired patterns, use the below query:

BigQuery Regex - O23
Image Source

3) RegexP_EXTRACT_ALL

To apply RegexP_EXTRACT_ALL, follow the below syntax and example queries:

Syntax

BigQuery Regex - S3
Image Source

If you want to find all strings that have a pattern ‘function(),’ use the below command:

BigQuery Regex - O3
Image Source

4) RegexP_INSTR

The search either starts at the specified position in source_values or begins from the beginning of source_value. If the position is negative, the function searches backward from the end of source_value, with -1 indicating the last character.

Syntax

BigQuery Regex - S4
Image Source

If the occurrence is specified, the search returns the position of Regex in source_value and an index of the first occurrence. An occurrence cannot be zero or negative, and if occurrences are greater than one, the function searches for overlapping occurrences.

There is an optional occurence_position to specify the position with respect to the beginning of occurrence. If the choice is +0, it returns the beginning position of occurrence, whereas +1 returns the first position followed by the end of the occurrence.

CASE I: If your task is to find the position of ‘@’ in any given email address, use the below query:

BigQuery Regex - O41
Image Source

CASE II: For queries that involve more than one occurrence, REGEXP_INSTR returns the overlapping occurrences, as shown in the below query:

BigQuery Regex - O42
Image Source

CASE III: If the task is to show detailed information about position and occurrences of the desired pattern, use the below query:

BigQuery Regex - O43
Image Source

5) RegexP_REPLACE

To apply RegexP_REPLACE, follow the below syntax and example queries:

Syntax

BigQuery Regex - S5
Image Source

Suppose the task is to replace ‘# ’ with <h1> at the beginning and </h1> at the end of a string, use the below query:

BigQuery Regex - O5
Image Source

6) RegexP_SUBSTR

To apply RegexP_SUBSTR, follow the below syntax and example queries:

Syntax

BigQuery Regex - S6
Image Source

For example, if you want to search a string, having ‘H?ello+’ as a regular expression pattern, use the below query:

BigQuery Regex - O6
Image Source 

BigQuery RegExp Tips & Tricks

BigQuery Regex: How to check if a column is a valid phone number

To verify if a column contains a valid phone number or not, you can use REGEXP_CONTAINS function. You can use the following regular expression to match any US formatted number:

^(+d{1,2}s)?(?d{3})?[s.-]?d{3}[s.-]?d{4}$

An example query is given below:

SELECT phone_number,
REGEXP_CONTAINS(phone_number, r"^(+d{1,2}s)?(?d{3})?[s.-]?d{3}[s.-]?d{4}$") AS is_valid 
FROM`projectname.datasetname.tablename`

BigQuery Regex: How to check if a column does not contain a word

To check if a column does not contain a word, you can use the NOT LIKE operator.

This can be done by simple using the NOT operator. For example, using the regular expression with NOT to the start of the expression helps you check if a column does not contain a word.

SELECT message, 
NOT REGEXP_CONTAINS(message, r".*dolor.*") AS contains_word 
FROM `projectname.datasetname.tablename`

BigQuery RegExp: How to check if a column contains a word from another column

To check if a column contains a word from another column, the REGEXP_CONTAINS function can be used. You will also have to use the CONCAT function to concatenate the information from the other column with the regular expression pattern CONCAT(r'(?i)’, word, r”)

SELECT  message, word, 
REGEXP_CONTAINS(message, CONCAT(r'(?i)', word, r''))
FROM `projectname.datasetname.tablename`

BigQuery RegExp: How to replace a word

To replace a specific word from an entry with another one, you can use the REGEXP_REPLACE function.

SELECT message,
REGEXP_REPLACE(message, r"tellus", "REPLACED") AS replaced_word
FROM `projectname.datasetname.tablename`

BigQuery RegExp: How to split a string

To split a string, you can use the function REGEXP_EXTRACT. This regular expression can extract everything that appears before the first comma in a string.

SELECT message, 
REGEXP_EXTRACT(message, r"^(.+?),") AS extracted
FROM `projectname.datasetname.tablename`

BigQuery RegExp: How to replace special characters

To replace special characters, you can use regular expressions like this [^a-zA-Z0-9]+ and REGEXP_REPLACE function.

SELECT message, 
REGEXP_REPLACE(message, r"[^a-zA-Z0-9]+", "") AS removed_special_characters
FROM `projectname.datasetname.tablename`

BigQuery RegExp: How to extract a number from a string

To extract all numbers from a string, you can use the REGEXP_EXTRACT_ALL function again along with this regular expression ([0-9]+) to extract all numbers that appear in a given address.

SELECT address, 
REGEXP_EXTRACT_ALL(address, r"([0-9]+)") AS numbers
FROM `projectname.datasetname.tablename`

When should I use BigQuery Regex?

Regular expressions perform well in string manipulation and for analysis. Using the above BigQuery Regex and its tips and tricks, you can get started and boost your analysing and data manipulation capabilities very well. This leverages your overall skills in handling the unstrcutured and raw data.

Conclusion

Businesses accompanying traditional data warehouses often suffer performance degradations while querying extensive data. On the other hand, BigQuery provides consistent performance to leverage Data Analysis and manipulation tasks on large datasets.

In addition, with Regex functionality, BigQuery helps analysts to filter strings, and perform scans to identify patterns that match desired validations. In case you want to export data into your desired BigQuery Data Warehouse, then Hevo Data is the right choice for you! 

Visit our Website to Explore Hevo

Hevo Data provides its users with a simpler platform for integrating data from 100+ sources for Analysis. It is a No-code Data Pipeline that can help you combine data from multiple sources. You can use it to transfer data from multiple data sources into your Data Warehouse, Database, or a destination of your choice. It provides you with a consistent and reliable solution to managing data in real-time, ensuring that you always have Analysis-ready data in your desired destination.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at our unbeatable pricing that will help you choose the right plan for your business needs!

Share your experience of learning about BigQuery Regex Functions! Let us know in the comments section below!

Data Engineering
Survey 2022
Calling all data engineers – fill out this short survey to help us build an industry report for our data engineering community.
TAKE THE SURVEY
Amazon Gift Cards of $25 each are on offer for all valid and complete survey submissions.