Regular expressions are an incredibly powerful tool for manipulating, searching, and transforming text data. Also known as regex, they provide a concise way to match strings of text against specified patterns. Many data professionals, however, find them cryptic and intimidating to work with.

In this post, we will explore BigQuery Regex. We will see how regular expressions work in Google BigQuery, specifically looking at BigQuery’s REGEXP_CONTAINS and REGEXP_EXTRACT functions for matching and extracting patterns from text data using SQL syntax.

What is BigQuery

BigQuery is Google Cloud’s fully managed, serverless, petabyte-scale data warehouse solution. Since its inception in 2012, it has provided cost-effective and real-time analytics on a vast amount of data that helps businesses to gain agility. With BigQuery, analysts get the support of ANSI-SQL at the backend, which helps users to focus more on finding meaningful insights swiftly.

In addition, BigQuery facilitates built-in features like Machine Learning, Geospatial Analysis, and Business Intelligence for organizations to manage and analyze data effectively.

Introduction to Regex

Originated in 1951, Regex, abbreviated as the regular expression, is a sequence of characters used for pattern matching with strings or string matching. As regular expressions are used in every programming language like Java, Python, and SQL, it is helpful to match patterns with a sequence of characters. To get a broader view, the importance and understanding of Regex are as discussed below.

What is BigQuery Regex?

BigQuery is one of the most popular Cloud Data Warehouses used by millions of people all over the world for storage and analysis. In BigQuery Analysis, there are a lot of cases where one would want to extract certain parts of a string, validate the format of a string, and even replace or remove certain characters. This is where BigQuery Regex comes into the picture. As discussed, regular expressions are a pattern or a sequence of characters used for pattern matching. It helps you perform various string-matching operations in BigQuery.

Understanding Regex and Pattern Matching

A regular expression is composed of various elements that become the building blocks for constructing patterns. To understand Regex and Pattern Matching, consider a few components mentioned below:

  • Simple Patterns: It consists of patterns that help find a direct match of characters. These patterns may also include combinations of strings or literals but should be in an exact sequence.
  • Special Patterns: If a search requires more than a direct match, it accompanies a combination of many symbols, characters, and numbers arranged to achieve desired results. Below are a few elements:
  • Repeaters: consists of symbols that direct a program to repeat a preceding character more than once. It comprises of  ‘*,’ ‘+,’ {}‘ to repeat (up to infinite times) a particular character in a sequence.
  • Wildcard: consists of a dot symbol that can take the place of any symbol, and hence called wildcard character. It can be used to match any single character (letter, digits, and white space).
  • Symbols: Regex utilizes many symbols to compute a pattern. Following are a few symbols used:
    • Optional character (?) tells the computer that the preceding character may or may not be present in the matching string.
    • Caret symbol (^) helps to set the position for matching at the beginning of a string or line.
    • Dollar symbol ($) assists the computer in setting the position for matching at the end of a string or line.
  • Character class: matches a set of characters to match the most basic elements of language. Below are a few character classes:
/smatches any whitespace characters like space and tab
/S :matches any non-whitespace characters
/d :matches any digit character
/Dmatches any non-digit characters
/wmatches any word character (alpha-numeric)
/W :matches any non-word character
  • Escape: is used to search any special character as a literal. You can escape using a backslash ‘’ in front of strings. This informs the computer to treat the following character as a search character and consider it for pattern matching.
  • Grouping: is done to combine a set of characters to behave as a single unit. Regex allows the grouping of elements by wrapping a parenthesis ‘()’ around the expression. Users can also match one or more elements by a vertical bar ‘|.’

Regex in SQL

SQL supports Pattern Matching operations based on the RegexP operator. It helps implement pattern search using a query in a database. Below are the patterns used in SQL:

PatternWhat pattern matches
[abc]Any character listed between square brackets
*Repetition of preceding string from zero to infinite times
+Repetition of preceding string from one to infinite times
.A single character
?Match a zero-or-one preceding string
$Matches end of a string
^Matches beginning of a string
[^abc]Any character not listed between square brackets
[a-z]Matches any lowercase letter
[ABC]Matches any uppercase letter
[0-9]Matches any digit from 0 to 9
[[:<:]]Matches the beginning of words
[[:>:]]Matches the end of words
a1 | a2 | a3Matches any of the patterns a1, a2, or a3

6 BigQuery Regex Functions

As BigQuery uses SQL at the backend, it allows users to match, search and replace or validate a string input. Analysis often requires you to extract only certain parts of a string (validate whether the string has a specific format) in order to replace or remove certain characters. Below are a few default Regex functions that can be used in BigQuery:

1) BigQuery Regex: RegexP_CONTAINS

RegexP_CONTAINS comprises two inputs and returns TRUE (BOOL type) even if the value is a partial match for the regular expression. If the regular expression argument is invalid, the function returns an error. To search for a full match, you can use ^ (caret symbol) at the beginning of a text and $ (dollar symbol) at the end of a text.

2) RegexP_EXTRACT

RegexP_EXTRACT considers two inputs that return a substring matching a regular expression. If the regular expression contains a capturing group (matching pattern), the function will return a substring that matches the capturing group. However, if the expression does not contain a capturing group, the function would return the entire matching string. If the expression doesn’t find any match, it returns NULL. RegexP_EXTRACT returns an error if:

  • The regular expression is invalid.
  • The regular expression has more than one capturing group.
  • The position is not a positive integer.
  • The occurrence is not a positive integer.

3) RegexP_EXTRACT_ALL

RegexP_EXTRACT_ALL returns an array of all substrings (non-overlapping) of value that matches regular expressions.

4) RegexP_INSTR

RegexP_INSTR considers source_value and regular expression, which returns the location (1-based index) of regular expression in a string. Both the inputs must be of the same type (STRING or BYTES) or else it would return an error if the Regex is invalid or has more than one capturing group. It produces zero if no value is found or the regular expression is empty.

5) RegexP_REPLACE

RegexP_REPLACE takes three inputs, value, regular expression, and the replacement argument. It returns a string where all substrings that match regular expressions are replaced with replacement arguments.

You can also implement a Regex pattern, having escaped digits (1 to 9) within the replacement argument to insert text matching a corresponding parenthesis group. A RegexP_REPLACE function only replaces non-overlapping matches, and if the argument is not a valid expression, the function will return an error.

6) RegexP_SUBSTR

RegexP_SUBSTR is a synonym of RegexP_EXTRACT, and it returns STRING or BYTES data type.

BigQuery Regex Syntax and Example Queries

Regex can be applied to a string by using the letter ‘r’ followed by single quotes having the entire regular expression pattern. To have a better understanding of calling Regex functions, follow the below syntax and example queries:

1) RegexP_CONTAINS

To apply RegexP_CONTAINS, follow the below syntax and example queries:

Syntax

CASE I: For instance, if the task is to find a valid email address, you can find it using the below query:

RegexP_CONTAINS: \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b

Take an example for a table called customers that contains information about the customers’ names, phone numbers, and emails.

BigQuery Regex
Image Source

We can utilize the provided pattern with the REGEXP_CONTAINS function to validate each email in the table. The query will take the following form:

SELECT name, email,
  REGEXP_CONTAINS(email,'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}') AS valid_email
FROM customers;

The result will look like:

BigQuery Regex

CASE II: From the above example, if the task is to perform a full match, you can use caret (^) and dollar ($) symbols at the beginning and end of the string as shown in the below query:

RegexP_CONTAINS: ^YourPatternHere$

Replace “YourPatternHere” with the specific pattern you want to match from the beginning to the end of the string. This ensures a full match within the string.

2) RegexP_EXTRACT

The REGEXP_EXTRACT regex function also considers position and occurrence  of string to specify the search start (default starts from beginning) and return specific occurrences in value, respectively. The position must be a positive integer (0) and returns NULL if it is greater than the length of the value. The occurrence parameter either returns specific occurrences of Regex in values, or it returns the first match. Below is the syntax and example query to gain a better understanding.

Syntax

REGEXP_EXTRACT(value, regexp[, position[, occurrence]])

CASE I: If the task is to find user_name from an email address that involves extracting the initial string before ‘@’ symbol in an email, you can use ‘^’ as shown in the below query:

WITH email_addresses AS
  (SELECT 'foo@example.com' as email
  UNION ALL
  SELECT 'bar@example.org' as email
  UNION ALL
  SELECT 'baz@example.net' as email)

SELECT
  REGEXP_EXTRACT(email, r'^[a-zA-Z0-9_.+-]+')
  AS user_name
FROM email_addresses;

/*-----------*
 | user_name |
 +-----------+
 | foo       |
 | bar       |
 | baz       |
 *-----------*/

CASE II: For the same problem, if the task was to extract a domain name from email, you can use ‘$’ at the end of a string, as shown in the below query:

WITH email_addresses AS
  (SELECT 'foo@example.com' as email
  UNION ALL
  SELECT 'bar@example.org' as email
  UNION ALL
  SELECT 'baz@example.net' as email)

SELECT
  REGEXP_EXTRACT(email, r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.([a-zA-Z0-9-.]+$)')
  AS top_level_domain
FROM email_addresses;

/*------------------*
 | top_level_domain |
 +------------------+
 | com              |
 | org              |
 | net              |
 *------------------*/

CASE III: If the task is to get a detailed list of positions and occurrences of desired patterns, use the below query:

WITH email_addresses AS
  (SELECT 'foo@example.com' as email
  UNION ALL
  SELECT 'bar@example.org' as email
  UNION ALL
  SELECT 'baz@example.net' as email)

SELECT
  email,
  REGEXP_EXTRACT(email, r'^([a-zA-Z0-9_.+-]+)@', 1) AS user_name,
  REGEXP_EXTRACT(email, r'@([a-zA-Z0-9-]+)\.([a-zA-Z0-9-.]+)$', 1) AS domain_name,
  REGEXP_EXTRACT(email, r'@([a-zA-Z0-9-]+)\.([a-zA-Z0-9-.]+)$', 2) AS top_level_domain
FROM email_addresses;

/*-------------------------------*
  | email              | user_name | domain_name | top_level_domain |
  +---------------------+-----------+-------------+------------------+
  | foo@example.com    | foo       | example     | com              |
  | bar@example.org    | bar       | example     | org              |
  | baz@example.net    | baz       | example     | net              |
  *---------------------+-----------+-------------+------------------*/

3) RegexP_EXTRACT_ALL

To apply RegexP_EXTRACT_ALL, follow the below syntax and example queries:

Syntax:

REGEXP_EXTRACT_ALL(value, regexp)

If you want to find all strings that have a pattern ‘function(),’ use the below command:

WITH code_markdown AS
  (SELECT 'Try `function(x)` or `function(y)`' as code)

SELECT
  REGEXP_EXTRACT_ALL(code, '`(.+?)`') AS example
FROM code_markdown;

/*----------------------------*
 | example                    |
 +----------------------------+
 | [function(x), function(y)] |
 *----------------------------*/

4) RegexP_INSTR

The search either starts at the specified position in source_values or begins from the beginning of source_value. If the position is negative, the function searches backward from the end of source_value, with -1 indicating the last character.

Syntax

REGEXP_INSTR(source_value, regexp [, position[, occurrence, [occurrence_position]]])

If the occurrence is specified, the search returns the position of Regex in source_value and an index of the first occurrence. An occurrence cannot be zero or negative, and if occurrences are greater than one, the function searches for overlapping occurrences.

There is an optional occurence_position to specify the position with respect to the beginning of occurrence. If the choice is +0, it returns the beginning position of occurrence, whereas +1 returns the first position followed by the end of the occurrence.

CASE I: If your task is to find the position of ‘@’ in any given email address, use the below query:

WITH example AS (
  SELECT 'ab@cd-ef' AS source_value, '@[^-]*' AS regexp UNION ALL
  SELECT 'ab@d-ef', '@[^-]*' UNION ALL
  SELECT 'abc@cd-ef', '@[^-]*' UNION ALL
  SELECT 'abc-ef', '@[^-]*')
SELECT source_value, regexp, REGEXP_INSTR(source_value, regexp) AS instr
FROM example;

/*--------------+--------+-------*
 | source_value | regexp | instr |
 +--------------+--------+-------+
 | ab@cd-ef     | @[^-]* | 3     |
 | ab@d-ef      | @[^-]* | 3     |
 | abc@cd-ef    | @[^-]* | 4     |
 | abc-ef       | @[^-]* | 0     |
 *--------------+--------+-------*/

CASE II: For queries that involve more than one occurrence, REGEXP_INSTR returns the overlapping occurrences, as shown in the below query:

WITH example AS (
  SELECT 'a@cd-ef b@cd-ef' AS source_value, '@[^-]*' AS regexp, 1 AS position UNION ALL
  SELECT 'a@cd-ef b@cd-ef', '@[^-]*', 2 UNION ALL
  SELECT 'a@cd-ef b@cd-ef', '@[^-]*', 3 UNION ALL
  SELECT 'a@cd-ef b@cd-ef', '@[^-]*', 4)
SELECT
  source_value, regexp, position,
  REGEXP_INSTR(source_value, regexp, position) AS instr
FROM example;

/*-----------------+--------+----------+-------*
 | source_value    | regexp | position | instr |
 +-----------------+--------+----------+-------+
 | a@cd-ef b@cd-ef | @[^-]* | 1        | 2     |
 | a@cd-ef b@cd-ef | @[^-]* | 2        | 2     |
 | a@cd-ef b@cd-ef | @[^-]* | 3        | 10    |
 | a@cd-ef b@cd-ef | @[^-]* | 4        | 10    |
 *-----------------+--------+----------+-------*/

CASE III: If the task is to show detailed information about position and occurrences of the desired pattern, use the below query:

WITH example AS (
  SELECT 'a@cd-ef b@cd-ef c@cd-ef' AS source_value,
         '@[^-]*' AS regexp, 1 AS position, 1 AS occurrence UNION ALL
  SELECT 'a@cd-ef b@cd-ef c@cd-ef', '@[^-]*', 1, 2 UNION ALL
  SELECT 'a@cd-ef b@cd-ef c@cd-ef', '@[^-]*', 1, 3)
SELECT
  source_value, regexp, position, occurrence,
  REGEXP_INSTR(source_value, regexp, position, occurrence) AS instr
FROM example;

/*-------------------------+--------+----------+------------+-------*
 | source_value            | regexp | position | occurrence | instr |
 +-------------------------+--------+----------+------------+-------+
 | a@cd-ef b@cd-ef c@cd-ef | @[^-]* | 1        | 1          | 2     |
 | a@cd-ef b@cd-ef c@cd-ef | @[^-]* | 1        | 2          | 10    |
 | a@cd-ef b@cd-ef c@cd-ef | @[^-]* | 1        | 3          | 18    |
 *-------------------------+--------+----------+------------+-------*/

5) RegexP_REPLACE

To apply RegexP_REPLACE, follow the below syntax and example queries:

Syntax

REGEXP_REPLACE(value, regexp, replacement)

Suppose the task is to replace ‘# ’ with <h1> at the beginning and </h1> at the end of a string, use the below query:

WITH markdown AS
  (SELECT '# Heading' as heading
  UNION ALL
  SELECT '# Another heading' as heading)

SELECT
  REGEXP_REPLACE(heading, r'^# ([a-zA-Z0-9\s]+$)', '<h1>\\1</h1>')
  AS html
FROM markdown;

/*--------------------------*
 | html                     |
 +--------------------------+
 | <h1>Heading</h1>         |
 | <h1>Another heading</h1> |
 *--------------------------*/

6) RegexP_SUBSTR

To apply RegexP_SUBSTR, follow the below syntax and example queries:

Syntax

REGEXP_SUBSTR(value, regexp[, position[, occurrence]])

For example, if you want to search a string, having ‘H?ello+’ as a regular expression pattern, use the below query:

WITH example AS
(SELECT 'Hello World Helloo' AS value, 'H?ello+' AS regex, 1 AS position, 1 AS
occurrence
)
SELECT value, regex, position, occurrence, REGEXP_SUBSTR(value, regex,
position, occurrence) AS regexp_value FROM example;

/*--------------------+---------+----------+------------+--------------*
 | value              | regex   | position | occurrence | regexp_value |
 +--------------------+---------+----------+------------+--------------+
 | Hello World Helloo | H?ello+ | 1        | 1          | Hello        |
 *--------------------+---------+----------+------------+--------------*/

BigQuery RegExp Tips & Tricks

BigQuery Regex: How to check if a column is a valid phone number

To verify if a column contains a valid phone number or not, you can use REGEXP_CONTAINS function. You can use the following regular expression to match any US formatted number:

^(+d{1,2}s)?(?d{3})?[s.-]?d{3}[s.-]?d{4}$

An example query is given below:

SELECT phone_number,
REGEXP_CONTAINS(phone_number, r"^(+d{1,2}s)?(?d{3})?[s.-]?d{3}[s.-]?d{4}$") AS is_valid 
FROM`projectname.datasetname.tablename`

How to check if a column does not contain a word

To check if a column does not contain a word, you can use the NOT LIKE operator.

This can be done by simple using the NOT operator. For example, using the regular expression with NOT to the start of the expression helps you check if a column does not contain a word.

SELECT message, 
NOT REGEXP_CONTAINS(message, r".*dolor.*") AS contains_word 
FROM `projectname.datasetname.tablename`

How to check if a column contains a word from another column

To check if a column contains a word from another column, the REGEXP_CONTAINS function can be used. You will also have to use the CONCAT function to concatenate the information from the other column with the regular expression pattern CONCAT(r'(?i)’, word, r”)

SELECT  message, word, 
REGEXP_CONTAINS(message, CONCAT(r'(?i)', word, r''))
FROM `projectname.datasetname.tablename`

How to replace a word

To replace a specific word from an entry with another one, you can use the REGEXP_REPLACE function.

SELECT message,
REGEXP_REPLACE(message, r"tellus", "REPLACED") AS replaced_word
FROM `projectname.datasetname.tablename`

How to split a string

To split a string, you can use the function REGEXP_EXTRACT. This regular expression can extract everything that appears before the first comma in a string.

SELECT message, 
REGEXP_EXTRACT(message, r"^(.+?),") AS extracted
FROM `projectname.datasetname.tablename`

How to replace special characters

To replace special characters, you can use regular expressions like this [^a-zA-Z0-9]+ and REGEXP_REPLACE function.

SELECT message, 
REGEXP_REPLACE(message, r"[^a-zA-Z0-9]+", "") AS removed_special_characters
FROM `projectname.datasetname.tablename`

How to extract a number from a string

To extract all numbers from a string, you can use the REGEXP_EXTRACT_ALL function again along with this regular expression ([0-9]+) to extract all numbers that appear in a given address.

SELECT address, 
REGEXP_EXTRACT_ALL(address, r"([0-9]+)") AS numbers
FROM `projectname.datasetname.tablename`

Importance of Regex in Programming

Many programming languages provide Regex either built-in or via libraries that have wide applications in Search Engines, URL Matching, and Lexical Analysis. Below are a few essential aspects of Regex:

  • Pattern Matching: Regular expressions are useful in matching partial or full patterns on a string. These patterns are either enclosed between single quotes or embedded in a constructor function. 
  • Ease of Usage: A regex pattern typically consists of simple characters or a combination of simple and special characters. As literals (character) form the most basic building block, most characters in Regex match a string effortlessly.
  • User-defined: As Regex supports a number of metacharacters with no specific hierarchy, it is user-friendly.

When should I use BigQuery Regex?

Regular expressions perform well in string manipulation and for analysis. Using the above BigQuery Regex and its tips and tricks, you can get started and boost your analysing and data manipulation capabilities very well. This leverages your overall skills in handling the unstructured and raw data.

Conclusion

Businesses accompanying traditional data warehouses often suffer performance degradations while querying extensive data. On the other hand, BigQuery provides consistent performance to leverage Data Analysis and manipulation tasks on large datasets.

Learn more about BigQuery with these essential reads:

In addition, with Regex functionality, BigQuery helps analysts to filter strings, and perform scans to identify patterns that match desired validations. In case you want to export data into your desired BigQuery Data Warehouse, then Hevo Data is the right choice for you! 

Share your experience of learning about BigQuery Regex Functions! Let us know in the comments section below!

Amit Kulkarni
Technical Content Writer, Hevo Data

Amit Kulkarni specializes in creating informative and engaging content on data science, leveraging his problem-solving and analytical thinking skills. He excels in delivering AI and automation solutions, developing generative chatbots, and providing data-driven AI & ML solutions. Amit holds a Master's degree and a Bachelor's degree in Electrical Engineering, consistently achieving distinction in his studies.

No-code Data Pipeline for BigQuery