Regular expressions are an incredibly powerful tool for manipulating, searching, and transforming text data. Also known as regex, they provide a concise way to match strings of text against specified patterns. Many data professionals, however, find them cryptic and intimidating to work with.
In this post, we will explore BigQuery Regex. We will see how regular expressions work in Google BigQuery, specifically looking at BigQuery’s REGEXP_CONTAINS and REGEXP_EXTRACT functions for matching and extracting patterns from text data using SQL syntax.
What is BigQuery
BigQuery is Google Cloud’s fully managed, serverless, petabyte-scale data warehouse solution. Since its inception in 2012, it has provided cost-effective and real-time analytics on a vast amount of data that helps businesses to gain agility. With BigQuery, analysts get the support of ANSI-SQL at the backend, which helps users to focus more on finding meaningful insights swiftly.
In addition, BigQuery facilitates built-in features like Machine Learning, Geospatial Analysis, and Business Intelligence for organizations to manage and analyze data effectively.
Introduction to Regex
Originated in 1951, Regex, abbreviated as the regular expression, is a sequence of characters used for pattern matching with strings or string matching. As regular expressions are used in every programming language like Java, Python, and SQL, it is helpful to match patterns with a sequence of characters. To get a broader view, the importance and understanding of Regex are as discussed below.
What is BigQuery Regex?
BigQuery is one of the most popular Cloud Data Warehouses used by millions of people all over the world for storage and analysis. In BigQuery Analysis, there are a lot of cases where one would want to extract certain parts of a string, validate the format of a string, and even replace or remove certain characters. This is where BigQuery Regex comes into the picture. As discussed, regular expressions are a pattern or a sequence of characters used for pattern matching. It helps you perform various string-matching operations in BigQuery.
Implementing regex functions in BigQuery can be complex, but Hevo simplifies the process with its intuitive no-code platform. Here’s how Hevo makes working with BigQuery regex easier:
- Create and manage regex patterns with a visual, drag-and-drop transformation feature.
- Consolidate the process of data migration and transformation.
- See results in real time to make the analysis easier and more efficient.
Try Hevo and join a growing community of 2000+ data professionals who rely on Hevo for seamless and efficient migrations and transformations.
Move Data to BigQuery for Free
Understanding Regex and Pattern Matching
A regular expression is composed of various elements that become the building blocks for constructing patterns. To understand Regex and Pattern Matching, consider a few components mentioned below:
- Simple Patterns: It consists of patterns that help find a direct match of characters. These patterns may also include combinations of strings or literals but should be in an exact sequence.
- Special Patterns: If a search requires more than a direct match, it accompanies a combination of many symbols, characters, and numbers arranged to achieve desired results. Below are a few elements:
- Repeaters: consists of symbols that direct a program to repeat a preceding character more than once. It comprises of ‘*,’ ‘+,’ {}‘ to repeat (up to infinite times) a particular character in a sequence.
- Wildcard: consists of a dot symbol that can take the place of any symbol, and hence called wildcard character. It can be used to match any single character (letter, digits, and white space).
- Symbols: Regex utilizes many symbols to compute a pattern. Following are a few symbols used:
- Optional character (?) tells the computer that the preceding character may or may not be present in the matching string.
- Caret symbol (^) helps to set the position for matching at the beginning of a string or line.
- Dollar symbol ($) assists the computer in setting the position for matching at the end of a string or line.
- Character class: matches a set of characters to match the most basic elements of language. Below are a few character classes:
/s : | matches any whitespace characters like space and tab |
/S : | matches any non-whitespace characters |
/d : | matches any digit character |
/D : | matches any non-digit characters |
/w : | matches any word character (alpha-numeric) |
/W : | matches any non-word character |
- Escape: is used to search any special character as a literal. You can escape using a backslash ‘’ in front of strings. This informs the computer to treat the following character as a search character and consider it for pattern matching.
- Grouping: is done to combine a set of characters to behave as a single unit. Regex allows the grouping of elements by wrapping a parenthesis ‘()’ around the expression. Users can also match one or more elements by a vertical bar ‘|.’
Regex in SQL
SQL supports Pattern Matching operations based on the RegexP operator. It helps implement pattern search using a query in a database. Below are the patterns used in SQL:
Pattern | What pattern matches |
[abc] | Any character listed between square brackets |
* | Repetition of preceding string from zero to infinite times |
+ | Repetition of preceding string from one to infinite times |
. | A single character |
? | Match a zero-or-one preceding string |
$ | Matches end of a string |
^ | Matches beginning of a string |
[^abc] | Any character not listed between square brackets |
[a-z] | Matches any lowercase letter |
[ABC] | Matches any uppercase letter |
[0-9] | Matches any digit from 0 to 9 |
[[:<:]] | Matches the beginning of words |
[[:>:]] | Matches the end of words |
a1 | a2 | a3 | Matches any of the patterns a1, a2, or a3 |
6 BigQuery Regex Functions
As BigQuery uses SQL at the backend, it allows users to match, search and replace or validate a string input. Analysis often requires you to extract only certain parts of a string (validate whether the string has a specific format) in order to replace or remove certain characters. Below are a few default Regex functions that can be used in BigQuery:
BigQuery Transformations Made Easy!
No credit card required
1) BigQuery Regex: RegexP_CONTAINS
RegexP_CONTAINS comprises two inputs and returns TRUE (BOOL type) even if the value is a partial match for the regular expression. If the regular expression argument is invalid, the function returns an error. To search for a full match, you can use ^ (caret symbol) at the beginning of a text and $ (dollar symbol) at the end of a text.
2) RegexP_EXTRACT
RegexP_EXTRACT considers two inputs that return a substring matching a regular expression. If the regular expression contains a capturing group (matching pattern), the function will return a substring that matches the capturing group. However, if the expression does not contain a capturing group, the function would return the entire matching string. If the expression doesn’t find any match, it returns NULL. RegexP_EXTRACT returns an error if:
- The regular expression is invalid.
- The regular expression has more than one capturing group.
- The position is not a positive integer.
- The occurrence is not a positive integer.
3) RegexP_EXTRACT_ALL
RegexP_EXTRACT_ALL returns an array of all substrings (non-overlapping) of value that matches regular expressions.
4) RegexP_INSTR
RegexP_INSTR considers source_value and regular expression, which returns the location (1-based index) of regular expression in a string. Both the inputs must be of the same type (STRING or BYTES) or else it would return an error if the Regex is invalid or has more than one capturing group. It produces zero if no value is found or the regular expression is empty.
5) RegexP_REPLACE
RegexP_REPLACE takes three inputs, value, regular expression, and the replacement argument. It returns a string where all substrings that match regular expressions are replaced with replacement arguments.
You can also implement a Regex pattern, having escaped digits (1 to 9) within the replacement argument to insert text matching a corresponding parenthesis group. A RegexP_REPLACE function only replaces non-overlapping matches, and if the argument is not a valid expression, the function will return an error.
6) RegexP_SUBSTR
RegexP_SUBSTR is a synonym of RegexP_EXTRACT, and it returns STRING or BYTES data type.
BigQuery Regex Syntax and Example Queries
Regex can be applied to a string by using the letter ‘r’ followed by single quotes having the entire regular expression pattern. To have a better understanding of calling Regex functions, follow the below syntax and example queries:
1) RegexP_CONTAINS
To apply RegexP_CONTAINS
, follow the below syntax and example queries:
Syntax
CASE I: For instance, if the task is to find a valid email address, you can find it using the below query:
RegexP_CONTAINS: \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b
Take an example for a table called customers that contains information about the customers’ names, phone numbers, and emails.
We can utilize the provided pattern with the REGEXP_CONTAINS
function to validate each email in the table. The query will take the following form:
SELECT name, email,
REGEXP_CONTAINS(email,'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}') AS valid_email
FROM customers;
The result will look like:
CASE II: From the above example, if the task is to perform a full match, you can use caret (^) and dollar ($) symbols at the beginning and end of the string as shown in the below query:
RegexP_CONTAINS: ^YourPatternHere$
Replace “YourPatternHere” with the specific pattern you want to match from the beginning to the end of the string. This ensures a full match within the string.
2) RegexP_EXTRACT
The REGEXP_EXTRACT regex function also considers position and occurrence of string to specify the search start (default starts from beginning) and return specific occurrences in value, respectively. The position must be a positive integer (0) and returns NULL if it is greater than the length of the value. The occurrence parameter either returns specific occurrences of Regex in values, or it returns the first match. Below is the syntax and example query to gain a better understanding.
Syntax
REGEXP_EXTRACT(value, regexp[, position[, occurrence]])
CASE I: If the task is to find user_name from an email address that involves extracting the initial string before ‘@’ symbol in an email, you can use ‘^’ as shown in the below query:
WITH email_addresses AS
(SELECT 'foo@example.com' as email
UNION ALL
SELECT 'bar@example.org' as email
UNION ALL
SELECT 'baz@example.net' as email)
SELECT
REGEXP_EXTRACT(email, r'^[a-zA-Z0-9_.+-]+')
AS user_name
FROM email_addresses;
/*-----------*
| user_name |
+-----------+
| foo |
| bar |
| baz |
*-----------*/
CASE II: For the same problem, if the task was to extract a domain name from email, you can use ‘$’ at the end of a string, as shown in the below query:
WITH email_addresses AS
(SELECT 'foo@example.com' as email
UNION ALL
SELECT 'bar@example.org' as email
UNION ALL
SELECT 'baz@example.net' as email)
SELECT
REGEXP_EXTRACT(email, r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.([a-zA-Z0-9-.]+$)')
AS top_level_domain
FROM email_addresses;
/*------------------*
| top_level_domain |
+------------------+
| com |
| org |
| net |
*------------------*/
CASE III: If the task is to get a detailed list of positions and occurrences of desired patterns, use the below query:
WITH email_addresses AS
(SELECT 'foo@example.com' as email
UNION ALL
SELECT 'bar@example.org' as email
UNION ALL
SELECT 'baz@example.net' as email)
SELECT
email,
REGEXP_EXTRACT(email, r'^([a-zA-Z0-9_.+-]+)@', 1) AS user_name,
REGEXP_EXTRACT(email, r'@([a-zA-Z0-9-]+)\.([a-zA-Z0-9-.]+)$', 1) AS domain_name,
REGEXP_EXTRACT(email, r'@([a-zA-Z0-9-]+)\.([a-zA-Z0-9-.]+)$', 2) AS top_level_domain
FROM email_addresses;
/*-------------------------------*
| email | user_name | domain_name | top_level_domain |
+---------------------+-----------+-------------+------------------+
| foo@example.com | foo | example | com |
| bar@example.org | bar | example | org |
| baz@example.net | baz | example | net |
*---------------------+-----------+-------------+------------------*/
3) RegexP_EXTRACT_ALL
To apply RegexP_EXTRACT_ALL, follow the below syntax and example queries:
Syntax:
REGEXP_EXTRACT_ALL(value, regexp)
If you want to find all strings that have a pattern ‘function(),’ use the below command:
WITH code_markdown AS
(SELECT 'Try `function(x)` or `function(y)`' as code)
SELECT
REGEXP_EXTRACT_ALL(code, '`(.+?)`') AS example
FROM code_markdown;
/*----------------------------*
| example |
+----------------------------+
| [function(x), function(y)] |
*----------------------------*/
4) RegexP_INSTR
The search either starts at the specified position in source_values or begins from the beginning of source_value. If the position is negative, the function searches backward from the end of source_value, with -1 indicating the last character.
Syntax
REGEXP_INSTR(source_value, regexp [, position[, occurrence, [occurrence_position]]])
If the occurrence is specified, the search returns the position of Regex in source_value and an index of the first occurrence. An occurrence cannot be zero or negative, and if occurrences are greater than one, the function searches for overlapping occurrences.
There is an optional occurence_position to specify the position with respect to the beginning of occurrence. If the choice is +0, it returns the beginning position of occurrence, whereas +1 returns the first position followed by the end of the occurrence.
CASE I: If your task is to find the position of ‘@’ in any given email address, use the below query:
WITH example AS (
SELECT 'ab@cd-ef' AS source_value, '@[^-]*' AS regexp UNION ALL
SELECT 'ab@d-ef', '@[^-]*' UNION ALL
SELECT 'abc@cd-ef', '@[^-]*' UNION ALL
SELECT 'abc-ef', '@[^-]*')
SELECT source_value, regexp, REGEXP_INSTR(source_value, regexp) AS instr
FROM example;
/*--------------+--------+-------*
| source_value | regexp | instr |
+--------------+--------+-------+
| ab@cd-ef | @[^-]* | 3 |
| ab@d-ef | @[^-]* | 3 |
| abc@cd-ef | @[^-]* | 4 |
| abc-ef | @[^-]* | 0 |
*--------------+--------+-------*/
CASE II: For queries that involve more than one occurrence, REGEXP_INSTR returns the overlapping occurrences, as shown in the below query:
WITH example AS (
SELECT 'a@cd-ef b@cd-ef' AS source_value, '@[^-]*' AS regexp, 1 AS position UNION ALL
SELECT 'a@cd-ef b@cd-ef', '@[^-]*', 2 UNION ALL
SELECT 'a@cd-ef b@cd-ef', '@[^-]*', 3 UNION ALL
SELECT 'a@cd-ef b@cd-ef', '@[^-]*', 4)
SELECT
source_value, regexp, position,
REGEXP_INSTR(source_value, regexp, position) AS instr
FROM example;
/*-----------------+--------+----------+-------*
| source_value | regexp | position | instr |
+-----------------+--------+----------+-------+
| a@cd-ef b@cd-ef | @[^-]* | 1 | 2 |
| a@cd-ef b@cd-ef | @[^-]* | 2 | 2 |
| a@cd-ef b@cd-ef | @[^-]* | 3 | 10 |
| a@cd-ef b@cd-ef | @[^-]* | 4 | 10 |
*-----------------+--------+----------+-------*/
CASE III: If the task is to show detailed information about position and occurrences of the desired pattern, use the below query:
WITH example AS (
SELECT 'a@cd-ef b@cd-ef c@cd-ef' AS source_value,
'@[^-]*' AS regexp, 1 AS position, 1 AS occurrence UNION ALL
SELECT 'a@cd-ef b@cd-ef c@cd-ef', '@[^-]*', 1, 2 UNION ALL
SELECT 'a@cd-ef b@cd-ef c@cd-ef', '@[^-]*', 1, 3)
SELECT
source_value, regexp, position, occurrence,
REGEXP_INSTR(source_value, regexp, position, occurrence) AS instr
FROM example;
/*-------------------------+--------+----------+------------+-------*
| source_value | regexp | position | occurrence | instr |
+-------------------------+--------+----------+------------+-------+
| a@cd-ef b@cd-ef c@cd-ef | @[^-]* | 1 | 1 | 2 |
| a@cd-ef b@cd-ef c@cd-ef | @[^-]* | 1 | 2 | 10 |
| a@cd-ef b@cd-ef c@cd-ef | @[^-]* | 1 | 3 | 18 |
*-------------------------+--------+----------+------------+-------*/
5) RegexP_REPLACE
To apply RegexP_REPLACE, follow the below syntax and example queries:
Syntax
REGEXP_REPLACE(value, regexp, replacement)
Suppose the task is to replace ‘# ’ with <h1> at the beginning and </h1> at the end of a string, use the below query:
WITH markdown AS
(SELECT '# Heading' as heading
UNION ALL
SELECT '# Another heading' as heading)
SELECT
REGEXP_REPLACE(heading, r'^# ([a-zA-Z0-9\s]+$)', '<h1>\\1</h1>')
AS html
FROM markdown;
/*--------------------------*
| html |
+--------------------------+
| <h1>Heading</h1> |
| <h1>Another heading</h1> |
*--------------------------*/
6) RegexP_SUBSTR
To apply RegexP_SUBSTR, follow the below syntax and example queries:
Syntax
REGEXP_SUBSTR(value, regexp[, position[, occurrence]])
For example, if you want to search a string, having ‘H?ello+’ as a regular expression pattern, use the below query:
WITH example AS
(SELECT 'Hello World Helloo' AS value, 'H?ello+' AS regex, 1 AS position, 1 AS
occurrence
)
SELECT value, regex, position, occurrence, REGEXP_SUBSTR(value, regex,
position, occurrence) AS regexp_value FROM example;
/*--------------------+---------+----------+------------+--------------*
| value | regex | position | occurrence | regexp_value |
+--------------------+---------+----------+------------+--------------+
| Hello World Helloo | H?ello+ | 1 | 1 | Hello |
*--------------------+---------+----------+------------+--------------*/
BigQuery RegExp Tips & Tricks
BigQuery Regex: How to check if a column is a valid phone number
To verify if a column contains a valid phone number or not, you can use REGEXP_CONTAINS function. You can use the following regular expression to match any US formatted number:
^(+d{1,2}s)?(?d{3})?[s.-]?d{3}[s.-]?d{4}$
An example query is given below:
SELECT phone_number,
REGEXP_CONTAINS(phone_number, r"^(+d{1,2}s)?(?d{3})?[s.-]?d{3}[s.-]?d{4}$") AS is_valid
FROM`projectname.datasetname.tablename`
How to check if a column does not contain a word
To check if a column does not contain a word, you can use the NOT LIKE operator.
This can be done by simple using the NOT operator. For example, using the regular expression with NOT to the start of the expression helps you check if a column does not contain a word.
SELECT message,
NOT REGEXP_CONTAINS(message, r".*dolor.*") AS contains_word
FROM `projectname.datasetname.tablename`
How to check if a column contains a word from another column
To check if a column contains a word from another column, the REGEXP_CONTAINS function can be used. You will also have to use the CONCAT function to concatenate the information from the other column with the regular expression pattern CONCAT(r'(?i)’, word, r”)
SELECT message, word,
REGEXP_CONTAINS(message, CONCAT(r'(?i)', word, r''))
FROM `projectname.datasetname.tablename`
How to replace a word
To replace a specific word from an entry with another one, you can use the REGEXP_REPLACE function.
SELECT message,
REGEXP_REPLACE(message, r"tellus", "REPLACED") AS replaced_word
FROM `projectname.datasetname.tablename`
How to split a string
To split a string, you can use the function REGEXP_EXTRACT. This regular expression can extract everything that appears before the first comma in a string.
SELECT message,
REGEXP_EXTRACT(message, r"^(.+?),") AS extracted
FROM `projectname.datasetname.tablename`
How to replace special characters
To replace special characters, you can use regular expressions like this [^a-zA-Z0-9]+ and REGEXP_REPLACE function.
SELECT message,
REGEXP_REPLACE(message, r"[^a-zA-Z0-9]+", "") AS removed_special_characters
FROM `projectname.datasetname.tablename`
How to extract a number from a string
To extract all numbers from a string, you can use the REGEXP_EXTRACT_ALL function again along with this regular expression ([0-9]+) to extract all numbers that appear in a given address.
SELECT address,
REGEXP_EXTRACT_ALL(address, r"([0-9]+)") AS numbers
FROM `projectname.datasetname.tablename`
Importance of Regex in Programming
Many programming languages provide Regex either built-in or via libraries that have wide applications in Search Engines, URL Matching, and Lexical Analysis. Below are a few essential aspects of Regex:
- Pattern Matching: Regular expressions are useful in matching partial or full patterns on a string. These patterns are either enclosed between single quotes or embedded in a constructor function.
- Ease of Usage: A regex pattern typically consists of simple characters or a combination of simple and special characters. As literals (character) form the most basic building block, most characters in Regex match a string effortlessly.
- User-defined: As Regex supports a number of metacharacters with no specific hierarchy, it is user-friendly.
When should I use BigQuery Regex?
Regular expressions perform well in string manipulation and for analysis. Using the above BigQuery Regex and its tips and tricks, you can get started and boost your analysing and data manipulation capabilities very well. This leverages your overall skills in handling the unstructured and raw data.
Conclusion
Businesses accompanying traditional data warehouses often suffer performance degradations while querying extensive data. On the other hand, BigQuery provides consistent performance to leverage Data Analysis and manipulation tasks on large datasets.
Learn more about BigQuery with these essential reads:
In addition, with Regex functionality, BigQuery helps analysts to filter strings, and perform scans to identify patterns that match desired validations. In case you want to export data into your desired BigQuery Data Warehouse, then Hevo Data is the right choice for you!
Share your experience of learning about BigQuery Regex Functions! Let us know in the comments section below!
Amit Kulkarni specializes in creating informative and engaging content on data science, leveraging his problem-solving and analytical thinking skills. He excels in delivering AI and automation solutions, developing generative chatbots, and providing data-driven AI & ML solutions. Amit holds a Master's degree and a Bachelor's degree in Electrical Engineering, consistently achieving distinction in his studies.