Redshift JSON Functions 101: Key Types, Syntax, and Example Queries Simplified

• December 14th, 2021

Redshift JSON Functions

In today’s world organizations are constantly looking for data-driven insights. However, it can be difficult to Extract, Store, and Analyze that data as it rapidly grows in scale and scope. Redshift, Amazon’s enterprise Data Warehouse, is designed to make large-scale Data Analytics accessible to everyone. When the size of your recorded data increases to Gigabytes, Terabytes, or even Petabytes, your organization needs a more efficient system like a Data Warehouse to manage the massive amount of data.

Amazon Redshift offers no shortage of functionality to help you get the most out of JSON data, but it can be hard to know what Redshift JSON Functions to use and when. This article will introduce you to Amazon Redshift. Furthermore, this article will also describe the different types of Redshift JSON Functions along with their syntax and examples. Read along to learn more about Redshift JSON Functions.

Table of Contents

Prerequisites

  • Hands-on experience with SQL Queries.
  • An AWS account with a provisioned Redshift cluster

What is Redshift?

Redshift Logo
Image Source

Redshift is a fully managed Data Warehousing solution provided by AWS. A Data Warehouse often gets confused with a database, although it is different. It is designed for OLAP (Online Analytical Processing) use cases, instead of OLTP (Online Transactions Processing). It is, therefore, optimized for lengthy, complex queries on terabytes, or even petabytes of data.

Instead of rows, Redshift organizes data in columns. This allows it to use its Massive Parallel Processing (MPP) technology to speed up query execution. It claims to be 10x better in terms of performance than any other Data Warehouse. It is well suited for everyday business operations, like BI analysis. As you would have expected, it integrates seamlessly with several other AWS services like Aurora, S3, Quicksight, etc.

JSON with Redshift

Redshift provides support for JSON Strings. They are essentially stored as either CHAR or VARCHAR data types (VARCHAR to be used when the string contains multi-byte characters). Note that JSON should be used sparingly when working with Redshift. The performance of Redshift suffers when dealing with semi-structured data types, like JSON. This is because JSON can contain unrelated data in a single column. One row may contain two keys in a JSON column, and the next row may contain three different keys in the same column.

Therefore, JSON doesn’t use the column store architecture of Redshift, and cannot benefit from the advantages of columnar data storage in Redshift. There are very few cases where it makes sense to use JSON in Redshift. Reducing a very sparse table (with most cells as null), into a single JSON column is one of them, as it can help you save storage space. 

Understanding SUPER Data Type

Redshift has come up with a new data type (SUPER), for dealing with JSON and other Semistructured data. Going into the specifics of this data type will require a separate article in itself. However, you can read more about it here. The JSON functions that we discuss in this article can work on columns having the CHAR or VARCHAR datatype. However, Redshift recommends using the SUPER data type for processing data in a JSON serialization format. 

When dealing with JSON, you may often want to navigate through the JSON, extracting values corresponding to specific keys, using the dot notation (SELECT json1.key2 from table_with_json_column) or extract values corresponding to specific positions in a JSON array using the bracket notation (SELECT json_array1[0] from table_with_json_array_column). Such queries won’t be possible in Redshift if your JSON data is stored as CHAR or VARCHAR. It is only possible when it is stored as SUPER and is queried using PartiQL. You can read more here.

Having understood the basics of Redshift, JSON with Redshift, and the SUPER data type, let’s now focus on the theme of this article: Redshift JSON Functions. Without much ado, let’s get started.

Simplify Redshift ETL and Analysis with Hevo’s No-code Data Pipeline

A fully managed No-code Data Pipeline platform like Hevo Data helps you integrate and load data from 100+ different sources (including 40+ free sources) to a Data Warehouse such as Amazon Redshift or Destination of your choice in real-time in an effortless manner. Hevo with its minimal learning curve can be set up in just a few minutes allowing the users to load data without having to compromise performance. Its strong integration with umpteenth sources allows users to bring in data of different kinds in a smooth fashion without having to code a single line. 

Get Started with Hevo for free

Check out some of the cool features of Hevo:

  • Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
  • Transformations: Hevo provides preload transformations through Python code. It also allows you to run transformation code for each event in the Data Pipelines you set up. You need to edit the event object’s properties received in the transform method as a parameter to carry out the transformation. Hevo also offers drag and drop transformations like Date and Control Functions, JSON, and Event Manipulation to name a few. These can be configured and tested before putting them to use.
  • Connectors: Hevo supports 100+ integrations to SaaS platforms, files, Databases, analytics, and BI tools. It supports various destinations including Google BigQuery, Amazon Redshift, Firebolt, Snowflake Data Warehouses; Amazon S3 Data Lakes; and MySQL, SQL Server, TokuDB, DynamoDB, PostgreSQL Databases to name a few.  
  • Real-Time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
  • 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
  • Scalable Infrastructure: Hevo has in-built integrations for 100+ sources (including 40+ free sources) that can help you scale your data infrastructure as required.
  • 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.
Sign up here for a 14-day Free Trial!

7 Key Redshift JSON Functions

Querying JSON Fields can easily be done with the help of Redshift JSON Functions. Some of the most used Redshift JSON Functions are discussed below:

1) JSON_PARSE

It is one of the most commonly used Redshift JSON function that is used to convert a string into a SUPER data type. It is used for inserting or updating data in a table in the SUPER data type. 

The syntax is straightforward.

JSON_PARSE(json_string)

For example, 

SELECT JSON_PARSE('{"foo":"bar"}') as myjson

To understand the usage of JSON_PARSE in INSERT/UPDATE queries, consider the following:

CREATE TABLE table_with_super_col(
  	col1 SUPER
  );

INSERT INTO table_with_super_col(col1)
VALUES(JSON_PARSE('{"foo":"bar"}')),
('{"foo":"bar"}')

As you can see, we are inserting two values in the col1 column (type SUPER), one with JSON_PARSE, other without it. If you query the table, this is what you will see:

Redshift JSON Functions: JSON_Parse
Image Source: Self

As you can see, the insert with JSON_PARSE got parsed correctly, while the one without didn’t.

2) JSON_SERIALIZE

This is the opposite of JSON_PARSE. It takes a SUPER expression as the input and returns a textual representation (VARCHAR) as the output. The syntax, again, is straightforward.

JSON_SERIALIZE(super_expression)

Example (continuing on the table created in the previous section):

SELECT JSON_SERIALIZE(col1) from table_with_super_col

This will return the text equivalent of col1.

Redshift JSON Functions: JSON_Serialize
Image Source: Self

Note: SUPER size limit (~ block limit) is higher than the VARCHAR size limit. Therefore, the JSON_SERIALIZE function can return an error if the size of the converted JSON exceeds the VARCHAR size limit.

3) IS_VALID_JSON

This function is used to validate if a JSON string represents a valid JSON. The rules for a JSON to be valid are defined here. It returns a boolean depending on whether the string is a valid JSON. The syntax, again, is straightforward:

IS_VALID_JSON(json_string)

For example

SELECT IS_VALID_JSON('{"temperature":54,"humidity":1.2}')

The result here will be true. Similarly, for 

SELECT IS_VALID_JSON('{"temperature":54;"humidity":1.2}')

The result will be false (there is a semicolon instead of the comma). Please note that this function will not work for JSON arrays (it will return false if a JSON array is given as an input). Use the below function for JSON arrays.

4) IS_VALID_JSON_ARRAY

This is similar to the above, except that it works for JSON arrays. The syntax is similar to the above.

IS_VALID_JSON_ARRAY(json_array_string)

For example:

SELECT IS_VALID_JSON_ARRAY('["temperature","humidity"]')

Returns true. However,

SELECT IS_VALID_JSON_ARRAY('["temperature","humidity":{"foo":"bar"}]')

Returns false.

5) JSON_ARRAY_LENGTH

As the name suggests, this function returns the number of elements of a JSON array. Note that in the case of nested arrays, it returns the number of elements in the outer array. The syntax is:

JSON_ARRAY_LENGTH(json_array_string [, null_if_invalid])

The second argument is optional, and if set to true, it will return null if the JSON array is invalid. If set to false (default), it will return an error in case of an invalid array. For example

SELECT JSON_ARRAY_LENGTH('["temperature","humidity",["foo","bar"]]')

Returns 3.

SELECT JSON_ARRAY_LENGTH('["temperature","humidity",["foo":"bar"]]')

Returns an error as the JSON array is invalid.

6) JSON_EXTRACT_ARRAY_ELEMENT_TEXT

As the name indicates, this function allows you to extract the text at a specific position from the JSON array. Note that the positions start from 0. The syntax is:

JSON_EXTRACT_ARRAY_ELEMENT_TEXT(json_array_string, pos [, null_if_invalid])

The first argument is the string representing the JSON array, the second argument is the position of the element that we want to extract as text, and the third optional argument tells the function to return NULL if the array string is not valid. By default, it is set to false, in which case, the function will return an error if the string is invalid.

If you provide a negative or out-of-bounds value for the position, the function returns an empty string (NULL). 

Note that the return type is VARCHAR.

Let’s look at some examples.

SELECT JSON_EXTRACT_ARRAY_ELEMENT_TEXT('["temperature","humidity",["foo","bar"]]', 0, true)

Returns ‘temperature’

SELECT JSON_EXTRACT_ARRAY_ELEMENT_TEXT('["temperature","humidity",["foo","bar"]]', 2, true)

Returns ‘[“foo”:”bar”]’

SELECT JSON_EXTRACT_ARRAY_ELEMENT_TEXT('["temperature","humidity",["foo","bar"]]', 3, true)

Returns NULL.

SELECT JSON_EXTRACT_ARRAY_ELEMENT_TEXT('["temperature","humidity",["foo":"bar"]]', 0, false)

Returns an error as the JSON string is invalid.

7) JSON_EXTRACT_PATH_TEXT

This is similar to extracting values from a JSON by referencing keys. The syntax is:

JSON_EXTRACT_PATH_TEXT(json_string, path_elem1 [, path_elem2 [, …] ] [, null_if_invalid])

You can add upto 5 path elements (only 1 is compulsory), meaning you can extract values from a nested JSON up to 5 levels deep. The optional null_if_valid argument, when set to true, returns NULL if the json_string is invalid. If set to false (default), it returns an error when the JSON string is invalid. 

If a path element does not exist, this function returns an empty string (NULL).

The return type is VARCHAR.

Let’s consider a few examples:

SELECT JSON_EXTRACT_PATH_TEXT('{"key1":"value1","key2":{"key3":"value3","key4":"value4","key5":{"key6":"value6"}}}', 'key1')

Returns ‘value1’

SELECT JSON_EXTRACT_PATH_TEXT('{"key1":"value1","key2":{"key3":"value3","key4":"value4","key5":{"key6":"value6"}}}', 'key2', 'key3')

Returns ‘value3’.

SELECT JSON_EXTRACT_PATH_TEXT('{"key1":"value1","key2":{"key3":"value3","key4":"value4","key5":{"key6":"value6"}}}', 'key2', 'key5','key6')

Returns ‘value6’

SELECT JSON_EXTRACT_PATH_TEXT('{"key1":"value1","key2":{"key3":"value3","key4":"value4","key5":{"key6":"value6"}}}', 'key2', 'key5','key7')

Returns NULL.

SELECT JSON_EXTRACT_PATH_TEXT('{"key1":"value1","key2":{"key3":"value3","key4":"value4","key5":{"key6","value6"}}}', 'key2', 'key5','key6')

Returns an error (‘key6’ and ‘value6’ are separated by a comma instead of a colon, rendering the JSON invalid). Whereas

SELECT JSON_EXTRACT_PATH_TEXT('{"key1":"value1","key2":{"key3":"value3","key4":"value4","key5":{"key6","value6"}}}', 'key2', 'key5','key6', true)

Returns NULL. 

Operators and Functions for SUPER Data Type

We saw how the JSON_SERIALIZE function can be used for converting a JSON string or array to the SUPER data type. Once converted, there are some functions and operators that you can use on the SUPER columns. These include arithmetic operators like +, -, *, etc., arithmetic functions like FLOOR, ROUND, CEIL, etc., and array functions like ARRAY_CONCAT, ARRAY_FLATTEN, SUBARRAY, etc. You can read more about these functions (along with examples), here.

Conclusion

This article introduced you to some of the most important Redshift JSON Functions along with their syntax and example queries. We’d like to reiterate again that JSON should be used sparingly with Redshift, as the performance of Redshift suffers when dealing with semi-structured data. You also learned about the SUPER data type in Redshift, how it is associated to the Redshift JSON Functions, and how you can further explore functions and operators that work on the SUPER data type.

With your Data Warehouse, Amazon Redshift live and running, you’ll need to extract data from multiple platforms to carry out your analysis. However, integrating and analyzing your data from a diverse set of data sources can be challenging and this is where Hevo Data comes into the picture.

Visit our Website to Explore Hevo

Hevo Data, a No-code Data Pipeline provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of Desired Destinations such as Amazon Redshift, with a few clicks. Hevo Data with its strong integration with 100+ sources (including 40+ free sources) allows you to not only export data from your desired data sources & load it to the destination of your choice, but also transform & enrich your data to make it analysis-ready so that you can focus on your key business needs and perform insightful analysis using BI tools.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at our unbeatable pricing that will help you choose the right plan for your business needs!

Share your experience of learning about Redshift JSON Functions. Let us know in the comments section below!

No-code Data Pipeline for Amazon Redshift