In today’s world organizations are constantly looking for data-driven insights. However, it can be difficult to Extract, Store, and Analyze that data as it rapidly grows in scale and scope. Redshift, Amazon’s enterprise Data Warehouse, is designed to make large-scale Data Analytics accessible to everyone. When the size of your recorded data increases to Gigabytes, Terabytes, or even Petabytes, your organization needs a more efficient system like a Data Warehouse to manage the massive amount of data.
Amazon Redshift offers no shortage of functionality to help you get the most out of JSON data, but it can be hard to know what Redshift JSON Functions to use and when. This article will introduce you to Amazon Redshift. Furthermore, this article will also describe the different types of Redshift JSON Functions along with their syntax and examples. Read along to learn more about Redshift JSON Functions.
Prerequisites
- Hands-on experience with SQL Queries.
- An AWS account with a provisioned Redshift cluster
What is Redshift?
Redshift is a fully managed Data Warehousing solution provided by AWS. A Data Warehouse often gets confused with a database, although it is different. It is designed for OLAP (Online Analytical Processing) use cases, instead of OLTP (Online Transactions Processing). It is, therefore, optimized for lengthy, complex queries on terabytes, or even petabytes of data.
Instead of rows, Redshift organizes data in columns. This allows it to use its Massive Parallel Processing (MPP) technology to speed up query execution. It claims to be 10x better in terms of performance than any other Data Warehouse. It is well suited for everyday business operations, like BI analysis. As you would have expected, it integrates seamlessly with several other AWS services like Aurora, S3, Quicksight, etc.
JSON with Redshift
Redshift provides support for JSON Strings. They are essentially stored as either CHAR or VARCHAR data types (VARCHAR to be used when the string contains multi-byte characters). Note that JSON should be used sparingly when working with Redshift. The performance of Redshift suffers when dealing with semi-structured data types, like JSON. This is because JSON can contain unrelated data in a single column. One row may contain two keys in a JSON column, and the next row may contain three different keys in the same column.
Therefore, JSON doesn’t use the column store architecture of Redshift, and cannot benefit from the advantages of columnar data storage in Redshift. There are very few cases where it makes sense to use JSON in Redshift. Reducing a very sparse table (with most cells as null), into a single JSON column is one of them, as it can help you save storage space.
Understanding SUPER Data Type
Redshift has come up with a new data type (SUPER), for dealing with JSON and other Semistructured data. Going into the specifics of this data type will require a separate article in itself. However, you can read more about it here. The JSON functions that we discuss in this article can work on columns having the CHAR or VARCHAR datatype. However, Redshift recommends using the SUPER data type for processing data in a JSON serialization format.
When dealing with JSON, you may often want to navigate through the JSON, extracting values corresponding to specific keys, using the dot notation (SELECT json1.key2 from table_with_json_column) or extract values corresponding to specific positions in a JSON array using the bracket notation (SELECT json_array1[0] from table_with_json_array_column). Such queries won’t be possible in Redshift if your JSON data is stored as CHAR or VARCHAR. It is only possible when it is stored as SUPER and is queried using PartiQL. You can read more here.
Having understood the basics of Redshift, JSON with Redshift, and the SUPER data type, let’s now focus on the theme of this article: Redshift JSON Functions. Without much ado, let’s get started.
A fully managed No-code Data Pipeline platform like Hevo Data helps you integrate and load data from 150+ different sources (including 40+ free sources) to a Data Warehouse such as Amazon Redshift or Destination of your choice in real-time in an effortless manner. Hevo with its minimal learning curve can be set up in just a few minutes allowing the users to load data without having to compromise performance. Its strong integration with umpteenth sources allows users to bring in data of different kinds in a smooth fashion without having to code a single line.
Get Started with Hevo for free
7 Key Redshift JSON Functions
Querying JSON Fields can easily be done with the help of Redshift JSON Functions. Some of the most used Redshift JSON Functions are discussed below:
1) JSON_PARSE
It is one of the most commonly used Redshift JSON function that is used to convert a string into a SUPER data type. It is used for inserting or updating data in a table in the SUPER data type.
The syntax is straightforward.
JSON_PARSE(json_string)
For example,
SELECT JSON_PARSE('{"foo":"bar"}') as myjson
To understand the usage of JSON_PARSE in INSERT/UPDATE queries, consider the following:
CREATE TABLE table_with_super_col(
col1 SUPER
);
INSERT INTO table_with_super_col(col1)
VALUES(JSON_PARSE('{"foo":"bar"}')),
('{"foo":"bar"}')
As you can see, we are inserting two values in the col1 column (type SUPER), one with JSON_PARSE, other without it. If you query the table, this is what you will see:
As you can see, the insert with JSON_PARSE got parsed correctly, while the one without didn’t.
2) JSON_SERIALIZE
This is the opposite of JSON_PARSE. It takes a SUPER expression as the input and returns a textual representation (VARCHAR) as the output. The syntax, again, is straightforward.
JSON_SERIALIZE(super_expression)
Example (continuing on the table created in the previous section):
SELECT JSON_SERIALIZE(col1) from table_with_super_col
This will return the text equivalent of col1.
Note: SUPER size limit (~ block limit) is higher than the VARCHAR size limit. Therefore, the JSON_SERIALIZE function can return an error if the size of the converted JSON exceeds the VARCHAR size limit.
3) IS_VALID_JSON
This function is used to validate if a JSON string represents a valid JSON. The rules for a JSON to be valid are defined here. It returns a boolean depending on whether the string is a valid JSON. The syntax, again, is straightforward:
IS_VALID_JSON(json_string)
For example
SELECT IS_VALID_JSON('{"temperature":54,"humidity":1.2}')
The result here will be true. Similarly, for
SELECT IS_VALID_JSON('{"temperature":54;"humidity":1.2}')
The result will be false (there is a semicolon instead of the comma). Please note that this function will not work for JSON arrays (it will return false if a JSON array is given as an input). Use the below function for JSON arrays.
4) IS_VALID_JSON_ARRAY
This is similar to the above, except that it works for JSON arrays. The syntax is similar to the above.
IS_VALID_JSON_ARRAY(json_array_string)
For example:
SELECT IS_VALID_JSON_ARRAY('["temperature","humidity"]')
Returns true. However,
SELECT IS_VALID_JSON_ARRAY('["temperature","humidity":{"foo":"bar"}]')
Returns false.
5) JSON_ARRAY_LENGTH
As the name suggests, this function returns the number of elements of a JSON array. Note that in the case of nested arrays, it returns the number of elements in the outer array. The syntax is:
JSON_ARRAY_LENGTH(json_array_string [, null_if_invalid])
The second argument is optional, and if set to true, it will return null if the JSON array is invalid. If set to false (default), it will return an error in case of an invalid array. For example
SELECT JSON_ARRAY_LENGTH('["temperature","humidity",["foo","bar"]]')
Returns 3.
SELECT JSON_ARRAY_LENGTH('["temperature","humidity",["foo":"bar"]]')
Returns an error as the JSON array is invalid.
6) JSON_EXTRACT_ARRAY_ELEMENT_TEXT
As the name indicates, this function allows you to extract the text at a specific position from the JSON array. Note that the positions start from 0. The syntax is:
JSON_EXTRACT_ARRAY_ELEMENT_TEXT(json_array_string, pos [, null_if_invalid])
The first argument is the string representing the JSON array, the second argument is the position of the element that we want to extract as text, and the third optional argument tells the function to return NULL if the array string is not valid. By default, it is set to false, in which case, the function will return an error if the string is invalid.
If you provide a negative or out-of-bounds value for the position, the function returns an empty string (NULL).
Note that the return type is VARCHAR.
Let’s look at some examples.
SELECT JSON_EXTRACT_ARRAY_ELEMENT_TEXT('["temperature","humidity",["foo","bar"]]', 0, true)
Returns ‘temperature’
SELECT JSON_EXTRACT_ARRAY_ELEMENT_TEXT('["temperature","humidity",["foo","bar"]]', 2, true)
Returns ‘[“foo”:”bar”]’
SELECT JSON_EXTRACT_ARRAY_ELEMENT_TEXT('["temperature","humidity",["foo","bar"]]', 3, true)
Returns NULL.
SELECT JSON_EXTRACT_ARRAY_ELEMENT_TEXT('["temperature","humidity",["foo":"bar"]]', 0, false)
Returns an error as the JSON string is invalid.
7) JSON_EXTRACT_PATH_TEXT
This is similar to extracting values from a JSON by referencing keys. The syntax is:
JSON_EXTRACT_PATH_TEXT(json_string, path_elem1 [, path_elem2 [, …] ] [, null_if_invalid])
You can add upto 5 path elements (only 1 is compulsory), meaning you can extract values from a nested JSON up to 5 levels deep. The optional null_if_valid argument, when set to true, returns NULL if the json_string is invalid. If set to false (default), it returns an error when the JSON string is invalid.
If a path element does not exist, this function returns an empty string (NULL).
The return type is VARCHAR.
Let’s consider a few examples:
SELECT JSON_EXTRACT_PATH_TEXT('{"key1":"value1","key2":{"key3":"value3","key4":"value4","key5":{"key6":"value6"}}}', 'key1')
Returns ‘value1’
SELECT JSON_EXTRACT_PATH_TEXT('{"key1":"value1","key2":{"key3":"value3","key4":"value4","key5":{"key6":"value6"}}}', 'key2', 'key3')
Returns ‘value3’.
SELECT JSON_EXTRACT_PATH_TEXT('{"key1":"value1","key2":{"key3":"value3","key4":"value4","key5":{"key6":"value6"}}}', 'key2', 'key5','key6')
Returns ‘value6’
SELECT JSON_EXTRACT_PATH_TEXT('{"key1":"value1","key2":{"key3":"value3","key4":"value4","key5":{"key6":"value6"}}}', 'key2', 'key5','key7')
Returns NULL.
SELECT JSON_EXTRACT_PATH_TEXT('{"key1":"value1","key2":{"key3":"value3","key4":"value4","key5":{"key6","value6"}}}', 'key2', 'key5','key6')
Returns an error (‘key6’ and ‘value6’ are separated by a comma instead of a colon, rendering the JSON invalid). Whereas
SELECT JSON_EXTRACT_PATH_TEXT('{"key1":"value1","key2":{"key3":"value3","key4":"value4","key5":{"key6","value6"}}}', 'key2', 'key5','key6', true)
Returns NULL.
Operators and Functions for SUPER Data Type
We saw how the JSON_SERIALIZE function can be used for converting a JSON string or array to the SUPER data type. Once converted, there are some functions and operators that you can use on the SUPER columns. These include arithmetic operators like +, -, *, etc., arithmetic functions like FLOOR, ROUND, CEIL, etc., and array functions like ARRAY_CONCAT, ARRAY_FLATTEN, SUBARRAY, etc. You can read more about these functions (along with examples), here.
Learn More About:
Harnessing Redshift Super Data Type
Conclusion
This article introduced you to some of the most important Redshift JSON Functions along with their syntax and example queries. We’d like to reiterate again that JSON should be used sparingly with Redshift, as the performance of Redshift suffers when dealing with semi-structured data. You also learned about the SUPER data type in Redshift, how it is associated to the Redshift JSON Functions, and how you can further explore functions and operators that work on the SUPER data type.
With your Data Warehouse, Amazon Redshift live and running, you’ll need to extract data from multiple platforms to carry out your analysis. However, integrating and analyzing your data from a diverse set of data sources can be challenging and this is where Hevo Data comes into the picture.
Visit our Website to Explore Hevo
Hevo Data, a No-code Data Pipeline provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of Desired Destinations such as Amazon Redshift, with a few clicks. Hevo Data with its strong integration with 150+ sources (including 40+ free sources) allows you to not only export data from your desired data sources & load it to the destination of your choice, but also transform & enrich your data to make it analysis-ready so that you can focus on your key business needs and perform insightful analysis using BI tools.
Frequently Asked Questions
1. Can Redshift handle JSON data?
Yes, Amazon Redshift can handle JSON data. Redshift provides the SUPER
data type to store semi-structured data like JSON and allows querying this data using the PARTITION BY
and FLATTEN
functions.
2. What does the JSON () function do?
In general, the JSON()
function is used to create a JSON object in databases or programming languages. However, this function’s exact behavior can vary depending on the context, such as converting a value to a JSON format or extracting JSON data.
3. What does JSON parse () do?
JSON.parse()
is a JavaScript function that converts a JSON string into a JavaScript object. It is commonly used to parse JSON data received from a web server or read from a file.
Yash is a trusted expert in the data industry, recognized for his role in driving online success for companies through data integration and analysis. With a Dual Degree (B. Tech + M. Tech) in Mechanical Engineering from IIT Bombay, Yash brings strategic vision and analytical rigor to every project. He is proficient in Python, R, TensorFlow, and Hadoop, using these tools to develop predictive models and optimize data workflows.