One of Snowflake’s greatest strengths is how fast it can ingest both structured and unstructured data. But, understanding different aspects of handling Snowflake requires effort. You need a guide to figure out how Snowflake handles its JSON objects, and arrays in JSON data. This is exactly what we are going to cover in this blog.
In this post, you will look at how you can work with JSON files in Snowflake, how to directly query the Snowflake JSON data and copy it into a Snowflake table. Let’s get started!
Basic JSON Syntax
Rules:
- <KEY>:<VALUE> Data is in key/value pairs
- Data is separated by commas
- Objects are enclosed in curly braces ({})
- Arrays are enclosed in square brackets ([])
"movies":[
{"movie":"Iron Man", "producer":"Avi Arad"},
{"movie":"Sherlock Holmes", "producer":"Joel Silver"},
{"movie":"Dolittle", "producer":"Joe Roth"}
]
JSON Syntax Rewrite :
- Key/Value Pairs: JSON data is written as key/value pairs. It consists of a field name (in double quotes), followed by a colon, then the value. For example: “firstName”:”John”.
- Objects: In JSON, objects are a collection of name/value pairs and are enclosed in curly braces {}. An object can contain multiple name/value pairs. Each name is followed by a colon and the name/value pairs are separated by commas. For example: {“firstName”:”John”, “lastName”:”Doe”}.
- Arrays: Arrays in JSON are ordered collections and are enclosed in square brackets []. An array can contain multiple values, which can be strings, numbers, objects, arrays, boolean, null, etc. The values in an array are separated by commas. For example:
"movies":[
{"movie":"Iron Man", "producer":"Avi Arad"},
{"movie":"Sherlock Holmes", "producer":"Joel Silver"},
{"movie":"Dolittle", "producer":"Joe Roth"}
]
4. Values: JSON values can be:
- A number (integer or floating point). For example: 100, 3.14.
- A string (must be in double quotes). For example: “Hello”.
- A Boolean (true or false). For example: true, false.
- An array (enclosed in square brackets). For example: [“apple”, “banana”, “cherry”].
- An object (enclosed in curly braces). For example: {“name”:”John”, “age”:30}.
null , which represents an empty value.
Why Worry About JSON?
Over the years, Javascript has risen in popularity to become the most popular server-side scripting language. This has led to there being lots of JSON data out there.
- There are lots of applications and websites that use JSON for tracking web logs.
- Massive amounts of JSON data is being outputted by IoT and wMobile devices as they communicate with REST APIs in JSON format.
- Lots of open datasets are publicly published in JSON, for example;
- data.gov
- datasf.org
- opendata.cityofnewyork.us
- data.austintexas.gov
Data engineers and analysts often use the ‘snowflake parse json’ feature in Snowflake to extract valuable insights from semi-structured JSON data. This shows that there is a lot of data out there that can be incorporated into the analysis processes that companies are doing in their Data Warehouses.
Why Query Snowflake JSON Data?
In Snowflake, you can natively ingest semi-structured data not only in JSON but also in XML, Parquet, Avro, ORC, and other formats. This means that in Snowflake, you can efficiently store JSON data and then access it using SQL.
Snowflake JSON allows you to parse json into columns. You can then query this data with SQL and join it to other structured data without having to do any transformations. This allows organizations to reduce the complexity of their data pipelines as well as increase the speed at which this data becomes available for analysis. The ease of Snowflake JSON parsing makes it a preferred choice for organizations seeking streamlined data extraction and analysis from JSON files. In addition to this, Snowflake understands nesting, data types and the difference between objects and lists. Thus, Snowflake offers native support to JSON.
Load data MariaDB to Snowflake
Load data MySQL to Snowflake
Load data PostgreSQL to Snowflake
How does Snowflake handle Snowflake JSON Objects?
Snowflake does this using a data type they call a VARIANT. This data type allows semi-structured data to be loaded, as is, into a column called VARIANT in a relational table as you shall see later. As this data is ingested, Snowflake automatically discovers the attributes, keys, and structure that exists in the JSON document. Statistics are also collected in Snowflake’s metadata repository which enables optimization.
Consider an Employee table with different employee attributes created using statements as shown below:
The table structure for employee object wrapped in JSON object, then the table structure will look like this:
How to Parse JSON in Snowflake?
When working with complex JSON data in Snowflake, the ‘snowflake parse json into columns’ functionality simplifies the process of extracting structured information. Parse_json Snowflake function is a powerful tool that allows you to effortlessly extract and manipulate JSON data within your data warehouse.
Below is an example of using VARIANT Data types to Parse and Insert JSON objects in the Snowflake table.
Follow the steps below to parse Snowflake JSON data in Snowflake:
Step 1: Create a Table
Execute a simple create statement.
create or replace table json_table (v variant);
This command creates a single-column table with a column “v”.
Step 2: Load JSON Data
Load a sample JSON document using a simple INSERT statement and Snowflake’s PARSE_JSON function.
insert into json_table
select
parse_json(
'{
"fullName":"Robert Downey",
"age":55,
"gender":"Male",
"address": {
"areaCode":"91506",
"suite":"916 West Burbank Blvd"
},
"movies": [
{"name":"Iron Man","budget":"$150M","producer":"Avi Arad"},
{"name":"Sherlock Holmes","budget":"$200M","producer":"Joel Silver"},
{"name":"Dolittle","budget":"$175M","producer":"Joe Roth"}
]
}');
In this example, you have an array of highly nested objects. In a production environment, you would use snowpipe or a copy command e.g.
copy into json_table from '@demo_stage/path_to_file/file1.json;
OR
copy into json_table from 's3://demo_bucket/file1.json;
Snowflake parses the JSON into sub-columns based on the key-value pairs at load time. These keys are recorded as pointers in metadata for optimization. Structural information is dynamically derived based on the schema definition embedded in the JSON string (Schema-on-Read).
No other characters except colon (:) are allowed after column names, especially for the root element. Also, keys in JSON files are case-sensitive.
Consider another example of an employee table in Snowflake. On execution, the record looks as shown below:
Step 3: Start Pulling Data
Let’s get the data from the sub-column “fullName”.
select v:fullName from json_table;
1 Row Produced | |
row # | V:FULLNAME |
1 | “Robert Downey” |
A colon is used to reference the JSON sub-columns where:
v = the column name in the json_table table
fullName = attribute in the JSON schema
v:fullName = notation to indicate which attribute in column “v” we want to select
Step 4: Casting the Data
The table that was returned in the previous step came back with double quotes around it and now you are going to cast that data into an actual data type.
select v:fullName::string as full_name from json_table;
1 Row Produced | |
row # | V:FULL_NAME |
1 | Robert Downey |
You used :: to cast to a string data type and added an alias with “as” just like in regular SQL.
You can now query JSON data without learning a new programming language.
select
v:fullName::string as full_name,
v:age::int as age,
v:gender::string as gender
from json_table
1 Row Produced | | | |
row # | FULL_NAME | AGE | GENDER |
1 | Robert Downey | 55 | Male |
On execution, it shows the data type of each attribute
Handling More Snowflake JSON Complexity
The sample Snowflake JSON document that you used contains some nested data. “address” is a sub-column and “areaCode” and “suite” are sub-columns of the sub-column. You can access these columns using a very familiar table.column dot notation.
select
v:address.areaCode::string as area_code,
v:address.suite::string as suite
from json_table;
1 Row Produced | | |
row # | AREA_CODE | SUITE |
1 | 91506 | 916 West Burbank Blvd |
If the structure changes and a new attribute-like state is added to the Snowflake JSON document, any SQL you’ve written already will work. It is also fairly easy to modify the query and easily adapt the SQL, for example;
select
v:address.areaCode::string as area_code,
v:address.suite::string as suite,
v:address.state::string as state
from json_table;
If the reverse happens and an attribute is dropped, the query will not fail. It will simply return a NULL value.
Consider another example of an employee table where the employee has two phone numbers. Use column dot notation to query this.
So, no matter how complex your JSON data is, it can be easily accessed in Snowflake.
Handling Arrays in Snowflake JSON
You can embed arrays inside Snowflake JSON documents, for example;
"movies": [
{"name": "Iron Man", "budget": "$150", "producer": "Avi Arad"},
{"name": "Sherlock Homes", "budget": "$200", "producer": "Joel Silver"},
{"name": "Dolittle", "budget": "$175", "producer": "Joe Roth"}
]
“movies” is a sub-column with an array. There are 3 rows in the array and each row has 3 sub-columns (name, budget, and producer). To pull the data of each row, use FLATTEN with the FROM clause and give it a table alias. FLATTEN takes an array and returns a row for each element in the array. It selects all the data in the array as though it were rows in the table.
select
f.value:name::string as movie_name,
f.value:budget::string as movie_budget,
f.value:producer::string as movie_producer
from json_table, table(flatten(v:movies)) f;
f = table alias for the flattened array
name = sub-column in the array
value = returns the value of the sub-column
3 Rows Produced | | | |
row # | MOVIE_NAME | MOVIE_BUDGET | MOVIE_PRODUCER |
1 | Iron Man | $150M | Avi Arad |
2 | Sherlock Holmes | $200M | Joel Silver |
3 | Dolittle | $175M | Joe Roth |
Method 2: Parsing JSON Arrays Directly from a Staged JSON File
You can access the first entry in the JSON object array within the JSON data file uploaded as json_sample_data2 like so:
SELECT 'The First Employee Record is '||
S.$1:root[0].employees[0].firstName||
' '||S.$1:root[0].employees[0].lastName
FROM @~/json/json_sample_data2 (file_format => 'json') as S;
The returned output would be: The First Employee Record Is: John Doe.
Utilizing the FLATTEN Function for JSON Array Parsing
The FLATTEN function is utilized to parse JSON arrays. FLATTEN is a table function that transforms a repeated field into a set of rows. It takes a single scalar value containing multiple values for a repeated field and unrolls it into multiple records, generating one record for each value of the previously repeated field. Any non-repeated fields are duplicated to populate each of the new records that are created. FLATTEN effectively removes one level of nesting.
In the context of uploaded JSON data stored in a single column (S.$1), the provided query parses the array within the json_sample_data file. When the query applies the LATERAL FLATTEN function, it results in three values for the fullName field.
select t.value
from @~/json/json_sample_data.gz (file_format => 'json') as S
, table(flatten(S.$1,'root')) t;
select t.value:fullName
from @~/json/json_sample_data.gz (file_format => 'json') as S,
table(flatten(S.$1,'root')) t;
Consider another example of JSON data containing novel information of three authors. Here, we are using the Lateral FLATTEN function on the Category array.
Image Source
Image Source
The final query and output are:
Method 3: Using the PARSE_JSON Function
This method parses text as a JSON document and returns a VARIANT value. If the input is NULL, the output is also NULL. If the input text is ‘null’, it is processed as a JSON null value, therefore the output is not a SQL NULL, but a legal VARIANT value containing null (the distinction is obvious when displaying this VARIANT value).
SELECT t.value:children[0],
t.value:children[0].age,
parse_json(t.value:children[0].age)
FROM @~/json/json_sample_data.gz (file_format => 'json') as S, table(flatten(S.$1,'root')) t;
Method 4: Using the GET_PATH Function
The GET_PATH method extracts a value from structured data based on a path name. It is a GET variation that accepts a variation, OBJECT, or ARRAY value as the first parameter and then extracts the VARIANT value of the field or element using the path name as the second argument.
Path name notation is a typical JavaScript notation consisting of a concatenation of field names (identifiers) followed by dots and index operators [index]. The leading dot in the first field name is optional. The index operations accept non-negative decimal integers (for arrays) or single or double-quoted text literals (for objects).
GET_PATH represents a chain of GET functions. If the path name matches any element, it returns NULL.
SELECT GET_PATH(S.$1, 'root[0].fullName')
FROM @~/json/json_sample_data.gz (file_format => 'json') as S;
SELECT t.value:fullName
FROM @~/json/json_sample_data.gz (file_format => 'json') as S
, table(flatten(S.$1,'root')) t;
SELECT
t.value[0]:children as First_person_children,
t.value[1]:children as Second_person_children,
t.value[2]:children as Third_Person_children
FROM @~/json/json_sample_data.gz (file_format => 'json') as S
, table(flatten(S.$1,'')) t;
Output
The SELECT command returns the whole contents of the json_sample_data file (some results are shown):):
+-----------------------------------+
| VALUE |
|-----------------------------------|
| [ |
| { |
| "age": 22, |
| "children": [ |
| { |
| "age": "6", |
| "gender": "Female", |
| "name": "Jane" |
| }, |
| { |
| "age": "15", |
| "gender": "Male", |
| "name": "John" |
| } |
| ], |
| "citiesLived": [ |
| { |
| "place": "Seattle", |
| "yearsLived": [ |
| "1995" |
| ] |
| }, |
| { |
| "place": "Stockholm", |
| "yearsLived": [ |
| "2005" |
| ] |
| } |
| ], |
| "fullName": "John Doe", |
| "gender": "Male", |
| "kind": "person", |
| "phoneNumber": { |
| "areaCode": "206", |
| "number": "1234567" |
| } |
| }, |
..
+-----------------------------------
SELECT t.value
FROM @~/json/json_sample_data.gz (file_format => 'json') as S
, table(flatten(S.$1,'root')) t;
Output
The three data arrays representing the three workers’ records and their dependant data are returned.
SELECT
array_size(t.value)
FROM @~/json/json_sample_data.gz (file_format => 'json') as S
, table(flatten(S.$1,'')) t;
Output
In the JSON example file used in this lesson, there are three employee records and all of their dependent entries (children’s names, ages, and so on).
SELECT emps.value:fullName, citliv.value:place, citliv.value:yearsLived , yrliv.value
FROM @~/json/json_sample_data.gz (file_format => 'json') as S,
table(flatten(S.$1,'root')) emps,
table(flatten(emps.value:citiesLived,'')) citliv ,
table(flatten(citliv.value:yearsLived,'')) yrliv;
Output:
SELECT emps.value:fullName, citliv.value:place, count(yrliv.value) as Years_Count
FROM @~/json/json_sample_data.gz (file_format => 'json') as S,
table(flatten(S.$1,'root')) emps,
table(flatten(emps.value:citiesLived,'')) citliv,
table(flatten(citliv.value:yearsLived,'')) yrliv
group by emps.value:fullName, citliv.value:place
order by 1,2;
Conclusion
In summary, you can see that the VARIANT data type offers native support for querying Snowflake JSON without the need to analyze the structure ahead of time or perform complex transformations. Querying Snowflake JSON data provides the same performance as all the standard relational data types.
Hevo is the only real-time ELT No-code Data Pipeline platform that cost-effectively automates data pipelines that are flexible to your needs. With integration with 150+ Data Sources (40+ free sources), we help you not only export data from sources & load data to the destinations but also transform & enrich your data, & make it analysis-ready.
Try Hevo today, a fully managed data pipeline for all your data needs.
Visit our Website to Explore Hevo
Want to take Hevo for a spin?
Sign Up and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable Hevo pricing that will help you choose the right plan for your business needs.
Share your experience with Snowflake JSON in the comments section below!
References:
- Load and query JSON data on Snowflake
Vivek Sinha is a seasoned product leader with over 10 years of expertise in revolutionizing real-time analytics and cloud-native technologies. He specializes in enhancing Apache Pinot, focusing on query processing and data mutability. Vivek is renowned for his strategic vision and ability to deliver cutting-edge solutions that empower businesses to harness the full potential of their data.