Data is empowering; it can help transform your business. To enable that transformation, businesses are collecting and storing as much data as they can. But some of this data such as emails, spreadsheets, and IOT logs comes in a semi structured state, and many businesses are making the mistake of transforming it to a structured format way too early in its lifecycle. This adds extra cost constraints and also means that they lose the ability to work with atomic data when needed.
This has seen more and more businesses adopting Snowflake — a blazing fast serverless, multi-cloud data warehouse. One of the best things about Snowflake is that you can load and query Snowflake semi-structured data in a way that Snowflake becomes sort of a mixture between a data lake and a data warehouse.
This page details how to query Snowflake’s semi-structured data. It also details how you can cast data types and then use the LATERAL and FLATTEN functions to access individual fields.
What is Snowflake?
- Snowflake is a fully-managed cloud data warehouse that is designed to help you modernize your analytics strategy. Using standard SQL, Snowflake allows you to start querying Gigabytes and Petabytes of both structured and semi-structured data in seconds.
- Snowflake automatically harnesses thousands of CPU cores to quickly execute queries for you. You can even query streaming data from your web, mobile apps, or IoT devices in real-time.
- Snowflake is secure and meets the most secure regulatory standards such as HIPAA, FedRAMP, and PCI DSS. Data is encrypted in transit and at rest by default and it’s automatically replicated, restored, and backed up to ensure business continuity.
Key Features of Snowflake
The Snowflake platform is a SaaS-based analytical, versatile, and intuitive Data Warehouse service. Some of the key features of Snowflake are:
- High Processing Speed
- Scalable
- Separate Storage and Compute Layer
- Disaster Recovery
- Pay Per Use Model
What is Semi-Structured Data?
Semi-structured data is a type of structured data that does not adhere to the tabular structure of data models associated with relational databases or other types of data tables, but still includes tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. As a result, it’s also referred to as a self-descriptive structure.
Even though they are grouped together in semi-structured data, entities belonging to the same class may have different attributes, and the order of the attributes is irrelevant.
Since the advent of the Internet, where full-text documents and databases are no longer the only forms of data and different applications require a medium for exchanging information, semi-structured data has become more common. Semi-structured data is common in object-oriented databases.
Data can come from a variety of places, including apps, sensors, and mobile phones. Semi-structured data formats (such as JSON, Avro, ORC, Parquet, and XML) have become popular standards for transporting and storing data to support these various data sources.
Why Query Semi-Structured Data?
Semi-structured data has become far more common in recent years as organizations embrace applications in systems that output data in file formats like JSON, Avro, ORC, and Parquet. The problem is that there are few easy and fast storage platforms that allow you to store and analyze semi-structured data at the same time.
Often semi-structured data files just get dumped into NoSQL tools like Hadoop and are forgotten because it’s too hard to query the data and use it performantly. Traditional relational databases and their NoSQL options:
- Have had limited support for semi-structured data
- Require hours of custom work to transform semi-structured data
- Require specialized NoSQL skills and knowledge
- Are siloed and disparate
- Are extremely slow.
This pushes many organizations to try and transform their data and place it in a relational database which takes even more effort and negates the advantages of semi-structured data.
Extrapolating even further, it’s often difficult to combine semi-structured data with existing business data so potential information and insights are often missed. Snowflake on the other hand completely supports these file formats and allows you to use them and query them without prior transformation using special SQL functions and operators as if they are structured datasets.
What is Supported File Formats for Semi-structured Data?
Libraries (usually with extensions) are used to generate JSON data in non-JavaScript applications. Generators of ad hoc JavaScript.
JSON document concatenation (which may or may not be line-separated). There are significant differences between different implementations due to the lack of a formal specification. If the JSON parser is strict in its language definition, these differences make importing JSON-like data sets impossible. Snowflake follows the rule “be liberal in what you accept” to make importing JSON data sets as trouble-free as possible. The goal is to accept as many JSON and JSON-like inputs as possible, allowing for unambiguous interpretation.
JSON data is a collection of name/value pairs organized into objects and arrays in a hierarchical structure:
- In name/value pairs, colons are used to separate names and values.
- Objects are indicated by curly braces.
- Arrays are denoted by square brackets [].
- Separate entities in objects and arrays with commas.
Examples of JSON Documents
{"root":[{"employees":[
{"firstName":"John", "lastName":"Doe"},
{"firstName":"Anna", "lastName":"Smith"},
{"firstName":"Peter", "lastName":"Jones"}
]}]}
There are three simple employee records in this array.
{"root":
[
{ "kind": "person",
"fullName": "John Doe",
"age": 22,
"gender": "Male",
"phoneNumber":
{"areaCode": "206",
"number": "1234567"},
"children":
[
{
"name": "Jane",
"gender": "Female",
"age": "6"
},
{
"name": "John",
"gender": "Male",
"age": "15"
}
],
"citiesLived":
[
{
"place": "Seattle",
"yearsLived": ["1995"]
},
{
"place": "Stockholm",
"yearsLived": ["2005"]
}
]
},
{"kind": "person", "fullName": "Mike Jones", "age": 35, "gender": "Male", "phoneNumber": { "areaCode": "622", "number": "1567845"}, "children": [{ "name": "Earl", "gender": "Male", "age": "10"}, {"name": "Sam", "gender": "Male", "age": "6"}, { "name": "Kit", "gender": "Male", "age": "8"}], "citiesLived": [{"place": "Los Angeles", "yearsLived": ["1989", "1993", "1998", "2002"]}, {"place": "Washington DC", "yearsLived": ["1990", "1993", "1998", "2008"]}, {"place": "Portland", "yearsLived": ["1993", "1998", "2003", "2005"]}, {"place": "Austin", "yearsLived": ["1973", "1998", "2001", "2005"]}]},
{"kind": "person", "fullName": "Anna Karenina", "age": 45, "gender": "Female", "phoneNumber": { "areaCode": "425", "number": "1984783"}, "citiesLived": [{"place": "Stockholm", "yearsLived": ["1992", "1998", "2000", "2010"]}, {"place": "Russia", "yearsLived": ["1998", "2001", "2005"]}, {"place": "Austin", "yearsLived": ["1995", "1999"]}]}
]
}
There are three employee records (objects) in this array, as well as their dependent data.
- Avro: Avro is an open-source data serialisation and RPC framework that was built with Apache Hadoop in mind. It generates serialised data in a compact binary format by using JSON schemas. Because the schema is included in the data, the serialised data can be sent to any destination (i.e., an application or programme) and easily deserialized there.
An Avro schema is made up of a JSON string, object, or array that specifies the schema type and data attributes (field names, data types, and so on). Depending on the schema type, the attributes are different. Arrays and maps are examples of complex data types that are supported.
Avro data is read into a single VARIANT column by Snowflake. You can use the same commands and functions to query the data in a VARIANT column as you would for JSON data.
Example of an Avro Schema
{
"type": "record",
"name": "person",
"namespace": "example.avro",
"fields": [
{"name": "fullName", "type": "string"},
{"name": "age", "type": ["int", "null"]},
{"name": "gender", "type": ["string", "null"]}
]
}
- ORC: The ORC (Optimized Row Columnar) file format, which is used to store Hive data, was created with the goal of providing better compression and performance for reading, writing, and processing data than previous Hive file formats. ORC data is read into a single VARIANT column in Snowflake. You can use similar commands and functions to query the data in a VARIANT column as you would JSON data.
You can also use the CREATE TABLE AS SELECT statement to extract columns from a staged ORC file into separate table columns.
The ORC file format is a binary one.
Example of ORC Data Loaded into a VARIANT Column
+--------------------------------------+
| SRC |
|--------------------------------------|
| { |
| "boolean1": false, |
| "byte1": 1, |
| "bytes1": "0001020304", |
| "decimal1": 12345678.654745, |
| "double1": -1.500000000000000e+01, |
| "float1": 1.000000000000000e+00, |
| "int1": 65536, |
| "list": [ |
| { |
| "int1": 3, |
| "string1": "good" |
| }, |
| { |
| "int1": 4, |
| "string1": "bad" |
| } |
| ] |
| } |
+--------------------------------------+
- Parquet: Parquet is a columnar data representation designed for Hadoop projects that is compressed and efficient. Dremel record shredding and assembly algorithms are used in the file format, which supports complex nested data structures. Parquet data is read into a single VARIANT column by Snowflake. You can use the same commands and functions to query the data in a VARIANT column as you would for JSON data. The schema of Parquet files created with the Parquet writer v1 is supported by Snowflake. Data loading is not supported for files created with the writer version 2.Alternatively, you can use a CREATE TABLE AS SELECT statement to extract selected columns from a staged Parquet file into separate table columns. The Parquet format is a binary one.
| SRC |
|------------------------------------------|
| { |
| "continent": "Europe", |
| "country": { |
| "city": { |
| "bag": [ |
| { |
| "array_element": "Paris" |
| }, |
| { |
| "array_element": "Nice" |
| }, |
| { |
| "array_element": "Marseilles" |
| }, |
| { |
| "array_element": "Cannes" |
| } |
| ] |
| }, |
| "name": "France" |
| } |
| } |
+------------------------------------------+
- XML: The markup language XML (Extensible Markup Language) defines a set of rules for encoding documents. It was based on SGML, another markup language that was created to standardize the structure and elements of a document.
Since its initial focus on documents, XML has expanded to include a wide range of applications, including the representation of arbitrary data structures and serving as the base language for communication protocols. It has become one of the most widely used standards for data interchange on the Web due to its extensibility, versatility, and usability.
The following constructs make up the majority of an XML document:
- Tags (identified by angle brackets, < and >)
- Elements
Elements are typically made up of a “start” tag and a “end” tag, with the text between the tags serving as the element’s content. An element can also be made up entirely of “empty-element” tags with no “end” tags. The “start” and “empty-element” tags can have attributes that help define the element’s characteristics or metadata.
How to Query Snowflake Semi-structured Data?
Prerequisites
- Access to a Snowflake account
- Install the Snowflake CLI
- Run the following script to create the resources you’ll need for this guide:
create or replace warehouse devwarehouse with
warehouse_size='X-SMALL'
auto_suspend = 120
auto_resume = true
initially_suspended=true;
use warehouse devwarehouse;
Example 1: Querying Snowflake Semi-Structured Data on Non-nested JSON File
The JSON file in our example defines an object with 3 properties:
The destination table will only have one column that will store the JSON object with the VARIANT data type. Use the following workbook to load the JSON file into a Snowflake table:
create database users;
use database users;
use schema users.public;
create or replace table json_user_data (src variant);
insert into json_user_data (src)
select parse_json('{
"name": "Sarah Banks",
"email": "sarahbanks@census.gov",
"ip_address": "28.56.176.2"
}, {
"name": "Giovani Giorgio",
"email": "ggiorgio@senate.gov",
"ip_address": "237.189.4.217"
}, {
"name": "Liam Nielsen",
"email": "nielsen@imageshack.us",
"ip_address": "150.33.122.255"
}, {
"name": "Frederico Fred",
"email": "ffred3@aol.com",
"ip_address": "76.67.177.39"
}');
First, let’s run a query to see the contents of the src variant column in the json_user_data table:
select * from json_user_data;
SRC |
{ “name”: “Sarah Banks”, “email”: “sarahbanks@census.gov”, “ip_address”: “28.56.176.2”}, { “name”: “Giovani Giorgio”, “email”: “ggiorgio@senate.gov”, “ip_address”: “237.189.4.217”}, { “name”: “Liam Nielsen”, “email”: “nielsen@imageshack.us”, “ip_address”: “150.33.122.255”}, { “name”: “Frederico Fred”, “email”: “ffred3@aol.com”, “ip_address”: “76.67.177.39”} |
You are now going to use the FLATTEN SQL function to convert every Key/Value pair that you have in your src VARIANT column into a row. This will form a relational table.
SELECT src:name::varchar AS NAME,
src:email::varchar AS EMAIL,
src:ip_address AS IP_ADDRESS
FROM json_user_data,
LATERAL FLATTEN(input => users.src);
After running this SQL query you will see that it gives you a structured view of your data.
Row | NAME | EMAIL | IP_ADDRESS |
1 | Sarah Banks | sarahbanks@census.gov | 28.56.176.2 |
2 | Giovani Giorgio | ggiorgio@senate.gov | 237.189.4.217 |
3 | Liam Nielsen | nielsen@imageshack.us | 150.33.122.255 |
4 | Frederico Fred | ffred3@aol.com | 76.67.177.39 |
Example 2: Querying Snowflake Semi-Structured Data on Nested JSON File
In this example, we will try to flatten a nested JSON file.
The following JSON dataset contains data about Grammy award-winning songs. Load the dataset into a new songs table and cast it within the VARIANT column using the following script:
create database songs;
use database songs;
create or replace table json_songs_data (src variant);
insert into json_songs_data (src)
select parse_json('{
"song_title":"I Can't Breathe",
"year":2020,
"artists": [
{
"first_name":"Dernst",
"middle_name":"Emile",
"last_name":"II"
},
{
"first_name":"Gabriella",
"middle_name":"Sarmiento",
"last_name":"Wilson"
},
{
"first_name":"Tiara",
"middle_name":"Nicole",
"last_name":"Thomas"
}
]
}
{
"song_title":"This is America",
"year":2019,
"artists":
[{
"first_name":"Donald",
"middle_name":null,
"last_name":"Glover"
},
{
"first_name":"Ludwig",
"middle_name":null,
"last_name":"Göransson"
},
{
"first_name":"Jeffery",
"middle_name":"Lamar",
"last_name":"Williams"
}
]
}
{
"song_title":"That’s What I Like",
"year":2018,
"artists":
[{
"first_name":"Peter",
"middle_name":"Gene",
"last_name":"Hernandez"
}
]
}
{
"song_title":"Hello",
"year":2017,
"artists":
[{
"first_name":"Adele",
"middle_name":"Laurie",
"last_name":"Adkins"
}
]
},
{
"song_title":"Thinking Out Loud",
"year":2016,
"artists":
[{
"first_name":"Edward",
"middle_name":"Christopher",
"last_name":"Sheeran"
}
]
}');
The following query will flatten the NESTED JSON data so that each member of the array will become its own row. The lateral modifier on the other hand will join the data with any information outside the object.
SELECT
value:first_name::VARCHAR AS FIRST_NAME,
value:middle_name::VARCHAR AS MIDDLE_NAME,
value:last_name::VARCHAR AS LAST_NAME,
src:song_title::VARCHAR AS SONG_TITLE,
src:year::VARCHAR AS YEAR
FROM json_songs_data
,LATERAL FLATTEN(input => src:artists);
As expected, the SQL query results display a more structured view of the data. Flatten is taking all those elements and is creating new rows. Lateral on the other hand is joining it back to the original row and is producing those columns. In other words, flatten makes the table longer while lateral makes it wider.
Row | FIRST_NAME | MIDDLE_NAME | LAST_NAME | SONG_TITLE | YEAR |
1 | Dernst | Emile | II | I Can’t Breathe | 2020 |
2 | Gabriella | Sarmiento | Wilson | I Can’t Breathe | 2020 |
3 | Tiara | Nicole | Thomas | I Can’t Breathe | 2020 |
4 | Donald | NULL | Glover | This is America | 2019 |
5 | Ludwig | NULL | Göransson | This is America | 2019 |
6 | Jeffery | Lamar | Williams | This is America | 2019 |
7 | Peter | Gene | Hernandez | That’s What I Like | 2018 |
8 | Adele | Laurie | Adkins | Hello | 2017 |
9 | Edward | Christopher | Sheeran | Thinking Out Loud | 2016 |
What are the Semi-structured Data Types?
Other data types can be contained within the Snowflake data types listed below:
- “VARIANT” (can contain any other data type):
Any other type of value, including OBJECT and ARRAY, can be stored in a VARIANT.
A VARIANT can be up to 16 MB in length. You can use the “CAST” function, the “TO VARIANT” function, or the “::” operator to explicitly cast a value to or from the VARIANT data type (e.g. expression::variant). Use IIS to directly insert VARIANT data (INSERT INTO … SELECT).
The following example demonstrates how to add JSON-formatted data to a VARIANT:
insert into varia (v)
select to_variant(parse_json('{"key3": "value3", "key4": "value4"}'));
Typically, a VARIANT is used when:
- You want to create hierarchical data by defining a hierarchy with two or more ARRAYs or OBJECTs explicitly.
- You want to load JSON, Avro, ORC, or Parquet data without having to specify the data’s hierarchical structure.
Snowflake can convert data from JSON, Avro, ORC, or Parquet into an internal hierarchy of ARRAY, OBJECT, and VARIANT data and store it directly in a VARIANT. Although you can create the data hierarchy manually, it is usually more convenient to have Snowflake do it for you.
- “ARRAY” (can directly contain VARIANT, and thus indirectly contain any other data type, including itself): An array in many other programming languages is similar to a Snowflake ARRAY. There are 0 or more data items in an ARRAY. The array’s position is used to access each element.
VARIANT is the type of each value in an ARRAY. (A VARIANT can have any data type as a value.)
Casting values of other data types to VARIANT and storing them in an array is possible. ARRAY CONSTRUCT, for example, can implicitly cast values to VARIANT.
The underlying data types of the values in an ARRAY do not have to be identical because ARRAYs store VARIANT values, and VARIANT values can store other data types within them. In most cases, however, the data elements are of the same or compatible types, allowing them to be processed in the same manner.
Arrays of elements of a specific non-VARIANT type are not supported by Snowflake.
The number of elements in a Snowflake ARRAY is not specified when it is declared. ARRAY APPEND, for example, can cause an ARRAY to grow dynamically. Fixed-size arrays are not currently supported by Snowflake.
NULL values can be found in an ARRAY. The maximum combined size of all values in an ARRAY is theoretically 16 MB. ARRAYs, on the other hand, have internal overhead. Depending on the number and values of the elements, the practical maximum data size is usually smaller.
Use IIS to manually insert VARIANT data (INSERT INTO … SELECT). An ARRAY of values is inserted into a table using the following code:
create table array_example (a array);
insert into array_example (a) select array_construct(12, 'twelve', null);
When one or more of the following conditions apply, an ARRAY is typically used:
- You have a lot of data, all of which is structured similarly or identically.
- Each piece of data should be treated in the same way. You could, for example, loop through the data and process each piece in the same way.
- The information is organised in a natural way, such as chronologically.
- “OBJECT” (can directly contain VARIANT, and thus indirectly contain any other data type, including itself):
A Snowflake OBJECT is the equivalent of a JSON “object.” The corresponding data type is often referred to as a “dictionary,” “hash,” or “map” in other programming languages.
An OBJECT is a collection of key-value pairs.
Different values (in different key-value pairs) may have different underlying data types because VARIANT can store any other data type. An OBJECT, for example, can store a VARCHAR and an INTEGER representing a person’s name and age.
Explicitly typed objects are not currently supported by Snowflake.
The key should not be an empty string, and neither the key nor the value should be NULL in a key-value pair.
An OBJECT can be up to 16 MB in length. Semi-structured data can be stored in an OBJECT. Hierarchical data structures can be created using an OBJECT.
Use IIS to manually insert OBJECT data (INSERT INTO … SELECT). An OBJECT is inserted into a table with the following code:
create table object_example (o object);
insert into object_example (o) select object_construct('key', 'value 1'::variant);
When one or more of the following are true, an OBJECT is typically used:
- You’ve got a bunch of data that’s all linked together by strings. If you want to look up information by province name, for example, you could use an OBJECT.
- You want to keep information about the data with the data; the names (keys) should be meaningful, not just unique identifiers.
- There is no natural order to the information, or the order can be inferred solely from the keys.
- The data may have a different structure or be incomplete. For example, if you want to create a book catalogue that typically includes the title, author name, and publication date, but the publication date in some cases is unknown, you could use an OBJECT.
These data types are frequently referred to as semi-structured data types. OBJECT is the only one of these data types that has all of the characteristics of a true semi-structured data type all by itself. Combining these data types, on the other hand, allows you to represent arbitrary hierarchical data structures explicitly, which can be used to load and operate on semi-structured data (e.g. JSON, Avro, ORC,
What are the Snowflake Semi-structured Data Functions?
These functions work with semi-structured data (such as JSON, Avro, and XML), which is typically stored in VARIANT, OBJECT, or ARRAY columns in Snowflake.
The functions are divided into categories based on the type of operation performed:
- Data parsing in JSON and XML.
- Arrays and objects are created and manipulated.
- Value extraction from semi-structured data.
- Converting semi-structured data to and from different data types.
- Identifying the data type of semi-structured data values (i.e. type predicates).
How to Load Snowflake Semi-Structured Data?
Semi-structured data is loaded into tables in the same way that structured data is loaded into tables. You can, however, explicitly specify all, some, or none of the structure when loading and storing semi-structured data:
- You can load data into an OBJECT column if it’s a set of key-value pairs.
- If your data is an array, you can store it in an ARRAY column.
- You can do one of two things with Hierarchical Data:
- Split the data into several columns. You have the option of:
- Exact column extraction and transformation from semi-structured data into separate columns in target tables.
- Snowflake can be used to automatically detect and retrieve column definitions from semi-structured data files that have been staged. Column definitions can be used to make Snowflake tables, external tables, or views. Create tables with column definitions automatically retrieved from the staged files to save time.
- Store the information in a single VARIANT column. You have the option to:
- Clearly define the structure (e.g. specify a hierarchy of VARIANT, ARRAY, and OBJECT data types).
- Load the data without specifying the structure explicitly. The data is converted to an internal data format that uses Snowflake VARIANT, ARRAY, and OBJECT data types if you specify a data format that Snowflake recognizes and parses (JSON, Avro, Parquet, or ORC).
You can use more than one of the preceding techniques if the data is complex or a single value requires more than about 16MB of storage space. You can, for example, divide the data into multiple columns, with some of those columns containing a hierarchy of data types that is explicitly specified.
You can query data that has been stored in the ARRAY, OBJECT, or VARIANT data types, or a hierarchy of those types.
How to Store Snowflake Semi-Structured Data?
The following Snowflake data types are commonly used to store semi-structured data:
- ARRAY is a language construct that is similar to an array in other languages.
- OBJECT: a “dictionary,” “hash,” or “map” in many languages, similar to a JSON object. This is a collection of key-value pairs.
- VARIANT: a data type that can hold any other data type’s value (including ARRAY and OBJECT). VARIANT allows you to create and store hierarchical data.
(If the data is split into multiple columns, some or all of them can be simple data types like FLOAT, VARCHAR, and so on.)
The data types ARRAY, OBJECT, and VARIANT can be used separately or nested to create a hierarchy.
You can make an ARRAY, for example, and each element of the ARRAY can contain an OBJECT. Snowflake can build the hierarchy for you and store it in a VARIANT if the data is in JSON, Avro, ORC, or Parquet format. You can also manually create a hierarchy.
Snowflake converts the data to an optimized internal storage format that uses ARRAY, OBJECT, and VARIANT, regardless of how the hierarchy was built. SQL querying is quick and easy with this internal storage format.
Read More About: Replicate Data from Square to Snowflake
Conclusion
In summary, you can load, store, and actually use Snowflake semi-structured data in minutes. Snowflake supports all of the most common Snowflake semi-structured data types and enables you to query the data with standard SQL that you probably already know and use. Best of all, you can combine with your existing data sources enabling you to make the comparisons that generate insights in the information you make so much effort to record.
Extracting complex data from a diverse set of data sources can be challenging, and this is where Hevo saves the day! Hevo offers a faster way to move data from 150+ Data Sources like Databases or SaaS applications into your Data Warehouses such as Snowflake, BigQuery, Redshift to be visualized in a BI tool of your choice. Hevo is fully automated and hence does not require you to code.
Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.
Dimple is an experienced Customer Experience Engineer with four years of industry proficiency, including the last two years at Hevo, where she has significantly refined customer experiences within the innovative data integration platform. She is skilled in computer science, databases, Java, and management. Dimple holds a B.Tech in Computer Science and excels in delivering exceptional consulting services. Her contributions have greatly enhanced customer satisfaction and operational efficiency.