Querying Snowflake Semi-Structured Data: 2 Critical Aspects

on Data Warehouses, Snowflake • January 20th, 2022


Data is empowering; it can help transform your business. To enable that transformation, businesses are collecting and storing as much data as they can. But some of this data, such as emails, spreadsheets, and IoT logs, comes in a semi-structured state, and many businesses make the mistake of transforming it to a structured format too early in its lifecycle. This adds extra cost and also means they lose the ability to work with the atomic data when needed.

This has led more and more businesses to adopt Snowflake, a blazing-fast, serverless, multi-cloud data warehouse. One of the best things about Snowflake is that you can load and query semi-structured data in a way that makes Snowflake something of a hybrid between a data lake and a data warehouse.

Businesses should store the raw data as is in an inexpensive form (like a cloud storage bucket) and then add structure to it later as the need arises. This ensures that the business remains responsive and agile to its BI needs without compromising on data fidelity because this data may be required in its original format further down the road.

This page details how to query Snowflake semi-structured data. It also shows how to cast data types and use the LATERAL keyword and the FLATTEN table function to access individual fields.


What is Snowflake?


Snowflake is a fully-managed cloud data warehouse that is designed to help you modernize your analytics strategy. Using standard SQL, Snowflake allows you to start querying gigabytes or even petabytes of both structured and semi-structured data in seconds.

Snowflake automatically harnesses thousands of CPU cores to quickly execute queries for you. You can even query streaming data from your web, mobile apps, or IoT devices in real-time.

Snowflake is secure and meets stringent regulatory standards such as HIPAA, FedRAMP, and PCI DSS. Data is encrypted in transit and at rest by default, and it’s automatically replicated, restored, and backed up to ensure business continuity.

Interacting with Snowflake is easy. You can use the web-based UI, Snowflake’s command-line tool, or use their API with client libraries of your choice. When it comes to visualizing your data, Snowflake partners with some popular BI tools such as Tableau and Looker to help you turn complex data into compelling stories.


Key Features of Snowflake

The Snowflake platform is a SaaS-based analytical, versatile, and intuitive Data Warehouse service. It offers a Data Warehouse that is faster, easier to use, and significantly more customizable than traditional Data Warehouse solutions. Some of the key features of Snowflake are:

  • High Processing Speed: Each Virtual Warehouse in Snowflake is backed by a Massively Parallel Processing (MPP) compute cluster, which allows jobs to run in parallel without affecting the performance of other clusters.
  • Scalable: Snowflake is a secure, scalable architecture that allows you to create an unlimited number of Virtual Warehouses in a couple of minutes. You can run multiple jobs in parallel without having to worry about performance or memory management.
  • Separate Storage and Compute Layer: Snowflake employs separate storage and compute layers that can be scaled up and down independently of one another.
  • Disaster Recovery: Snowflake replicates data three times (by default) across availability zones and regions, ensuring that the system is completely fail-safe and fault-tolerant.
  • Pay Per Use Model: Snowflake is available on a pay-per-use basis, which means you only pay for the time you actually use the service. To save money, you can shut down a Warehouse once execution is finished.

To explore more about Snowflake, visit the official website here.

Simplify Snowflake ETL using Hevo’s No-code Data Pipelines

A fully managed No-code Data Pipeline platform like Hevo helps you integrate data from 100+ data sources (including 40+ Free Data Sources) to a destination of your choice such as Snowflake in real-time in an effortless manner. Hevo, with its minimal learning curve, can be set up in just a few minutes, allowing users to load data without having to compromise performance. Its strong integration with a multitude of sources provides users with the flexibility to bring in data of different kinds, in a smooth fashion, without having to code a single line. 

Get Started with Hevo for Free

Check out some of the cool features of Hevo:

  • Completely Automated: The Hevo platform can be set up in just a few minutes and requires minimal maintenance.
  • Real-Time Data Transfer: Hevo provides real-time data migration, so you can have analysis-ready data always.
  • 100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure ensures reliable data transfer with zero data loss.
  • Scalable Infrastructure: Hevo has in-built integrations for 100+ sources that can help you scale your data infrastructure as required.
  • 24/7 Live Support: The Hevo team is available round the clock to extend exceptional support to you through chat, email, and support calls.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Live Monitoring: Hevo allows you to monitor the data flow so you can check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

What is Semi-Structured Data?

Semi-structured data is a form of data that does not adhere to the tabular structure of relational databases or other types of data tables, but still includes tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. As a result, it is also referred to as having a self-describing structure.
Even though they are grouped together in semi-structured data, entities belonging to the same class may have different attributes, and the order of the attributes is irrelevant.
Since the advent of the Internet, where full-text documents and databases are no longer the only forms of data and different applications require a medium for exchanging information, semi-structured data has become more common. Semi-structured data is common in object-oriented databases.

Data can come from a variety of places, including apps, sensors, and mobile phones. Semi-structured data formats (such as JSON, Avro, ORC, Parquet, and XML) have become popular standards for transporting and storing data to support these various data sources.

Snowflake has built-in support for semi-structured data loading, storage, and querying.

Why Query Semi-Structured Data?

Snowflake Semi-structured Data: Semi-Structured Data Illustration | Hevo Data
Image Source

Semi-structured data has become far more common in recent years as organizations embrace applications in systems that output data in file formats like JSON, Avro, ORC, and Parquet. The problem is that there are few easy and fast storage platforms that allow you to store and analyze semi-structured data at the same time.

Often, semi-structured data files just get dumped into big data platforms like Hadoop and are forgotten, because it is too hard to query the data performantly. Traditional relational databases and their NoSQL alternatives:

  • Have had limited support for semi-structured data
  • Require hours of custom work to transform semi-structured data
  • Require specialized NoSQL skills and knowledge
  • Are siloed and disparate
  • Are extremely slow

This pushes many organizations to transform their data and place it in a relational database, which takes even more effort and negates the advantages of semi-structured data.

Extrapolating further, it is often difficult to combine semi-structured data with existing business data, so potential information and insights are missed. Snowflake, on the other hand, fully supports these file formats and allows you to query them without prior transformation, using special SQL functions and operators, as if they were structured datasets.


What are the Supported File Formats for Semi-structured Data?

  • JSON:

JSON data can be produced in several ways: by ad hoc JavaScript generators, by libraries (usually with extensions) in non-JavaScript applications, or by concatenating JSON documents (which may or may not be line-separated).

Because there is no formal specification covering these variants, there are significant differences between implementations. If a JSON parser is strict in its language definition, these differences can make importing JSON-like data sets impossible. Snowflake follows the rule “be liberal in what you accept” to make importing JSON data sets as trouble-free as possible. The goal is to accept the widest range of JSON and JSON-like inputs that still permit unambiguous interpretation.

JSON data is a collection of name/value pairs organized into objects and arrays in a hierarchical structure:

  • In name/value pairs, colons are used to separate names and values.
  • Objects are indicated by curly braces {}.
  • Arrays are denoted by square brackets [].
  • Separate entities in objects and arrays with commas.

Examples of JSON Documents

{"root":[{"employees":[
    {"firstName":"John", "lastName":"Doe"},
    {"firstName":"Anna", "lastName":"Smith"},
    {"firstName":"Peter", "lastName":"Jones"}
]}]}

There are three simple employee records in this array.


{"root":
   [
    { "kind": "person",
      "fullName": "John Doe",
      "age": 22,
      "gender": "Male",
      "phoneNumber":
        {"areaCode": "206",
         "number": "1234567"},
      "children":
         [
           {
             "name": "Jane",
             "gender": "Female",
             "age": "6"
           },
           {
              "name": "John",
              "gender": "Male",
              "age": "15"
           }
         ],
      "citiesLived":
         [
            {
               "place": "Seattle",
               "yearsLived": ["1995"]
            },
            {
               "place": "Stockholm",
               "yearsLived": ["2005"]
            }
         ]
      },
      {"kind": "person", "fullName": "Mike Jones", "age": 35, "gender": "Male", "phoneNumber": { "areaCode": "622", "number": "1567845"}, "children": [{ "name": "Earl", "gender": "Male", "age": "10"}, {"name": "Sam", "gender": "Male", "age": "6"}, { "name": "Kit", "gender": "Male", "age": "8"}], "citiesLived": [{"place": "Los Angeles", "yearsLived": ["1989", "1993", "1998", "2002"]}, {"place": "Washington DC", "yearsLived": ["1990", "1993", "1998", "2008"]}, {"place": "Portland", "yearsLived": ["1993", "1998", "2003", "2005"]}, {"place": "Austin", "yearsLived": ["1973", "1998", "2001", "2005"]}]},
      {"kind": "person", "fullName": "Anna Karenina", "age": 45, "gender": "Female", "phoneNumber": { "areaCode": "425", "number": "1984783"}, "citiesLived": [{"place": "Stockholm", "yearsLived": ["1992", "1998", "2000", "2010"]}, {"place": "Russia", "yearsLived": ["1998", "2001", "2005"]}, {"place": "Austin", "yearsLived": ["1995", "1999"]}]}
    ]
}

There are three person records (objects) in this array, as well as their dependent data.

  • Avro: Avro is an open-source data serialization and RPC framework that was originally built for Apache Hadoop. It generates serialized data in a compact binary format by using JSON schemas. Because the schema is included with the data, the serialized data can be sent to any destination (i.e., an application or program) and easily deserialized there.

An Avro schema is made up of a JSON string, object, or array that specifies the schema type and data attributes (field names, data types, and so on). Depending on the schema type, the attributes are different. Arrays and maps are examples of complex data types that are supported.

Avro data is read into a single VARIANT column by Snowflake. You can use the same commands and functions to query the data in a VARIANT column as you would for JSON data.

Example of an Avro Schema

{
 "type": "record",
 "name": "person",
 "namespace": "example.avro",
 "fields": [
     {"name": "fullName", "type": "string"},
     {"name": "age",  "type": ["int", "null"]},
     {"name": "gender", "type": ["string", "null"]}
     ]
}
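To illustrate, loading a staged Avro file that matches a schema like this into a VARIANT column might look like the following sketch (the stage name `@my_stage`, the file name, and the format name are hypothetical):

```sql
-- Hypothetical stage, file, and format names
create or replace file format my_avro_format type = 'avro';
create or replace table avro_people (src variant);

copy into avro_people
  from @my_stage/people.avro
  file_format = (format_name = 'my_avro_format');

-- Query the VARIANT column just as you would JSON data
select src:fullName::varchar as full_name,
       src:age::int         as age
from avro_people;
```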
  • ORC: The ORC (Optimized Row Columnar) file format, which is used to store Hive data, was created with the goal of providing better compression and performance for reading, writing, and processing data than previous Hive file formats. ORC data is read into a single VARIANT column in Snowflake. You can use similar commands and functions to query the data in a VARIANT column as you would JSON data.

You can also use the CREATE TABLE AS SELECT statement to extract columns from a staged ORC file into separate table columns.

The ORC file format is a binary one.

Example of ORC Data Loaded into a VARIANT Column

+--------------------------------------+
| SRC                                  |
|--------------------------------------|
| {                                    |
|   "boolean1": false,                 |
|   "byte1": 1,                        |
|   "bytes1": "0001020304",            |
|   "decimal1": 12345678.654745,       |
|   "double1": -1.500000000000000e+01, |
|   "float1": 1.000000000000000e+00,   |
|   "int1": 65536,                     |
|   "list": [                          |
|     {                                |
|       "int1": 3,                     |
|       "string1": "good"              |
|     },                               |
|     {                                |
|       "int1": 4,                     |
|       "string1": "bad"               |
|     }                                |
|   ]                                  |
| }                                    |
+--------------------------------------+
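As a sketch of the CREATE TABLE AS SELECT approach mentioned above, you could extract fields from a staged ORC file into typed columns (the stage, file, and format names are hypothetical):

```sql
-- Hypothetical names; $1 refers to the single VARIANT produced per ORC record
create or replace file format my_orc_format type = 'orc';

create or replace table orc_typed as
  select $1:boolean1::boolean as boolean1,
         $1:int1::int         as int1,
         $1:double1::double   as double1
  from @my_stage/demo.orc
  (file_format => 'my_orc_format');
```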
  • Parquet: Parquet is a compressed, efficient columnar data representation designed for Hadoop projects. The file format uses Dremel record shredding and assembly algorithms and supports complex nested data structures. Parquet data is read into a single VARIANT column by Snowflake, and you can query it with the same commands and functions you would use for JSON data. Snowflake supports the schema of Parquet files created with the Parquet writer v1; data loading is not supported for files created with writer version 2. Alternatively, you can use a CREATE TABLE AS SELECT statement to extract selected columns from a staged Parquet file into separate table columns. The Parquet format is a binary one.

Example of Parquet Data Loaded into a VARIANT Column

+------------------------------------------+
| SRC                                      |
|------------------------------------------|
| {                                        |
|   "continent": "Europe",                 |
|   "country": {                           |
|     "city": {                            |
|       "bag": [                           |
|         {                                |
|           "array_element": "Paris"       |
|         },                               |
|         {                                |
|           "array_element": "Nice"        |
|         },                               |
|         {                                |
|           "array_element": "Marseilles"  |
|         },                               |
|         {                                |
|           "array_element": "Cannes"      |
|         }                                |
|       ]                                  |
|     },                                   |
|     "name": "France"                     |
|   }                                      |
| }                                        |
+------------------------------------------+
  • XML: The markup language XML (Extensible Markup Language) defines a set of rules for encoding documents. It was based on SGML, another markup language that was created to standardize the structure and elements of a document.

Since its initial focus on documents, XML has expanded to include a wide range of applications, including the representation of arbitrary data structures and serving as the base language for communication protocols. It has become one of the most widely used standards for data interchange on the Web due to its extensibility, versatility, and usability.

The following constructs make up the majority of an XML document:

  • Tags (identified by angle brackets, < and >)
  • Elements

Elements are typically made up of a “start” tag and an “end” tag, with the text between the tags serving as the element’s content. An element can also consist of a single “empty-element” tag with no “end” tag. “Start” and “empty-element” tags can carry attributes that help define the element’s characteristics or metadata.
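In Snowflake, an XML document like this can be parsed into a VARIANT with PARSE_XML (a preview feature) and its elements read with XMLGET; a minimal sketch:

```sql
-- Parse a small XML document and extract an element's text content via the "$" key
select xmlget(parse_xml('<person id="1"><name>Jane</name></person>'), 'name'):"$"::varchar as name;
```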

What are the Semi-structured Data Types?

Other data types can be contained within the Snowflake data types listed below:

  • “VARIANT” (can contain any other data type):

Any other type of value, including OBJECT and ARRAY, can be stored in a VARIANT.

A VARIANT value can be up to 16 MB in size. You can use the “CAST” function, the “TO_VARIANT” function, or the “::” operator to explicitly cast a value to or from the VARIANT data type (e.g. expression::variant). To insert VARIANT data directly, use an INSERT INTO … SELECT statement.

The following example demonstrates how to add JSON-formatted data to a VARIANT:

create table varia (v variant);
insert into varia (v)
    select to_variant(parse_json('{"key3": "value3", "key4": "value4"}'));
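All three casting styles produce a VARIANT result; as a quick sketch:

```sql
-- Each expression yields a VARIANT value
select 'hello'::variant              as v1,  -- :: operator
       to_variant(42)                as v2,  -- TO_VARIANT function
       cast(current_date as variant) as v3;  -- CAST function
```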

Typically, a VARIANT is used when:

  • You want to create hierarchical data by defining a hierarchy with two or more ARRAYs or OBJECTs explicitly.
  • You want to load JSON, Avro, ORC, or Parquet data without having to specify the data’s hierarchical structure.

Snowflake can convert data from JSON, Avro, ORC, or Parquet into an internal hierarchy of ARRAY, OBJECT, and VARIANT data and store it directly in a VARIANT. Although you can create the data hierarchy manually, it is usually more convenient to have Snowflake do it for you.

  • “ARRAY” (can directly contain VARIANT, and thus indirectly contain any other data type, including itself): A Snowflake ARRAY is similar to an array in many other programming languages. An ARRAY contains 0 or more pieces of data, and each element is accessed by its position in the array.

VARIANT is the type of each value in an ARRAY. (A VARIANT can have any data type as a value.)
Casting values of other data types to VARIANT and storing them in an array is possible. ARRAY_CONSTRUCT, for example, can implicitly cast values to VARIANT.

The underlying data types of the values in an ARRAY do not have to be identical because ARRAYs store VARIANT values, and VARIANT values can store other data types within them. In most cases, however, the data elements are of the same or compatible types, allowing them to be processed in the same manner.

Arrays of elements of a specific non-VARIANT type are not supported by Snowflake.

The number of elements in a Snowflake ARRAY is not specified when it is declared. An ARRAY can grow dynamically, for example through ARRAY_APPEND. Fixed-size arrays are not currently supported by Snowflake.

An ARRAY can contain NULL values. The maximum combined size of all values in an ARRAY is theoretically 16 MB; ARRAYs, however, have internal overhead, so the practical maximum data size is usually smaller, depending on the number and values of the elements.

To insert an ARRAY manually, use an INSERT INTO … SELECT statement. The following code inserts an ARRAY of values into a table:

create table array_example (a array);
insert into array_example (a) select array_construct(12, 'twelve', null);
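Once inserted, the elements can be read back by position; a sketch continuing the example above:

```sql
-- Access array elements by index; indexes start at 0
select a[0]          as first_element,   -- 12
       a[1]::varchar as second_element,  -- 'twelve'
       array_size(a) as num_elements     -- 3 (NULL elements are counted)
from array_example;
```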

When one or more of the following conditions apply, an ARRAY is typically used:

  • You have a lot of data, all of which is structured similarly or identically.
  • Each piece of data should be treated in the same way. You could, for example, loop through the data and process each piece in the same way.
  • The information is organized in a natural way, such as chronologically.
  • “OBJECT” (can directly contain VARIANT, and thus indirectly contain any other data type, including itself):

A Snowflake OBJECT is the equivalent of a JSON “object.” The corresponding data type is often referred to as a “dictionary,” “hash,” or “map” in other programming languages.

An OBJECT is a collection of key-value pairs.

Different values (in different key-value pairs) may have different underlying data types because VARIANT can store any other data type. An OBJECT, for example, can store a VARCHAR and an INTEGER representing a person’s name and age.

Explicitly typed objects are not currently supported by Snowflake.

The key should not be an empty string, and neither the key nor the value should be NULL in a key-value pair.

An OBJECT can be up to 16 MB in length. Semi-structured data can be stored in an OBJECT. Hierarchical data structures can be created using an OBJECT.

To insert OBJECT data manually, use an INSERT INTO … SELECT statement. The following code inserts an OBJECT into a table:

create table object_example (o object);
insert into object_example (o) select object_construct('key', 'value 1'::variant);
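Values in an OBJECT column can then be looked up by key, for example:

```sql
-- Look up a value by key with colon notation and cast it back to VARCHAR
select o:key::varchar as val,   -- 'value 1'
       object_keys(o) as keys   -- the OBJECT's keys, as an ARRAY
from object_example;
```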

When one or more of the following are true, an OBJECT is typically used:

  • Your data is identified by strings. For example, if you want to look up information by province name, you could use an OBJECT.
  • You want to keep information about the data with the data; the names (keys) should be meaningful, not just unique identifiers.
  • There is no natural order to the information, or the order can be inferred solely from the keys.
  • The data may have a different structure or be incomplete. For example, if you want to create a book catalogue that typically includes the title, author name, and publication date, but the publication date in some cases is unknown, you could use an OBJECT.

These data types are frequently referred to as semi-structured data types. OBJECT is the only one of these data types that, by itself, has all of the characteristics of a true semi-structured data type. Combining these data types, on the other hand, allows you to represent arbitrary hierarchical data structures explicitly, which can be used to load and operate on semi-structured data (e.g. JSON, Avro, ORC, Parquet, or XML).

What are the Snowflake Semi-structured Data Functions?

These functions work with semi-structured data (such as JSON, Avro, and XML), which is typically stored in VARIANT, OBJECT, or ARRAY columns in Snowflake.

The functions are divided into categories based on the type of operation performed:

  • Data parsing in JSON and XML.
  • Arrays and objects are created and manipulated.
  • Value extraction from semi-structured data.
  • Converting semi-structured data to and from different data types.
  • Identifying the data type of semi-structured data values (i.e. type predicates).
JSON and XML Parsing:

  • CHECK_JSON
  • CHECK_XML (preview feature)
  • JSON_EXTRACT_PATH_TEXT
  • PARSE_JSON
  • PARSE_XML (preview feature)
  • STRIP_NULL_VALUE

Array/Object Creation and Manipulation:

  • ARRAY_AGG (see also Aggregate Functions)
  • ARRAY_APPEND
  • ARRAY_CAT
  • ARRAY_COMPACT
  • ARRAY_CONSTRUCT
  • ARRAY_CONSTRUCT_COMPACT
  • ARRAY_CONTAINS
  • ARRAY_INSERT
  • ARRAY_INTERSECTION
  • ARRAY_POSITION
  • ARRAY_PREPEND
  • ARRAY_SIZE
  • ARRAY_SLICE
  • ARRAY_TO_STRING
  • ARRAYS_OVERLAP
  • OBJECT_AGG (see also Aggregate Functions)
  • OBJECT_CONSTRUCT
  • OBJECT_CONSTRUCT_KEEP_NULL
  • OBJECT_DELETE
  • OBJECT_INSERT
  • OBJECT_PICK

Extraction:

  • FLATTEN (table function)
  • GET
  • GET_IGNORE_CASE
  • GET_PATH, : (variation of GET)
  • OBJECT_KEYS (extracts keys from key/value pairs in an OBJECT)
  • XMLGET (preview feature)

Conversion/Casting:

  • AS_<object_type>: AS_ARRAY, AS_BINARY, AS_CHAR / AS_VARCHAR, AS_DATE, AS_DECIMAL / AS_NUMBER, AS_DOUBLE / AS_REAL, AS_INTEGER, AS_OBJECT, AS_TIME, AS_TIMESTAMP_*
  • STRTOK_TO_ARRAY
  • TO_ARRAY
  • TO_JSON
  • TO_OBJECT
  • TO_VARIANT
  • TO_XML

Type Predicates:

  • IS_<object_type>: IS_ARRAY, IS_BOOLEAN, IS_BINARY, IS_CHAR / IS_VARCHAR, IS_DATE / IS_DATE_VALUE, IS_DECIMAL, IS_DOUBLE / IS_REAL, IS_INTEGER, IS_NULL_VALUE, IS_OBJECT, IS_TIME, IS_TIMESTAMP_*
  • TYPEOF
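A few of these functions combined in one query, as a quick sketch:

```sql
-- PARSE_JSON builds a VARIANT; GET extracts by index; TYPEOF reports the stored type
select typeof(parse_json('{"a": 1}'))                  as t,       -- OBJECT
       get(parse_json('[1, "two"]'), 1)::varchar       as second,  -- 'two'
       array_to_string(array_construct(1, 2, 3), '-')  as joined;  -- '1-2-3'
```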

How to Load Snowflake Semi-Structured Data?

Semi-structured data is loaded into tables in the same way that structured data is loaded into tables. You can, however, explicitly specify all, some, or none of the structure when loading and storing semi-structured data:

  • You can load data into an OBJECT column if it’s a set of key-value pairs.
  • If your data is an array, you can store it in an ARRAY column.
  • You can do one of two things with hierarchical data:
    • Split the data into several columns. You have the option of:
      • Extracting and transforming individual columns from the semi-structured data into separate columns in target tables.
      • Having Snowflake automatically detect and retrieve column definitions from staged semi-structured data files. The column definitions can be used to create Snowflake tables, external tables, or views; creating tables with column definitions retrieved automatically from the staged files saves time.
    • Store the information in a single VARIANT column. You have the option to:
      • Explicitly define the structure (e.g. specify a hierarchy of VARIANT, ARRAY, and OBJECT data types).
      • Load the data without specifying the structure explicitly. If you specify a data format that Snowflake recognizes and parses (JSON, Avro, Parquet, or ORC), the data is converted to an internal format that uses Snowflake VARIANT, ARRAY, and OBJECT data types.

You can use more than one of the preceding techniques if the data is complex or a single value would require more than about 16 MB of storage space. For example, you can divide the data into multiple columns, with some of those columns containing an explicitly specified hierarchy of data types.

You can query data that has been stored in the ARRAY, OBJECT, or VARIANT data types, or a hierarchy of those types.

How to Store Snowflake Semi-Structured Data?

The following Snowflake data types are commonly used to store semi-structured data:

  • ARRAY is a language construct that is similar to an array in other languages.
  • OBJECT: a “dictionary,” “hash,” or “map” in many languages, similar to a JSON object. This is a collection of key-value pairs.
  • VARIANT: a data type that can hold any other data type’s value (including ARRAY and OBJECT). VARIANT allows you to create and store hierarchical data.

(If the data is split into multiple columns, some or all of them can be simple data types like FLOAT, VARCHAR, and so on.)

The data types ARRAY, OBJECT, and VARIANT can be used separately or nested to create a hierarchy.

You can make an ARRAY, for example, and each element of the ARRAY can contain an OBJECT. Snowflake can build the hierarchy for you and store it in a VARIANT if the data is in JSON, Avro, ORC, or Parquet format. You can also manually create a hierarchy.

Snowflake converts the data to an optimized internal storage format that uses ARRAY, OBJECT, and VARIANT, regardless of how the hierarchy was built. SQL querying is quick and easy with this internal storage format.

How to Query Snowflake Semi-structured Data?

Prerequisites

  • Access to a Snowflake account
  • Install the Snowflake CLI
  • Run the following script to create the resources you’ll need for this guide:
create or replace warehouse devwarehouse with
  warehouse_size='X-SMALL'
  auto_suspend = 120
  auto_resume = true
  initially_suspended=true;

use warehouse devwarehouse;

Example 1: Querying Snowflake Semi-Structured Data on Non-nested JSON File

The JSON data in our example defines user records, each with three properties:

  • name
  • email
  • ip_address

The destination table will have just one column, which stores the JSON data with the VARIANT data type. Use the following worksheet to load the JSON data into a Snowflake table:

create database users;

use database users;

use schema users.public;

create or replace table json_user_data (src variant);

insert into json_user_data (src)
  select parse_json('[{
  "name": "Sarah Banks",
  "email": "sarahbanks@census.gov",
  "ip_address": "28.56.176.2"
}, {
  "name": "Giovani Giorgio",
  "email": "ggiorgio@senate.gov",
  "ip_address": "237.189.4.217"
}, {
  "name": "Liam Nielsen",
  "email": "nielsen@imageshack.us",
  "ip_address": "150.33.122.255"
}, {
  "name": "Frederico Fred",
  "email": "ffred3@aol.com",
  "ip_address": "76.67.177.39"
}]');

First, let’s run a query to see the contents of the src variant column in the json_user_data table:

select * from json_user_data;
SRC
{  "name": "Sarah Banks",  "email": "sarahbanks@census.gov",  "ip_address": "28.56.176.2"}, {  "name": "Giovani Giorgio",  "email": "ggiorgio@senate.gov",  "ip_address": "237.189.4.217"}, {  "name": "Liam Nielsen",  "email": "nielsen@imageshack.us",  "ip_address": "150.33.122.255"}, {  "name": "Frederico Fred",  "email": "ffred3@aol.com",  "ip_address": "76.67.177.39"}

You are now going to use the FLATTEN table function to turn each element of the array in your src VARIANT column into its own row. This will form a relational table.

SELECT value:name::varchar AS NAME,
       value:email::varchar AS EMAIL,
       value:ip_address::varchar AS IP_ADDRESS
FROM json_user_data,
LATERAL FLATTEN(input => src);

After running this SQL query you will see that it gives you a structured view of your data.

Row | NAME            | EMAIL                 | IP_ADDRESS
1   | Sarah Banks     | sarahbanks@census.gov | 28.56.176.2
2   | Giovani Giorgio | ggiorgio@senate.gov   | 237.189.4.217
3   | Liam Nielsen    | nielsen@imageshack.us | 150.33.122.255
4   | Frederico Fred  | ffred3@aol.com        | 76.67.177.39

Example 2: Querying Snowflake Semi-Structured Data on Nested JSON File

In this example, we will try to flatten a nested JSON file.

The following JSON dataset contains data about Grammy award-winning songs. Load the dataset into a new table and store it in a VARIANT column using the following script:

create database songs;

use database songs;

create or replace table json_songs_data (src variant);

insert into json_songs_data (src)
  select parse_json(column1)
  from values
  ('{
  "song_title": "I Can''t Breathe",
  "year": 2020,
  "artists": [
    {"first_name": "Dernst", "middle_name": "Emile", "last_name": "II"},
    {"first_name": "Gabriella", "middle_name": "Sarmiento", "last_name": "Wilson"},
    {"first_name": "Tiara", "middle_name": "Nicole", "last_name": "Thomas"}
  ]
}'),
  ('{
  "song_title": "This is America",
  "year": 2019,
  "artists": [
    {"first_name": "Donald", "middle_name": null, "last_name": "Glover"},
    {"first_name": "Ludwig", "middle_name": null, "last_name": "Göransson"},
    {"first_name": "Jeffery", "middle_name": "Lamar", "last_name": "Williams"}
  ]
}'),
  ('{
  "song_title": "That''s What I Like",
  "year": 2018,
  "artists": [
    {"first_name": "Peter", "middle_name": "Gene", "last_name": "Hernandez"}
  ]
}'),
  ('{
  "song_title": "Hello",
  "year": 2017,
  "artists": [
    {"first_name": "Adele", "middle_name": "Laurie", "last_name": "Adkins"}
  ]
}'),
  ('{
  "song_title": "Thinking Out Loud",
  "year": 2016,
  "artists": [
    {"first_name": "Edward", "middle_name": "Christopher", "last_name": "Sheeran"}
  ]
}');

The following query flattens the nested JSON data so that each member of the artists array becomes its own row. The LATERAL modifier, on the other hand, joins each flattened row back to the information outside the array.

SELECT 
    value:first_name::VARCHAR AS FIRST_NAME,
    value:middle_name::VARCHAR AS MIDDLE_NAME,
    value:last_name::VARCHAR AS LAST_NAME,
    src:song_title::VARCHAR AS SONG_TITLE,
    src:year::VARCHAR AS YEAR
FROM json_songs_data
,LATERAL FLATTEN(input => src:artists);

As expected, the SQL query results display a more structured view of the data. Flatten is taking all those elements and is creating new rows. Lateral on the other hand is joining it back to the original row and is producing those columns. In other words, flatten makes the table longer while lateral makes it wider.

Row | FIRST_NAME | MIDDLE_NAME | LAST_NAME | SONG_TITLE         | YEAR
1   | Dernst     | Emile       | II        | I Can’t Breathe    | 2020
2   | Gabriella  | Sarmiento   | Wilson    | I Can’t Breathe    | 2020
3   | Tiara      | Nicole      | Thomas    | I Can’t Breathe    | 2020
4   | Donald     | NULL        | Glover    | This is America    | 2019
5   | Ludwig     | NULL        | Göransson | This is America    | 2019
6   | Jeffery    | Lamar       | Williams  | This is America    | 2019
7   | Peter      | Gene        | Hernandez | That’s What I Like | 2018
8   | Adele      | Laurie      | Adkins    | Hello              | 2017
9   | Edward     | Christopher | Sheeran   | Thinking Out Loud  | 2016

Conclusion

In summary, you can load, store, and actually use Snowflake semi-structured data in minutes. Snowflake supports all of the most common semi-structured data formats and enables you to query the data with standard SQL that you probably already know and use. Best of all, you can combine your semi-structured data with your existing data sources, enabling the comparisons that generate insights from the information you put so much effort into recording.

Visit our Website to Explore Hevo

Extracting complex data from a diverse set of data sources can be challenging, and this is where Hevo saves the day! Hevo offers a faster way to move data from 100+ Data Sources like Databases or SaaS applications into your Data Warehouses such as Snowflake, BigQuery, Redshift to be visualized in a BI tool of your choice. Hevo is fully automated and hence does not require you to code.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.
