How To Handle Snowflake Data Lineage: Simplified 101

on ACCESS_HISTORY, Data Integration, data lineage, Data Warehouses, Snowflake, SQL • March 22nd, 2022

Snowflake Data Lineage

Companies use Data Warehouses for storing and analyzing their data. The data is increasing at a staggering rate and it is essential to organize this data in an efficient manner. Business data plays an important role in delivering data-driven decisions. Snowflake is one of the widely used Cloud Data Warehouse service providers that help companies analyze data with ease.

As the data increase in the Data Warehouse, it becomes more difficult to keep track of data points. Snowflake Data Lineage allows users to have better visibility of data movement and keep track of all the activities happening to data. With the help of Snowflake Data Lineage, users can know the origin of the data, what happens to data, and where it is going. 

Snowflake Data Lineage helps in tracking back the error. In this article, you learn about Snowflake Data Lineage and how it helps users keep track of data. Also, read about ACCESS_HISTORY in Snowflake Data Lineage.

Table of Contents

What is Snowflake?

Snowflake Logo
Image Source

Snowflake is a Data Warehousing solution. Since it separates storage and computing, it allows you to resize your database if the workload is high, and even pause it if there is no workload. It also provides you with automatic concurrency, allowing seamless scaling in case of spiky workloads. You can choose AWS, Azure, or GCP as your cloud resources provider, and also select the region of your choice. 

Like several other Data Warehouses, Snowflake organizes data in columns instead of rows, to enable parallel processing. However, all that happens under the hood. End users like us can use SQL for querying and processing data. As the size of your data and the number of processes you run increases, you feel the need to start tracking your data lifecycle (how the data was added, how it was transformed, what changes it went through, and so on).

This is where Snowflake Data Lineage comes in. Now, Snowflake doesn’t exactly have data lineage embedded (there are proprietary third-party tools to do this; a python package is also available to help you implement Snowflake Data Lineage yourself). However, it has a helpful view that can serve as a proxy to Snowflake Data Lineage, or help you derive the Snowflake Data Lineage: ACCESS_HISTORY

Key Features of Snowflake

Some of the main features of Snowflake are listed below:

  • Data Sharing: Snowflake allows organization users to easily share data with other Snowflake accounts and consume data provided by other Snowflake accounts.
  • Standard SQL Support: Snowflake supports most DDL and DML in SQL. Moreover, it also has the most advanced DML in Snowflake. 
  • Scalability: The multi-cluster architecture of the Snowflake separates the storage and compute which allow users to easily scale up and down both independently as per their needs.

To know more about Snowflake, click here.

Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDK,s, and Streaming Services and simplifies the ETL process. It supports 100+ data sources and is a 3-step process by just selecting the data source, providing valid credentials, and choosing the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.

Get Started with Hevo for Free

Its completely automated pipeline offers data to be delivered in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  1. Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  2. Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  3. Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  4. Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  5. Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  6. Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, E-Mail, and support calls.
  7. Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.
Sign up here for a 14-Day Free Trial!

ACCESS_HISTORY in Snowflake

When you create a Snowflake account, you should see two databases created by default: SNOWFLAKE and SNOWFLAKE_SAMPLE_DATA. Within the SNOWFLAKE database, under the ACCOUNT_USAGE schema, you should find the ACCESS_HISTORY view for Snowflake Data Lineage, as shown in the image below.

ACCESS_HISTORY view for Snowflake Data Lineage
Image Source: Self

This view has 6 columns:

  1. QUERY_ID: An internal ID used to identify the query. This can be used to join the QUERY_HISTORY view (in the same schema) in place of Snowflake Data Lineage. The QUERY_HISTORY view also contains some useful information, like the query type, query text, number of rows inserted, number of rows updated, etc.
  2. QUERY_START_TIME: The time (UTC) at which the query started execution
  3. USER_NAME: The user who ran the query
  4. DIRECT_OBJECTS_ACCESSED: This is a JSON column. The JSON contains the tables, views, and columns directly named in the query, either explicitly, or through a shortcut like an asterisk (*) for example.
  5. BASE_OBJECTS_ACCESSED: This, again, is a JSON column. Over here, the JSON contains the tables and columns of the base data. So, if your query is on a view, the JSON here will contain tables and columns that form the original data source for that view.
  6. OBJECTS_MODIFIED: A JSON column again. This is used for write operations, unlike the previous two JSON columns that are used for reading operations. This contains the tables, views, and columns that were modified in a write operation.

You can read more about the ACCESS_VIEW columns here. So, with this table, you can answer the following questions:

  • Which query modified what data?
  • Who ran that query?
  • When was the query executed?

Examples for Snowflake Data Lineage

As you can see, these questions (rather, their answers), help you determine the data lifecycle. All you need to do is construct the SQL query on this view correctly. Let’s consider an example. We will do the following:

  1. Create a table
  2. Add some rows to it
  3. Read data
  4. Update some rows

We will then query the ACCESS_HISTORY table for Snowflake Data Lineage, and try to get this exact sequence of events from the table.

Create a new database in your snowflake account (I’ll call it MY_DATABASE). Now, we’ll create a table in this schema. Run the following query:

CREATE TABLE students(
    name varchar,
    roll_no varchar
)

Now, let us make some operations on this table. Run the following queries one after another:

INSERT INTO students VALUES ('Yash',26),
('Isha',56) 
SELECT * from students
UPDATE students SET roll_no = 25 where name = 'Yash'

As you can see, we have an INSERT, followed by a reading, followed by an UPDATE.

Now, let’s see if we can get this sequence of events from the ACCESS_HISTORY view. Before that, let’s just examine the ACCESS_HISTORY view in place of Snowflake Data Lineage. Run the following query:

SELECT * from SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY

You shall see the following output:

Output of Query - Snowflake Data Lineage
Image Source: Self

As you can see, there are some queries run by WORKSHEETS_APP_USER (these queries deal with Snowflake’s internal objects). Then, you have queries identified by your username (YASHS11 in my case). These are queries that you have made. If you look at these queries and see the JSON columns, you will see an item like the following:

[
  {
    "columns": [
      {
        "columnId": 1026,
        "columnName": "ROLL_NO"
      },
      {
        "columnId": 1025,
        "columnName": "NAME"
      }
    ],
    "objectDomain": "Table",
    "objectId": 1026,
    "objectName": "MY_DATABASE.PUBLIC.STUDENTS"
  }
]

As you can see, it is an ARRAY of JSON objects. Each JSON object in that array lists the objectId, objectDomain (essentially the type of the object), the objectName (the table name in the above example), and the relevant columns (read or modified). The columns field is also a JSON array, with each JSON within the array containing the columnId and the columnName

Now, with this information, we can try to find the history of actions taken on the students table.

Run the following query:

SELECT * from SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY
WHERE base_objects_accessed[0].objectName = 'MY_DATABASE.PUBLIC.STUDENTS'
OR objects_modified[0].objectName = 'MY_DATABASE.PUBLIC.STUDENTS'
ORDER BY query_start_time desc
The output of Query - Snowflake Data Lineage
Image Source: Self

You will see 3 rows in the output, corresponding to the following operations (in descending order of time):

  1. UPDATE
  2. READ
  3. INSERT

As you can see, DIRECT_OBJECTS_ACCESSED and BASE_OBJECTS_ACCESSED are empty JSONs in case of the INSERT statement, and OBJECTS_MODIFIED is empty in case of the READ statement.

You may have noted that you created the above query. You made use of the fact that the base_objects_accessed or the objects_modified arrays had only one entry for the queries of our concern. Therefore, You conveniently took the 0th index of these arrays.

However, that may not always be the case. You may have several queries that involve multiple tables and/or views. So, how do you deal with that? Below is one example of a query that can handle this (please note that it requires knowledge of LATERAL, FLATTEN, ARRAY_SIZE, and CASE from the Snowflake SQL syntax):

SELECT t.*
from SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY t,
LATERAL FLATTEN(CASE WHEN ARRAY_SIZE(t.base_objects_accessed)>0 then t.base_objects_accessed ELSE t.objects_modified END) b
WHERE b.value:"objectName" = 'MY_DATABASE.PUBLIC.STUDENTS'
ORDER BY query_start_time desc

This gives the same output as before.

We are essentially breaking down the JSON array into multiple rows, each containing one array element (if base_objects_modified is an empty array, we consider the objects_modified array). We then laterally join this to our access_history table and run the WHERE query on the flattened table.

You can even join the query_history table to determine the query that actually accessed the table.

SELECT t.*, q.query_text
from SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY t
LEFT JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY q on q.query_id = t.query_id,
LATERAL FLATTEN(CASE WHEN ARRAY_SIZE(t.base_objects_accessed)>0 then t.base_objects_accessed ELSE t.objects_modified END) b
WHERE b.value:"objectName" = 'MY_DATABASE.PUBLIC.STUDENTS'
ORDER BY query_start_time desc

This will showcase the query_text along with the other items of the ACCESS_HISTORY table. 

Conclusion

In this article, you learnt about the ACCESS_HISTORY view in Snowflake and how it can be used as a proxy in place of Snowflake Data Lineage. We saw an example of making certain operations on a table, and we saw how those operations were extracted from the ACCESS_HISTORY view in the exact order in which they were performed. We also saw how to fetch the exact query that made the changes in the table, using the QUERY_HISTORY view.

Visit our Website to Explore Hevo

Companies have business data available in multiple sources, and it’s a tedious process to load data manually from data sources to Snowflake. Hevo Data is a No-code Data Pipeline that can help you transfer data from any data source to the desired Snowflake. It fully automates the process to load and transform data from 100+ data sources to a destination of your choice without writing a single line of code. 

Want to take Hevo for a spin? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Share your experience of learning about Snowflake Data Lineage in the comments section below!

Data Engineering
Survey 2022
Calling all data engineers – fill out this short survey to help us build an industry report for our data engineering community.
TAKE THE SURVEY
Amazon Gift Cards of $25 each are on offer for all valid and complete survey submissions.