Companies use Data Warehouses for storing and analyzing their data. The data is increasing at a staggering rate and it is essential to organize this data in an efficient manner. Business data plays an important role in delivering data-driven decisions. Snowflake is one of the widely used Cloud Data Warehouse service providers that help companies analyze data with ease.
As the data increase in the Data Warehouse, it becomes more difficult to keep track of data points. Snowflake Data Lineage allows users to have better visibility of data movement and keep track of all the activities happening to data. With the help of Snowflake Data Lineage, users can know the origin of the data, what happens to data, and where it is going.
Snowflake Data Lineage helps in tracking back the error. In this article, you learn about Snowflake Data Lineage and how it helps users keep track of data. Also, read about ACCESS_HISTORY in Snowflake Data Lineage.
What is Snowflake?
Snowflake is a Data Warehousing solution. Since it separates storage and computing, it allows you to resize your database if the workload is high, and even pause it if there is no workload. It also provides you with automatic concurrency, allowing seamless scaling in case of spiky workloads. You can choose AWS, Azure, or GCP as your cloud resources provider, and also select the region of your choice.
Like several other Data Warehouses, Snowflake organizes data in columns instead of rows, to enable parallel processing. However, all that happens under the hood. End users like us can use SQL for querying and processing data. As the size of your data and the number of processes you run increases, you feel the need to start tracking your data lifecycle (how the data was added, how it was transformed, what changes it went through, and so on).
This is where Snowflake Data Lineage comes in. Now, Snowflake doesn’t exactly have data lineage embedded (there are proprietary third-party tools to do this; a python package is also available to help you implement Snowflake Data Lineage yourself). However, it has a helpful view that can serve as a proxy to Snowflake Data Lineage, or help you derive the Snowflake Data Lineage: ACCESS_HISTORY.
Key Features of Snowflake
Some of the main features of Snowflake are listed below:
- Data Sharing: Snowflake allows organization users to easily share data with other Snowflake accounts and consume data provided by other Snowflake accounts.
- Standard SQL Support: Snowflake supports most DDL and DML in SQL. Moreover, it also has the most advanced DML in Snowflake.
- Scalability: The multi-cluster architecture of the Snowflake separates the storage and compute which allow users to easily scale up and down both independently as per their needs.
To know more about Snowflake, click here.
ACCESS_HISTORY in Snowflake
When you create a Snowflake account, you should see two databases created by default: SNOWFLAKE and SNOWFLAKE_SAMPLE_DATA. Within the SNOWFLAKE database, under the ACCOUNT_USAGE schema, you should find the ACCESS_HISTORY view for Snowflake Data Lineage, as shown in the image below.
This view has 6 columns:
- QUERY_ID: An internal ID used to identify the query. This can be used to join the QUERY_HISTORY view (in the same schema) in place of Snowflake Data Lineage. The QUERY_HISTORY view also contains some useful information, like the query type, query text, number of rows inserted, number of rows updated, etc.
- QUERY_START_TIME: The time (UTC) at which the query started execution
- USER_NAME: The user who ran the query
- DIRECT_OBJECTS_ACCESSED: This is a JSON column. The JSON contains the tables, views, and columns directly named in the query, either explicitly, or through a shortcut like an asterisk (*) for example.
- BASE_OBJECTS_ACCESSED: This, again, is a JSON column. Over here, the JSON contains the tables and columns of the base data. So, if your query is on a view, the JSON here will contain tables and columns that form the original data source for that view.
- OBJECTS_MODIFIED: A JSON column again. This is used for write operations, unlike the previous two JSON columns that are used for reading operations. This contains the tables, views, and columns that were modified in a write operation.
You can read more about the ACCESS_VIEW columns here. So, with this table, you can answer the following questions:
- Which query modified what data?
- Who ran that query?
- When was the query executed?
Examples for Snowflake Data Lineage
As you can see, these questions (rather, their answers), help you determine the data lifecycle. All you need to do is construct the SQL query on this view correctly. Let’s consider an example. We will do the following:
- Create a table
- Add some rows to it
- Read data
- Update some rows
We will then query the ACCESS_HISTORY table for Snowflake Data Lineage, and try to get this exact sequence of events from the table.
Create a new database in your snowflake account (I’ll call it MY_DATABASE). Now, we’ll create a table in this schema. Run the following query:
CREATE TABLE students(
name varchar,
roll_no varchar
)
Now, let us make some operations on this table. Run the following queries one after another:
INSERT INTO students VALUES ('Yash',26),
('Isha',56)
SELECT * from students
UPDATE students SET roll_no = 25 where name = 'Yash'
As you can see, we have an INSERT, followed by a reading, followed by an UPDATE.
Now, let’s see if we can get this sequence of events from the ACCESS_HISTORY view. Before that, let’s just examine the ACCESS_HISTORY view in place of Snowflake Data Lineage. Run the following query:
SELECT * from SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY
You shall see the following output:
As you can see, there are some queries run by WORKSHEETS_APP_USER (these queries deal with Snowflake’s internal objects). Then, you have queries identified by your username (YASHS11 in my case). These are queries that you have made. If you look at these queries and see the JSON columns, you will see an item like the following:
[
{
"columns": [
{
"columnId": 1026,
"columnName": "ROLL_NO"
},
{
"columnId": 1025,
"columnName": "NAME"
}
],
"objectDomain": "Table",
"objectId": 1026,
"objectName": "MY_DATABASE.PUBLIC.STUDENTS"
}
]
As you can see, it is an ARRAY of JSON objects. Each JSON object in that array lists the objectId, objectDomain (essentially the type of the object), the objectName (the table name in the above example), and the relevant columns (read or modified). The columns field is also a JSON array, with each JSON within the array containing the columnId and the columnName.
Now, with this information, we can try to find the history of actions taken on the students table.
Run the following query:
SELECT * from SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY
WHERE base_objects_accessed[0].objectName = 'MY_DATABASE.PUBLIC.STUDENTS'
OR objects_modified[0].objectName = 'MY_DATABASE.PUBLIC.STUDENTS'
ORDER BY query_start_time desc
You will see 3 rows in the output, corresponding to the following operations (in descending order of time):
- UPDATE
- READ
- INSERT
As you can see, DIRECT_OBJECTS_ACCESSED and BASE_OBJECTS_ACCESSED are empty JSONs in case of the INSERT statement, and OBJECTS_MODIFIED is empty in case of the READ statement.
You may have noted that you created the above query. You made use of the fact that the base_objects_accessed or the objects_modified arrays had only one entry for the queries of our concern. Therefore, You conveniently took the 0th index of these arrays.
However, that may not always be the case. You may have several queries that involve multiple tables and/or views. So, how do you deal with that? Below is one example of a query that can handle this (please note that it requires knowledge of LATERAL, FLATTEN, ARRAY_SIZE, and CASE from the Snowflake SQL syntax):
SELECT t.*
from SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY t,
LATERAL FLATTEN(CASE WHEN ARRAY_SIZE(t.base_objects_accessed)>0 then t.base_objects_accessed ELSE t.objects_modified END) b
WHERE b.value:"objectName" = 'MY_DATABASE.PUBLIC.STUDENTS'
ORDER BY query_start_time desc
This gives the same output as before.
We are essentially breaking down the JSON array into multiple rows, each containing one array element (if base_objects_modified is an empty array, we consider the objects_modified array). We then laterally join this to our access_history table and run the WHERE query on the flattened table.
You can even join the query_history table to determine the query that actually accessed the table.
SELECT t.*, q.query_text
from SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY t
LEFT JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY q on q.query_id = t.query_id,
LATERAL FLATTEN(CASE WHEN ARRAY_SIZE(t.base_objects_accessed)>0 then t.base_objects_accessed ELSE t.objects_modified END) b
WHERE b.value:"objectName" = 'MY_DATABASE.PUBLIC.STUDENTS'
ORDER BY query_start_time desc
This will showcase the query_text along with the other items of the ACCESS_HISTORY table.
Conclusion
In this article, you learnt about the ACCESS_HISTORY view in Snowflake and how it can be used as a proxy in place of Snowflake Data Lineage. We saw an example of making certain operations on a table, and we saw how those operations were extracted from the ACCESS_HISTORY view in the exact order in which they were performed. We also saw how to fetch the exact query that made the changes in the table, using the QUERY_HISTORY view.
Visit our Website to Explore Hevo
Companies have business data available in multiple sources, and it’s a tedious process to load data manually from data sources to Snowflake. Hevo Data is a No-code Data Pipeline that can help you transfer data from any data source to the desired Snowflake. It fully automates the process to load and transform data from 100+ data sources to a destination of your choice without writing a single line of code.
Want to take Hevo for a spin? Sign Up here for a 14-day free trial and experience the feature-rich Hevo suite first hand.
Share your experience of learning about Snowflake Data Lineage in the comments section below!
Yash is a trusted expert in the data industry, recognized for his role in driving online success for companies through data integration and analysis. With a Dual Degree (B. Tech + M. Tech) in Mechanical Engineering from IIT Bombay, Yash brings strategic vision and analytical rigor to every project. He is proficient in Python, R, TensorFlow, and Hadoop, using these tools to develop predictive models and optimize data workflows.