Snowflake is a data warehouse that provides various services for advanced data analytics. Snowflake Cortex is one such service: a fully managed feature that offers a set of functions for leveraging artificial intelligence (AI) and machine learning (ML) capabilities. You can use Snowflake Cortex in complex data applications to perform high-level data processing tasks without deep AI or ML expertise. 

In this article, you will learn about Cortex functions in Snowflake and their applications to simplify various processing tasks related to advanced data applications. 

What is Snowflake Cortex? 

Snowflake Cortex is a fully managed Snowflake service that you can use for artificial intelligence (AI) and machine learning (ML) applications. It provides a set of pre-built ML and large language model (LLM) functions that simplify tasks like extracting information from unstructured or semi-structured data, forecasting, and anomaly detection.

It gives you access to prominent AI models, such as those from Google and Meta, through serverless SQL or Python functions. Snowflake handles AI model optimization and GPU infrastructure management for you, so you can focus on the data analytics tasks that drive your organization's growth. 

Core Capabilities 

You can broadly categorize the core capabilities of Snowflake Cortex into the following two sets of functions:

  • LLM Functions: These are SQL and Python functions that allow you to use large language models to understand, query, translate, summarize, and generate free-form text. You can use LLM functions such as COMPLETE, EXTRACT_ANSWER, or SUMMARIZE to draw useful insights from unstructured and semi-structured data. 
  • ML Functions: ML functions are SQL functions that facilitate predictive analytics through machine learning, helping you understand the structure of your data and speed up the data analytics process. These functions provide highly scalable and accurate ML models to forecast and explore the factors affecting your data, detect anomalies, and find outliers in your databases. 

LLM Functions 

The LLM functions of Snowflake Cortex are as follows:

COMPLETE

You can use this function to complete prompts using the LLM of your choice. The syntax for the COMPLETE function is as follows:

SNOWFLAKE.CORTEX.COMPLETE(
    <model>, <prompt_or_history> [ , <options> ] )

Here, <model> is the LLM of your preference, and <prompt_or_history> is the prompt or conversation history you want to complete. 

Take a look at an example of the COMPLETE function:

SELECT SNOWFLAKE.CORTEX.COMPLETE('snowflake-arctic', 'What are large language models?');
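COMPLETE also accepts a conversation history and an options object instead of a single prompt. Here is a sketch of that form based on the documented signature; the model choice and the temperature and max_tokens values are illustrative:

SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'snowflake-arctic',
    [
        {'role': 'system', 'content': 'You are a helpful data analyst.'},
        {'role': 'user', 'content': 'Explain large language models in one sentence.'}
    ],
    {'temperature': 0.2, 'max_tokens': 100}
);

When the options argument is present, the function returns a JSON string containing the generated response along with usage metadata, rather than plain text.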

EXTRACT_ANSWER

The EXTRACT_ANSWER function provides an answer to a question from a text document. The text can be plain English or the string representation of a semi-structured (JSON) data object. The syntax of the EXTRACT_ANSWER function is as follows:

SNOWFLAKE.CORTEX.EXTRACT_ANSWER(
    <source_document>, <question>)

Here, <source_document> is the document containing the answer, and <question> is the question you want answered. 

Below is an example of the EXTRACT_ANSWER function, where review_content is a column of the reviews table from which you want to extract the answer. 

SELECT SNOWFLAKE.CORTEX.EXTRACT_ANSWER(review_content,
    'What dishes does this review mention?')
FROM reviews LIMIT 10;
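EXTRACT_ANSWER returns an array of objects, each holding an answer and a confidence score. Assuming that output shape, the following sketch pulls out just the text of the top-ranked answer:

SELECT SNOWFLAKE.CORTEX.EXTRACT_ANSWER(review_content,
    'What dishes does this review mention?')[0]:answer::STRING AS top_answer
FROM reviews LIMIT 10;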

EMBED_TEXT_768

The EMBED_TEXT_768 function creates a 768-dimensional vector embedding of English-language text. Vector embedding is the practice of converting text or image data into numerical form. The syntax for the EMBED_TEXT_768 function is:

SNOWFLAKE.CORTEX.EMBED_TEXT_768( <model>, <text> )

Here, <model> is the embedding model you are using, and <text> is your input data.

Consider an example of generating a vector embedding for the phrase ‘hello world’:

SELECT SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'hello world');
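Embeddings become useful when you compare them. The sketch below ranks reviews by semantic similarity to a search phrase using Snowflake's VECTOR_COSINE_SIMILARITY function; the reviews table and the search phrase are illustrative:

SELECT review_content,
       VECTOR_COSINE_SIMILARITY(
           SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', review_content),
           SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'slow service')
       ) AS similarity
FROM reviews
ORDER BY similarity DESC
LIMIT 5;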

SUMMARIZE 

You can use this function to get a summary of any text in the English language. Below is the syntax for it:

SNOWFLAKE.CORTEX.SUMMARIZE(<text>)

Consider the following example of using the SUMMARIZE function:

SELECT SNOWFLAKE.CORTEX.SUMMARIZE(review_content) FROM reviews LIMIT 10;

SENTIMENT

The SENTIMENT function returns a score between -1 and 1 for English-language input text. The score indicates the sentiment of the text: -1 represents negative sentiment, 0 is neutral, and 1 indicates positive sentiment. You can use the following syntax:

SNOWFLAKE.CORTEX.SENTIMENT(<text>)

Consider an example of a table named reviews containing your customer reviews. You can run the following query to get the sentiment score of the first ten reviews:

SELECT SNOWFLAKE.CORTEX.SENTIMENT(review_content), review_content FROM reviews LIMIT 10;
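Because the score is numeric, you can filter or aggregate on it directly. For example, this sketch surfaces strongly negative reviews; the -0.5 threshold is an arbitrary illustrative choice:

SELECT review_content
FROM reviews
WHERE SNOWFLAKE.CORTEX.SENTIMENT(review_content) < -0.5
LIMIT 10;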

TRANSLATE

The TRANSLATE function translates text into your desired language. The syntax is as follows:

SNOWFLAKE.CORTEX.TRANSLATE(
    <text>, <source_language>, <target_language>)

The TRANSLATE function supports a fixed set of languages, identified by codes such as 'en' (English) and 'de' (German); refer to the Snowflake documentation for the complete list of supported languages.

Consider an example of an English-to-German translation of the first ten rows of the review_content column from a table named reviews:

SELECT SNOWFLAKE.CORTEX.TRANSLATE(review_content, 'en', 'de') FROM reviews LIMIT 10;
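If you do not know the source language in advance, the Snowflake documentation allows passing an empty string as the source language so that it is detected automatically; a sketch assuming that behavior:

SELECT SNOWFLAKE.CORTEX.TRANSLATE(review_content, '', 'de') FROM reviews LIMIT 10;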

ML Functions 

The Snowflake Cortex ML functions are as follows:

Forecasting

Forecasting is an ML function that enables you to predict future data from historical time series data. It generates univariate predictions of future trends for a single time series or for multiple time series. The input data must include:

  • A timestamp column with a fixed frequency, for instance, every 1 hour or 5 minutes. 
  • A target column representing a variable of your interest for which you want to make predictions at each timestamp.

Historical data can also include columns for factors that might have influenced the target; these are called exogenous variables. They should be in columnar form with numerical or character values.

To create forecasts, you can use the Snowflake built-in class SNOWFLAKE.ML.FORECAST and follow the steps below:

  1. First, create a forecast model object by providing training data using the following syntax:
CREATE [ OR REPLACE ] SNOWFLAKE.ML.FORECAST [ IF NOT EXISTS ] <model_name>(
  INPUT_DATA => <input_data>,
  [ SERIES_COLNAME => '<series_colname>', ]
  TIMESTAMP_COLNAME => '<timestamp_colname>',
  TARGET_COLNAME => '<target_colname>',
  [ CONFIG_OBJECT => <config_object> ]
)
[ [ WITH ] TAG ( <tag_name> = '<tag_value>' [ , <tag_name> = '<tag_value>' , ... ] ) ]
[ COMMENT = '<string_literal>' ]
  2. You can then use this forecast model object and call the <model_name>!FORECAST method to predict future values of the target. You can pass the number of future timestamps as forecasting periods and supply future values of any exogenous variables; a combined end-to-end example follows the syntax variants below. 

Here are the syntaxes for various use cases to generate forecasts using previously trained models.

  • Single-series model without exogenous variables:
<name>!FORECAST(
  FORECASTING_PERIODS => <forecasting_periods>,
  [ CONFIG_OBJECT => <config_object> ]
);
  • Single-series model with exogenous variables:
<name>!FORECAST(
  INPUT_DATA => <input_data>,
  TIMESTAMP_COLNAME => '<timestamp_colname>',
  [ CONFIG_OBJECT => <config_object> ]
);
  • Multi-series model without exogenous variables:
<name>!FORECAST(
  SERIES_VALUE => <series>,
  FORECASTING_PERIODS => <forecasting_periods>,
  [ CONFIG_OBJECT => <config_object> ]
);
  • Multi-series model with exogenous variables:
<name>!FORECAST(
  SERIES_VALUE => <series>,
  SERIES_COLNAME => '<series_colname>',
  INPUT_DATA => <input_data>,
  TIMESTAMP_COLNAME => '<timestamp_colname>',
  [ CONFIG_OBJECT => <config_object> ]
);
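Putting the two steps together, here is a minimal end-to-end sketch for a single series without exogenous variables. The daily_sales table, its sale_date and revenue columns, and the 14-period horizon are all hypothetical:

-- Step 1: train a forecast model on historical data (hypothetical table and columns)
CREATE OR REPLACE SNOWFLAKE.ML.FORECAST sales_forecast(
  INPUT_DATA => TABLE(daily_sales),
  TIMESTAMP_COLNAME => 'sale_date',
  TARGET_COLNAME => 'revenue'
);

-- Step 2: predict the target for the next 14 periods
CALL sales_forecast!FORECAST(FORECASTING_PERIODS => 14);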

Anomaly Detection

Anomaly detection is the process of identifying outliers and unexpected values in data. In Snowflake Cortex, anomaly detection is an ML function that detects such discrepancies in datasets by training a model to identify values that deviate from the rest of your data. You can use it for single-series or multi-series data, and the input must include:

  • A timestamp column with a fixed frequency. For example, hourly or every 5 minutes.
  • A target column representing a variable of your interest at each timestamp.

To detect deviating values in your data, you can use the Snowflake built-in class SNOWFLAKE.ML.ANOMALY_DETECTION and follow the steps below: 

  1. Create an anomaly detection object by providing training data to the model using the following syntax:
CREATE [ OR REPLACE ] SNOWFLAKE.ML.ANOMALY_DETECTION <model_name>(
  INPUT_DATA => <reference_to_training_data>,
  [ SERIES_COLNAME => '<series_column_name>', ]
  TIMESTAMP_COLNAME => '<timestamp_column_name>',
  TARGET_COLNAME => '<target_column_name>',
  LABEL_COLNAME => '<label_column_name>',
  [ CONFIG_OBJECT => <config_object> ]
)
[ [ WITH ] TAG ( <tag_name> = '<tag_value>' [ , <tag_name> = '<tag_value>' , ... ] ) ]
[ COMMENT = '<string_literal>' ]
  2. You can then use this anomaly detection model object and call the <model_name>!DETECT_ANOMALIES method, which uses the trained model to identify outliers in the data you pass it: 
<model_name>!DETECT_ANOMALIES(
  INPUT_DATA => <reference_to_data_to_analyze>,
  TIMESTAMP_COLNAME => '<timestamp_column_name>',
  TARGET_COLNAME => '<target_column_name>',
  [ CONFIG_OBJECT => <configuration_object>, ]
  [ SERIES_COLNAME => '<series_column_name>' ]
)
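As with forecasting, the two steps combine into a short end-to-end flow. In this sketch, the historical_readings and new_readings tables and their columns are hypothetical; the empty LABEL_COLNAME trains the model without labeled anomalies:

-- Step 1: train on historical data (hypothetical tables and columns)
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION sensor_model(
  INPUT_DATA => TABLE(historical_readings),
  TIMESTAMP_COLNAME => 'reading_time',
  TARGET_COLNAME => 'temperature',
  LABEL_COLNAME => ''  -- no labeled anomalies: unsupervised training
);

-- Step 2: scan new data for outliers
CALL sensor_model!DETECT_ANOMALIES(
  INPUT_DATA => TABLE(new_readings),
  TIMESTAMP_COLNAME => 'reading_time',
  TARGET_COLNAME => 'temperature'
);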

Contribution Explorer 

Contribution Explorer is an ML function that helps you analyze the root cause of changes in certain metrics within your datasets. For instance, while observing sales datasets for your enterprise, you can use Contribution Explorer to analyze changes in your revenue based on metrics such as location, age, and gender of your customers. 

To implement Contribution Explorer, your dataset should fulfill the following criteria:

  • It should have one or more non-negative metrics. The change in metric from one row to the next can be negative, but the metric itself must never be negative. 
  • The datasets should consist of one or more timestamps.
  • It should have columns or parameters that can be used to segment the data. 

You can use the TOP_INSIGHTS table function to deploy Contribution Explorer on your datasets. It finds the most important dimensions of your dataset, segments the data along those dimensions, and detects which segments are driving changes in your metric. The syntax for the TOP_INSIGHTS function is as follows:

SNOWFLAKE.ML.TOP_INSIGHTS(
  <categorical_dimensions>, <continuous_dimensions>,
  <metric>, <label> )

Consider an example in which you first create a table named input_table and add control and test group data to it. The control group is the subset of data that serves as the baseline, and the test group is the subset you want to analyze against it. Once this is done, you can generate important insights from input_table using the following query:

WITH input AS (
  SELECT
    {
      'country': input_table.dim_country,
      'vertical': input_table.dim_vertical
    }
    AS categorical_dimensions,
    {
         'length_of_vertical': length(input_table.dim_vertical)
    }
    AS continuous_dimensions,
    input_table.metric,
    IFF(ds BETWEEN '2020-08-01' AND '2020-08-20', TRUE, FALSE) AS label
  FROM input_table
  WHERE
    (ds BETWEEN '2020-05-01' AND '2020-05-20') OR
    (ds BETWEEN '2020-08-01' AND '2020-08-20')
)
SELECT res.* from input, TABLE(
  SNOWFLAKE.ML.TOP_INSIGHTS(
    input.categorical_dimensions,
    input.continuous_dimensions,
    CAST(input.metric AS FLOAT),
    input.label
  )
  OVER (PARTITION BY 0)
) res ORDER BY res.surprise DESC;

Classification

Classification is a Snowflake Cortex ML function that uses ML algorithms to sort data into classes by detecting patterns in training data. It is currently in public preview and supports binary and multi-class classification. Common use cases include predicting customer churn and detecting spam or fraud. 

To use classification, you must first create a classification model object by providing training data. You can then use this model to classify new data points, as sketched after the syntax below. 

CREATE [ OR REPLACE ] SNOWFLAKE.ML.CLASSIFICATION [ IF NOT EXISTS ] <model_name> (
    INPUT_DATA => <input_data>,
    TARGET_COLNAME => '<target_colname>',
    [ CONFIG_OBJECT => <config_object> ]
)
[ [ WITH ] TAG ( <tag_name> = '<tag_value>' [ , <tag_name> = '<tag_value>' , ... ] ) ]
[ COMMENT = '<string_literal>' ]
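Once trained, the model's <model_name>!PREDICT method scores new rows. In the sketch below, the training_customers and new_customers tables, the churned target column, and the churn_model name are all hypothetical:

-- Train a binary churn classifier (hypothetical table and target column)
CREATE OR REPLACE SNOWFLAKE.ML.CLASSIFICATION churn_model(
  INPUT_DATA => TABLE(training_customers),
  TARGET_COLNAME => 'churned'
);

-- Score new customers; object_construct(*) packs each row's columns into one input object
SELECT customer_id,
       churn_model!PREDICT(INPUT_DATA => object_construct(*)) AS prediction
FROM new_customers;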

Features Powered by Snowflake Cortex 

Snowflake supports the following additional features powered by Cortex, which are currently in preview:

Document AI

Document AI is a feature that uses Arctic-TILT, a large language model, to extract information from documents. It processes documents in various formats and pulls data from both textual paragraphs and graphical content, such as logos, handwritten text like signatures, and checkmarks. Document AI thus simplifies data extraction from documents with the help of Snowflake Cortex. 

You can use Document AI to prepare workflows in the areas that require continuous processing of data, such as invoices or finance statements. It can also be used to convert unstructured data in documents to a structured format in tables. 

Universal Search

Universal Search is a feature that helps you locate database objects in your account. It also helps to search data products in the Snowflake Marketplace, Snowflake documentation topics, and Snowflake Community Knowledge Base articles.

Universal Search allows you to use natural language for your queries. For instance, when you search using the keyword ‘customer,’ you will get results from objects such as ‘customer_name,’ ‘customer_ID,’ or ‘customer_address.’ Note that Universal Search matches only object metadata, not the content within your database objects. 

Snowflake Copilot

Snowflake Copilot is an LLM-powered assistant within Snowflake Cortex that assists in data analytics tasks. It can help you generate SQL queries to extract information from datasets. The Snowflake Copilot also gives recommendations on how to improve your SQL queries to optimize their performance. You can use Snowsight to leverage the Snowflake Copilot feature in your SQL worksheets. 

Role of Hevo Data in Snowflake Cortex 

With its versatile features and capabilities, Snowflake Cortex is a useful Snowflake service for performing advanced data analytics. Data integration plays a significant role in enhancing Cortex's performance. 

Data integration is the process of collecting data from multiple sources, then transforming it and loading it into a centralized destination. The data ingested during integration can be unstructured, semi-structured, or structured; you convert it into a standard form through transformation, cleaning, and compression. This gives you streamlined data that you can use with Cortex functions to train AI and ML models. A data integration tool like Hevo Data can prepare your data for AI- and ML-based data analytics. 

Hevo Data is a no-code ELT platform that provides real-time data integration and a cost-effective way to automate your data pipeline workflow. With over 150 source connectors, you can integrate your data into multiple platforms, conduct advanced analysis on your data, and produce useful insights.

Here are some of the most important features provided by Hevo Data:

  • Data Transformation: Hevo Data lets you transform your data for analysis with simple Python-based and drag-and-drop transformation techniques, so you can clean, filter, and compress your data before loading it into Snowflake. 
  • Automated Schema Mapping: Hevo Data automatically arranges the destination schema to match the incoming data and lets you choose between Full and Incremental Mapping. By establishing relationships between data elements from various sources and organizing them in a structured format, it ensures data reaches Snowflake Cortex functions in a consistent form, which is especially useful for ML functions such as forecasting, Contribution Explorer, and classification. 
  • Incremental Data Load: Snowflake Cortex involves training AI and ML models. Hevo's incremental data load feature ensures the continuous availability of new or updated data for training these models. By transferring only modified data in near real time, it makes efficient use of bandwidth at both the source and the destination, contributing to improved model performance. 

Snowflake Cortex depends on clean, well-integrated data to generate insights and build machine learning models. To achieve this, you can rely on Hevo Data, a versatile data integration tool. 


Conclusion

This blog provides a detailed overview of Snowflake Cortex, with comprehensive information on its LLM and ML functions and examples of each. By replacing complex infrastructure with simple, easy-to-use functions, Snowflake Cortex lets you perform complicated tasks such as forecasting, anomaly detection, and predictive analysis. 

Data integration is essential to utilize Snowflake Cortex fully. For this, you can use Hevo Data, a no-code data integration platform. It offers an extensive library of connectors, robust data transformation, and incremental data loading capabilities. You can schedule a demo today to take advantage of these Hevo features and get the most out of Snowflake Cortex!

FAQs  

  1. Is Snowflake Cortex free? 

You cannot use it for free, but Snowflake offers a 30-day free trial. After the trial, you are charged according to your usage. 

  2. What is the Snowflake Cortex VECTOR data type? 

The VECTOR data type stores fixed-length arrays of numeric values and is used to hold vector embeddings, which are numerical representations of text or image data. Snowflake Cortex LLM functions such as EMBED_TEXT_768 return VECTOR values, which you can use for semantic similarity search and other AI applications. 

Suraj Poddar
Principal Frontend Engineer, Hevo Data

Suraj has over a decade of experience in the tech industry, with a significant focus on architecting and developing scalable front-end solutions. As a Principal Frontend Engineer at Hevo, he has played a key role in building core frontend modules, driving innovation, and contributing to the open-source community. Suraj's expertise includes creating reusable UI libraries, collaborating across teams, and enhancing user experience and interface design.