• Data ingestion is the process of gathering data from different sources and loading it into a centralized repository.
  • It involves discovering, acquiring, and transforming data before loading it into a destination, where the ingested data can then be accessed easily for analytics.
  • Several databases and data warehouses can serve as that destination. Many organizations opt for Snowflake data ingestion because of its fast performance and pricing structure.

This article provides information on how to ingest data into Snowflake and best practices to utilize it for in-depth data analytics.

What is Snowflake Data Ingestion? 

Snowflake data ingestion is a process of collecting data from multiple sources in various formats and transferring it for storage in Snowflake.

You can then analyze this data efficiently using the platform’s different features. Snowflake can process structured and semi-structured data types in real-time or in batches.

It automates data ingestion and handles high data loads by scaling compute resources on demand. In Snowflake, the location of data files in cloud storage is known as a stage. Here are the different types of stages in Snowflake: 

  • External Stage: Ingestion from an external stage loads data from cloud storage services such as Amazon S3, Google Cloud Storage, and Microsoft Azure. This is supported regardless of the cloud platform that hosts your Snowflake account. 
  • Internal Stage: Snowflake offers the following stage types within your account:
    • User: Stores staged files for a single user; those files can be loaded into multiple tables. 
    • Table: Stores staged files that can be accessed by many users, but the files can only be loaded into the associated table. 
    • Named: A database object created in a schema. A named internal stage stores files used by more than one user, and those files can be loaded into multiple tables. You cannot alter or drop user and table stages, but because a named stage is a database object, access to it can be controlled with security access privileges, so you can create, alter, or drop named stages. Each stage type is referenced differently, as shown in the example after this list. 
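
For example, assuming a table named orders and a named stage called my_stage (both names are illustrative), each stage type is referenced as follows:

-- User stage: one per user, referenced with @~
LIST @~;

-- Table stage: one per table, referenced with @% followed by the table name
LIST @%orders;

-- Named stage: a schema-level database object, referenced with @ and its name
LIST @my_stage;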

Creating Stages in Snowflake

You can create stages using the CREATE STAGE command in Snowflake as follows:

Internal Stage

CREATE [ OR REPLACE ] [ { TEMP | TEMPORARY } ] STAGE [ IF NOT EXISTS ] <internal_stage_name>
    internalStageParams
    directoryTableParams
  [ COMMENT = '<string_literal>' ]
  [ [ WITH ] TAG ( <tag_name> = '<tag_value>' [ , <tag_name> = '<tag_value>' , ... ] ) ]

Where the internalStageParams for the internal stage can be elaborated as follows:

internalStageParams ::=
  [ ENCRYPTION = (TYPE = 'SNOWFLAKE_FULL' | TYPE = 'SNOWFLAKE_SSE') ]

The directoryTableParams for the internal stage is: 

directoryTableParams (for internal stages) ::=
  [ DIRECTORY = ( ENABLE = { TRUE | FALSE }
                  [ REFRESH_ON_CREATE =  { TRUE | FALSE } ] ) ]
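
As a minimal sketch, the following statement creates a named internal stage with server-side encryption and a directory table; the stage name and comment are illustrative:

-- Named internal stage with a directory table enabled
CREATE STAGE IF NOT EXISTS my_internal_stage
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
  DIRECTORY = (ENABLE = TRUE REFRESH_ON_CREATE = TRUE)
  COMMENT = 'Internal stage for raw CSV files';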

External Stage

CREATE [ OR REPLACE ] [ { TEMP | TEMPORARY } ] STAGE [ IF NOT EXISTS ] <external_stage_name>
    externalStageParams
    directoryTableParams
  [ COMMENT = '<string_literal>' ]
  [ [ WITH ] TAG ( <tag_name> = '<tag_value>' [ , <tag_name> = '<tag_value>' , ... ] ) ]

Here, the externalStageParams and directoryTableParams can be elaborated as follows:

  • Amazon S3
externalStageParams (for Amazon S3) ::=
  URL = { 's3://<bucket>[/<path>/]' | 's3gov://<bucket>[/<path>/]' }
  [ { STORAGE_INTEGRATION = <integration_name> } | { CREDENTIALS = ( {  { AWS_KEY_ID = '<string>' AWS_SECRET_KEY = '<string>' [ AWS_TOKEN = '<string>' ] } | AWS_ROLE = '<string>'  } ) } ]
  [ ENCRYPTION = ( [ TYPE = 'AWS_CSE' ] [ MASTER_KEY = '<string>' ] |
                   [ TYPE = 'AWS_SSE_S3' ] |
                   [ TYPE = 'AWS_SSE_KMS' [ KMS_KEY_ID = '<string>' ] ] |
                   [ TYPE = 'NONE' ] ) ]

directoryTableParams (for Amazon S3) ::=
  [ DIRECTORY = ( ENABLE = { TRUE | FALSE }
                  [ REFRESH_ON_CREATE =  { TRUE | FALSE } ]
                  [ AUTO_REFRESH = { TRUE | FALSE } ] ) ]
  • Google Cloud Platform
externalStageParams (for Google Cloud Storage) ::=
  URL = 'gcs://<bucket>[/<path>/]'
  [ STORAGE_INTEGRATION = <integration_name> ]
  [ ENCRYPTION = ( [ TYPE = 'GCS_SSE_KMS' ] [ KMS_KEY_ID = '<string>' ] | [ TYPE = 'NONE' ] ) ]

directoryTableParams (for Google Cloud Storage) ::=
  [ DIRECTORY = ( ENABLE = { TRUE | FALSE }
                  [ AUTO_REFRESH = { TRUE | FALSE } ]
                  [ REFRESH_ON_CREATE =  { TRUE | FALSE } ]
                  [ NOTIFICATION_INTEGRATION = '<notification_integration_name>' ] ) ]
  • Microsoft Azure
externalStageParams (for Microsoft Azure) ::=
  URL = 'azure://<account>.blob.core.windows.net/<container>[/<path>/]'
  [ { STORAGE_INTEGRATION = <integration_name> } | { CREDENTIALS = ( [ AZURE_SAS_TOKEN = '<string>' ] ) } ]
   [ ENCRYPTION = ( [ TYPE = 'AZURE_CSE' ] [ MASTER_KEY = '<string>' ] | [ TYPE = 'NONE' ] ) ]

directoryTableParams (for Microsoft Azure) ::=
  [ DIRECTORY = ( ENABLE = { TRUE | FALSE }
                  [ REFRESH_ON_CREATE =  { TRUE | FALSE } ]
                  [ AUTO_REFRESH = { TRUE | FALSE } ]
                  [ NOTIFICATION_INTEGRATION = '<notification_integration_name>' ] ) ]
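
As a hedged example, the following statement creates an external stage over an Amazon S3 bucket using a storage integration; the integration name and bucket path are placeholders for your own setup:

-- External stage over an S3 bucket (names are illustrative)
CREATE STAGE IF NOT EXISTS my_s3_stage
  URL = 's3://my-bucket/raw/'
  STORAGE_INTEGRATION = my_s3_integration
  DIRECTORY = (ENABLE = TRUE)
  COMMENT = 'External stage over the raw data bucket';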

You can upload files from the local system to any internal stage using the PUT command:

PUT file://<path_to_file>/<filename> internalStage
    [ PARALLEL = <integer> ]
    [ AUTO_COMPRESS = TRUE | FALSE ]
    [ SOURCE_COMPRESSION = AUTO_DETECT | GZIP | BZ2 | BROTLI | ZSTD | DEFLATE | RAW_DEFLATE | NONE ]
    [ OVERWRITE = TRUE | FALSE ]

These staged files can be in CSV, JSON, or any other supported format.
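
For instance, assuming the internal stage my_internal_stage from the earlier sketch and a local CSV file, a PUT command run from a client such as SnowSQL would look like this:

-- Upload a local file to the named internal stage (paths and names are illustrative)
PUT file:///tmp/orders_2024.csv @my_internal_stage
    AUTO_COMPRESS = TRUE
    OVERWRITE = FALSE;

-- Confirm the file arrived in the stage
LIST @my_internal_stage;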

Different Ways of Data Ingestion in Snowflake

The staged files from the external or internal stage of your Snowflake account can be transferred to a Snowflake table. You can use batch loading, continuous loading, real-time loading with Kafka, or an ELT tool to ingest data into Snowflake. Let’s take a look at each of these in detail:

Batch Loading

Batch loading involves moving data in bulk, as batches, from staged files to the target tables in Snowflake. There are two approaches for batch data ingestion:


Batch Loading With Your Own Virtual Warehouse

This is the most common way of loading data into Snowflake. Here, you run the COPY INTO command in one of your own managed virtual warehouses. Snowflake virtual warehouses come in various sizes, and you can pick whichever fits your needs. It is advisable to start with a small warehouse to keep charges low and increase the size as required. The smallest virtual warehouse (X-Small) can ingest eight files in parallel, and that number doubles with each increase in size. 

Batch loading with your own virtual warehouse is one of the best ways to ingest data into Snowflake. However, you should make sure the number of files matches the warehouse size so that you are not paying for unused compute resources. 

Batch Loading With Serverless Tasks

You can use serverless tasks to run the COPY INTO command for batch data ingestion. Serverless tasks use compute resources managed by Snowflake instead of your own virtual warehouses. To use them, replace the WAREHOUSE parameter with the USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE parameter in your task definition; it sets the size of the compute resource for the first few runs of the task. After a few runs, Snowflake uses the task history metadata to pick the right compute size for subsequent executions. 
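
A minimal sketch of such a serverless task, assuming a hypothetical orders table and the my_internal_stage stage used earlier, might look like this:

-- Serverless task: note USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE instead of a WAREHOUSE
CREATE TASK load_orders_hourly
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
  SCHEDULE = '60 MINUTE'
AS
  COPY INTO orders
  FROM @my_internal_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Tasks are created suspended; resume the task to start the schedule
ALTER TASK load_orders_hourly RESUME;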

You only have to pay per second of compute used when using serverless tasks. It resolves the underutilization problem encountered when using your own virtual warehouse. However, the compute cost of using serverless tasks is 1.5 times that required for using your own virtual warehouse of equivalent size. So, it is advisable to use this method only when you cannot completely saturate your virtual warehouse to avoid paying idle compute costs. 

The COPY INTO command for batch loading is as follows:

For standard data load:

COPY INTO [<namespace>.]<table_name>
     FROM { internalStage | externalStage | externalLocation }
[ FILES = ( '<file_name>' [ , '<file_name>' ] [ , ... ] ) ]
[ PATTERN = '<regex_pattern>' ]
[ FILE_FORMAT = ( { FORMAT_NAME = '[<namespace>.]<file_format_name>' |
                    TYPE = { CSV | JSON | AVRO | ORC | PARQUET | XML } [ formatTypeOptions ] } ) ]
[ copyOptions ]
[ VALIDATION_MODE = RETURN_<n>_ROWS | RETURN_ERRORS | RETURN_ALL_ERRORS ]
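
For example, a minimal bulk load of gzip-compressed CSV files from the internal stage into a hypothetical orders table could look like this:

COPY INTO orders
  FROM @my_internal_stage
  PATTERN = '.*orders_.*[.]csv[.]gz'
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
  ON_ERROR = 'CONTINUE';  -- skip problem rows instead of aborting the load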

For data load with transformation:

COPY INTO [<namespace>.]<table_name> [ ( <col_name> [ , <col_name> ... ] ) ]
     FROM ( SELECT [<alias>.]$<file_col_num>[.<element>] [ , [<alias>.]$<file_col_num>[.<element>] ... ]
            FROM { internalStage | externalStage } )
[ FILES = ( '<file_name>' [ , '<file_name>' ] [ , ... ] ) ]
[ PATTERN = '<regex_pattern>' ]
[ FILE_FORMAT = ( { FORMAT_NAME = '[<namespace>.]<file_format_name>' |
                    TYPE = { CSV | JSON | AVRO | ORC | PARQUET | XML } [ formatTypeOptions ] } ) ]
[ copyOptions ]
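
As a sketch of a load with transformation, the following statement picks only the first and third columns of the staged CSV files and maps them to two columns of the hypothetical orders table:

COPY INTO orders (order_id, amount)
  FROM (SELECT t.$1, t.$3 FROM @my_internal_stage t)
  PATTERN = '.*orders_.*[.]csv[.]gz'
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);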

Continuous Loading

  • Continuous data loading is initiated when an event occurs, such as the addition of a new file to source storage. The event triggers a notification and automatically starts the data ingestion process.
  • You can use cloud-based messaging services like AWS SNS or SQS to deliver this notification. These messaging services can be integrated with Snowflake to inform it about newly added files. 
  • In continuous loading, data is constantly loaded to the stages, which reduces data latency. To facilitate this, Snowflake offers a feature called Snowpipe that integrates with event notification services such as AWS SNS or SQS.
  • Snowpipe wraps a COPY statement that is triggered whenever new files are added to the stage.

You can also use the auto-ingest feature of Snowpipe, which loads files into Snowflake automatically as they arrive. Alternatively, Snowpipe exposes a REST API so that you can control when ingestion is triggered. 
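
A minimal auto-ingest pipe, assuming the external stage my_s3_stage and the orders table from the earlier sketches, could be defined as follows:

-- AUTO_INGEST = TRUE relies on cloud event notifications (for example S3 -> SQS)
CREATE PIPE orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO orders
  FROM @my_s3_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- SHOW PIPES returns the notification channel to configure on the cloud side
SHOW PIPES LIKE 'orders_pipe';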

Snowpipe costs 0.06 credits per 1,000 files processed. To optimize usage costs, compress your files before loading them through Snowpipe. 

Real-Time Loading With Kafka

For real-time data loading, you can use a message broker such as Apache Kafka. It receives messages from producers, which are your data sources, and delivers them to consumers such as Snowflake. Producers and consumers are client applications that write and read data in Kafka.

Depending upon the type of ingestion that you want to perform, there are two ways to use Apache Kafka for Snowflake real-time data ingestion. These are as follows:

Kafka Connector Snowpipe Mode


In this mode, the Kafka connector works together with Snowpipe using micro-batched files. Messages received from Kafka are flushed into temporary files and then ingested into Snowflake through Snowpipe. The Kafka connector also lets you control how frequently new files are created. You can use this mode to reduce both ingestion latency and Snowflake usage cost. 

Kafka Snowpipe Streaming


Snowpipe Streaming is a newer Snowflake service in which data is loaded row by row. It results in lower latency than bulk data loads or Snowpipe mode, both of which ingest data from staged files. 

The cost of Snowpipe Streaming is one credit per compute-hour, compared to Snowpipe mode’s 1.25 credits per compute-hour. This makes Snowpipe Streaming the most cost-effective method for loading data into Snowflake. 

The different Snowflake data ingestion approaches are compared side by side in the table toward the end of this article.

ELT Tools Like Hevo Data

ELT is a data processing technique in which data is extracted, loaded, and then transformed, and it can aid you in the data ingestion process. To simplify Snowflake data ingestion, you can choose from several third-party Snowflake data ingestion tools that facilitate ELT. 

Hevo Data is one such tool. It is a no-code ELT platform that provides near real-time data integration and a cost-effective way to automate your data pipeline workflow. With over 150 source connectors, you can integrate data from multiple platforms into Snowflake, conduct advanced analysis on your data, and produce useful insights.

Here are some of the most important features provided by Hevo Data:

  • Data Transformation: Hevo Data allows you to transform your data for analysis with simple Python-based and drag-and-drop data transformation techniques. It simplifies the cleaning and formatting of data before loading it into the Snowflake warehouse. 
  • Automated Schema Mapping: Hevo Data automatically arranges the destination schema to match the incoming data. It also lets you choose between Full and Incremental Mapping. The automated schema mapping feature of Hevo eliminates the need for manual schema mapping and improves the accuracy of the data ingestion process.
  • Incremental Data Load: It ensures proper bandwidth utilization at both the source and the destination by allowing real-time data transfer of the modified data. Hevo’s incremental data load feature reduces processing time by focusing only on new or updated data while reloading. Lower processing time reduces the usage cost of Snowflake’s compute resources. 

Hevo enables you to specify the ingestion and loading frequency for loading data into Snowflake from various sources. The platform facilitates either pull-based or push-based data ingestion. You can explore several other features of the platform by scheduling a demo.

A Brief Comparison of Different Snowflake Data Ingestion Methods

After exploring all the methods, here’s a quick tabular comparison summarizing the different Snowflake data ingestion approaches:

| | Batch Data Ingestion | Continuous Data Ingestion | Real-Time Data Ingestion | ELT Using Hevo |
| --- | --- | --- | --- | --- |
| How data is loaded | Data is loaded in bulk, as batches, to the staged files. | Data is loaded as micro-batches to the staged files. | Data is loaded as micro-batches in Kafka Snowpipe mode and row by row with Snowpipe Streaming. | Enables either pull-based or push-based data ingestion. |
| How data is ingested | Using the COPY command or a Snowflake serverless task. | Using Snowflake’s Snowpipe feature. | Through Kafka Snowpipe mode or Kafka Snowpipe Streaming. | Automatically, through the ELT process. |
| Data latency | High. | Lower than batch ingestion. | Kafka Snowpipe mode is similar to continuous loading; Snowpipe Streaming has the lowest latency, ingesting data within seconds. | Low; adjustable through the ingestion frequency and loading frequency settings. |

Best Practices for Data Ingestion With Snowflake

You can adopt the following best practices to optimize spending on data ingestion in Snowflake:

Monitor the Usage and Cost of Ingestion

While using COPY INTO and Snowpipe, you can use the COPY_HISTORY function to check service usage and data load history. For Snowpipe Streaming, you can query the SNOWPIPE_STREAMING_FILE_MIGRATION_HISTORY view to see load history. You can monitor costs through the pipe_usage_history view. For instance, the following query returns the complete list of pipes and the credits they consumed over the last month:

SELECT TO_DATE(start_time) AS date,
  pipe_name,
  SUM(credits_used) AS credits_used
FROM snowflake.account_usage.pipe_usage_history
WHERE start_time >= DATEADD(month,-1,CURRENT_TIMESTAMP())
GROUP BY 1,2
ORDER BY 3 DESC;
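
Similarly, the COPY_HISTORY table function returns recent load activity for a single table; the table name below is illustrative:

SELECT file_name, last_load_time, row_count, status
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'ORDERS',
  START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())));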

Ensure Alignment Between Ingestion Mode and the Number, Size, and Path of Files

In addition to compute charges, Snowpipe adds a small overhead charge per file. So it is often better to use the COPY INTO command rather than Snowpipe when you are ingesting many small files.

It is recommended that you work with files in the range of 100-250 MiB when using Snowpipe. You should aggregate smaller files before batch ingestion.

However, a single file should not exceed 5 GiB; split larger files so that you benefit from Snowflake’s parallel loading. A very large file can cause the data ingestion process to fail.

Compress Files Before Ingestion 

Snowflake can decompress files during ingestion if they are loaded in a compressed format. It automatically detects common compression formats such as GZIP, BZ2, SNAPPY, and ZSTD. The file structure, such as a nested schema, also affects Snowflake’s ingestion performance. 

For instance, a compressed file with few columns but a deeply nested structure can take as long to ingest as an uncompressed file with many columns.

Even so, loading compressed files is one of the most important best practices for Snowflake data ingestion: the time Snowflake spends decompressing a file is less than the time it would take to load the same file uncompressed. 
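
For instance, a COPY INTO statement loading gzip-compressed CSV files can simply leave compression detection to Snowflake; the table and stage names below are illustrative:

-- COMPRESSION = AUTO detects and decompresses gzip (and other supported formats) during the load
COPY INTO orders
  FROM @my_internal_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 COMPRESSION = AUTO);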

Use Snowpipe More Often Than the COPY INTO Command

Snowpipe loads data in micro-batches rather than the bulk loads used in batch ingestion. As a result, Snowpipe achieves lower data latency than the COPY command can. Snowpipe also integrates with messaging services like AWS SQS and SNS, or with its REST API, to detect new files, which is not possible with the COPY statement alone. 

Snowpipe is a serverless feature of Snowflake, so you do not have to set up a virtual warehouse to run it; Snowflake automatically provisions the compute cluster for Snowpipe.

Also, the COPY command has to be scheduled to run at the right time, whereas Snowpipe is triggered automatically through event notification services or REST API calls. 

Use Snowpipe Streaming to Reduce Latency

You should opt for Snowpipe Streaming for low-latency Snowflake data ingestion. This mode loads data row by row, in contrast to the bulk loading of Snowpipe mode. The usage cost of Snowpipe Streaming is also lower than that of Snowpipe or batch ingestion. Additionally, Snowpipe Streaming eliminates staging files and ingests data directly from the source. 

Use Cases of Snowflake Data Ingestion

There are several applications of Snowflake data ingestion, such as:

  • Finance Sector: Snowflake data ingestion in the finance sector can help create a centralized repository for trade surveillance, fraud detection, and anti-money laundering measures. 
  • Healthcare: It can be used in the healthcare industry to maintain patient health data records and gain useful insights for novel medical research. 
  • Manufacturing: Snowflake data ingestion can be used in several industries to carry out smart, data-driven manufacturing. It can streamline the supply chain and improve operational efficiency through real-time data analytics.
  • Retail Sector: Snowflake data ingestion can help maintain and analyze customer data in the retail sector. Customer data analysis contributes to developing a robust product strategy and delivering enhanced customer services. 

Conclusion 

  • This blog explains different approaches to Snowflake data ingestion and provides insights on best practices and use cases.
  • To make data ingestion easier in Snowflake, you can opt for third-party data ingestion tools like Hevo Data.
  • It is a zero-code data integration tool with efficient data transformation, security, and ingestion capabilities. 

FAQs 

Q1. Does Snowflake charge for data ingestion? 

During data ingestion, you use Snowflake’s compute resources and are charged according to your usage. Snowflake does not charge for ingress, the initial entry of data, but it does charge for data egress. 

Q2. Can Snowflake handle unstructured data? 

Snowflake does not ingest unstructured data directly into tables; instead, it handles it through external tables or staged files. It stores, governs, and processes unstructured data just as it does structured and semi-structured data.

Customer Experience Engineer, Hevo Data

Dimple is an experienced Customer Experience Engineer with four years of industry proficiency, including the last two years at Hevo, where she has significantly refined customer experiences within the innovative data integration platform. She is skilled in computer science, databases, Java, and management. Dimple holds a B.Tech in Computer Science and excels in delivering exceptional consulting services. Her contributions have greatly enhanced customer satisfaction and operational efficiency.