Data transformation is at the heart of modern analytics engineering, and dbt (Data Build Tool) has become a go-to solution for managing ELT pipelines. If you’re exploring how to set up staging in dbt, understanding the role of staging models and why they matter is crucial. In this post, we’ll briefly explain what dbt is, dive into what “staging” means and why it’s essential, and then walk you through setting up staging in both dbt Core and dbt Cloud environments.

What is dbt?

dbt (Data Build Tool) is an open-source tool designed for data transformation within modern data warehouses. Unlike traditional ETL (Extract, Transform, Load), dbt adopts an ELT (Extract, Load, Transform) approach, where raw data is loaded into a warehouse and then transformed using SQL-based models.

It enables data analysts and engineers to create modular, version-controlled, and testable transformations while leveraging the computational power of cloud data warehouses like Snowflake, BigQuery, and Redshift.

What is Staging?

Staging is the first layer in your dbt project, where raw data from source systems is lightly transformed to create a clean, standardized foundation for further processing.

Why is Staging Important?

  • Data Consistency: Standardizes formats and naming conventions across disparate data sources.
  • Simplified Transformations: By handling initial cleaning and normalization early, later layers can focus on business logic.
  • Improved Data Quality: Early testing and validation catch errors before they propagate through your data pipeline.

Think of staging as the “pre-flight check” for your data—ensuring everything is in order before it takes off into more complex transformations.

dbt Staging Models and the Staging Layer

dbt Staging Models

In a dbt project, staging models are SQL files that reside in a dedicated directory (commonly named models/staging/). These models perform the initial cleaning and transformation of your raw data. Here are some key tasks performed by staging models:

  • Column Renaming: Enforce consistent naming conventions.
  • Data Type Casting: Convert data types to ensure consistency (e.g., converting string dates to actual date types).
  • Filtering: Remove any records that don’t meet certain quality or relevance criteria.
  • Basic Calculations: Sometimes simple derivations or aggregations are performed to simplify later transformations.

Example: A Simple Staging Model

Below is a simple example of a staging model for an orders table.

This model reads raw data using the {{ source() }} function, cleans up the data by casting a date field, and selects the required columns:

-- models/staging/stg_orders.sql

with source as (
    select
        order_id,
        customer_id,
        order_date,
        total_amount
    from {{ source('raw', 'orders') }}
)

select
    order_id,
    customer_id,
    cast(order_date as date) as order_date,
    total_amount
from source

This lays the groundwork for all further analyses by ensuring that data from the orders table is in a consistent and usable format.
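
Staging models frequently also rename columns and filter out obviously bad records. As a second, purely illustrative sketch, a stg_customers model might look like the following; the column names in the raw customers table are assumptions:

-- models/staging/stg_customers.sql (illustrative)

with source as (
    select * from {{ source('raw', 'customers') }}
)

select
    id as customer_id,                      -- rename to the project-wide convention
    lower(email) as email,                  -- standardize casing
    cast(created_at as date) as signup_date,
    country_code
from source
where id is not null                        -- drop records missing a primary key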

The Staging Layer

The staging layer is the logical grouping of these staging models. By centralizing early data transformations, it ensures that subsequent models have a reliable and consistent base to build upon.
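
Many teams also configure the whole staging layer in one place in dbt_project.yml, for example materializing every staging model as a view. This is a common convention rather than a requirement; my_project below is a placeholder for your project name:

# dbt_project.yml (excerpt)
models:
  my_project:
    staging:
      +materialized: view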

How to Set Up Staging

Below we detail the steps to set up staging in both dbt Core and dbt Cloud.

How to Set Up Staging in dbt Core

For those using dbt Core on their local machine, setting up a staging layer involves a few essential steps. Below is a detailed walkthrough:

Prerequisites:

  • dbt Core Installation: Ensure you have dbt installed on your local machine.
  • Data Warehouse Connection: Configure your profiles.yml to connect to your data warehouse (e.g., Snowflake, BigQuery, Redshift); a minimal example follows this list.
  • Initial Project Setup: Run dbt init my_project to create a new dbt project if you haven’t already.
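
For reference, a profiles.yml entry for a Snowflake connection might look like the sketch below. The profile name, account, credentials, and database objects are placeholders; adjust the type and fields for your own warehouse.

# ~/.dbt/profiles.yml -- all values below are placeholders
my_project:
  target: dev
  outputs:
    dev:
      type: snowflake                             # or bigquery, redshift, etc.
      account: your_account_id
      user: your_username
      password: "{{ env_var('DBT_PASSWORD') }}"   # keep secrets in environment variables
      role: transformer
      database: analytics
      warehouse: transforming
      schema: dbt_dev
      threads: 4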

Steps

1. Create a dedicated folder for staging models: models/staging/.
2. Create a YAML file (e.g., models/staging/sources.yml) to define your raw data sources.

version: 2

sources:
  - name: raw        # if no schema is specified, dbt uses the source name as the schema
    tables:
      - name: orders
      - name: customers

3. Write SQL files within models/staging/ to clean and standardize your data.
4. Use the {{ source() }} function to reference raw tables.
5. Execute dbt run to build your staging models.
6. Finally, use dbt test to validate data quality and ensure that your transformations are working as expected; a minimal example of test definitions follows below.
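
Those tests live in a YAML file alongside your models. A minimal sketch for the stg_orders model might look like this (the file name and the tests shown are illustrative, not prescriptive):

# models/staging/schema.yml
version: 2

models:
  - name: stg_orders
    description: "Orders cleaned and standardized from the raw source"
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: order_date
        tests:
          - not_null

Running dbt test then checks that every order_id is present and unique and that no order_date is missing.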

How to Set Up Staging in dbt Cloud

dbt Cloud offers an intuitive, web-based interface with additional features such as scheduling, an integrated development environment (IDE), and detailed logging. Here’s a step-by-step guide to setting up staging in dbt Cloud:

Prerequisites

  • dbt Cloud Account: Sign up for a dbt Cloud account if you haven’t already.
  • Repository Integration: Ensure that your dbt project is hosted in a Git repository (e.g., GitHub, GitLab, or Bitbucket) and connected to your dbt Cloud account.
  • Data Warehouse Credentials: Have your data warehouse credentials handy to configure the connection within dbt Cloud.

Steps

1. Log in to your dbt Cloud account.
2. From the dashboard, click “Create Environment”.
3. Select “Staging” as the deployment environment type under General Settings.
4. Under Connections, click “Add new connection” and select your warehouse. This opens a new tab where you can enter your data warehouse credentials.
5. If you have a Git repository linked, specify which branch dbt Cloud should use; otherwise, click “Save”.
6. Ensure your project has a dedicated folder for staging models (typically models/staging/).

Additional Steps for dbt Cloud Staging (Optional)

Edit and Commit Your Staging Models:

  • You can either use the built-in IDE in dbt Cloud or your preferred local code editor.
  • Commit your changes to Git. dbt Cloud will automatically sync the changes from your repository.

Set Up Job Scheduling and Monitoring:

  • Navigate to the “Jobs” section in dbt Cloud.
  • Create a new job and configure it to run your staging models on a regular schedule; typical job commands are sketched below. Schedule the job to run during off-peak hours to minimize the impact on your data warehouse.
  • After scheduling, monitor the job runs via the dbt Cloud dashboard.
  • Use the detailed logs to identify any issues during model execution.
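
The exact commands a job runs are up to you; assuming the models/staging/ layout used in this post, a staging job might simply run:

dbt run --select models/staging
dbt test --select models/staging

Path-based selection like this limits the job to the models in the staging folder rather than the whole project.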

Best Practices for Setting Up Staging

Whether you’re using dbt Core or dbt Cloud, adhering to best practices can dramatically improve your staging layer’s effectiveness and maintainability:

  • Consistent Naming Conventions: Use clear, descriptive names for your staging models (e.g., stg_orders, stg_customers). Consistency makes it easier for team members to navigate the project.
  • Document Everything: Leverage dbt’s built-in documentation features. Write detailed descriptions in your YAML files for each model and source. Good documentation speeds up onboarding and troubleshooting.
  • Write Tests Early: Implement tests on your staging models to catch data quality issues as soon as possible. For example, add tests to ensure that key columns aren’t null or that dates are in the correct format.
  • Modular Design: Break down complex transformations into smaller, manageable models. This modular approach not only simplifies debugging but also promotes code reuse.
  • Version Control: Keep your dbt project in a Git repository to track changes, collaborate effectively, and roll back if needed.
  • Performance Monitoring: Regularly review the performance of your staging queries. Optimize SQL where necessary to ensure efficient data processing, which is especially important for large datasets.
  • Incremental Loading: For large or growing datasets, consider using incremental models to process only new or updated records rather than reprocessing the entire dataset every time (a minimal sketch follows this list).
  • Environment Parity: Maintain consistency between your local development environment (dbt Core) and your production environment (dbt Cloud). This ensures that changes are tested thoroughly before deployment.
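
For the incremental-loading point above, here is a minimal sketch of how a staging model could be materialized incrementally. It assumes the raw orders table carries an updated_at timestamp; the model and column names are illustrative:

-- models/staging/stg_orders_incremental.sql (illustrative)
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    customer_id,
    cast(order_date as date) as order_date,
    total_amount,
    updated_at
from {{ source('raw', 'orders') }}

{% if is_incremental() %}
  -- on incremental runs, only pull rows newer than what is already in this model
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}

On the first run dbt builds the full table; on subsequent runs the is_incremental() block restricts the scan to new or updated rows.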

Conclusion

Setting up a robust staging layer is an essential first step in building a dependable analytics pipeline with dbt. By focusing on the early transformation of raw data, staging creates a clean, consistent base that downstream models can rely on for accurate insights. Whether you prefer the flexibility of dbt Core or the managed environment of dbt Cloud, the process involves careful organization, clear source definitions, diligent testing, and thoughtful scheduling.

In this blog, we explored:

  • What dbt is and why it’s transforming the way data teams work.
  • The importance of staging, ensuring that your raw data is cleaned and standardized before further transformations.
  • How to create staging models, with examples to illustrate the typical transformations.
  • Step-by-step instructions for setting up staging in both dbt Core and dbt Cloud, including prerequisites, detailed steps, and monitoring strategies.
  • Best practices to ensure your staging layer remains robust, maintainable, and scalable.

By following these guidelines and utilizing the features of dbt, whether through local development or in the cloud, you can significantly enhance your data pipeline’s reliability and performance. A well-implemented staging layer not only improves data quality but also sets the stage for all subsequent analytical work, empowering your organization to make data-driven decisions with confidence.

Take Your Data Transformations to the Next Level with Hevo Transformer

Looking for a seamless way to transform your data without writing complex SQL scripts? Hevo Transformer integrates with dbt Core, enabling you to build, orchestrate, and manage transformations effortlessly. Try Hevo Transformer today and streamline your data workflows like never before!

Happy transforming, and may your data always be clean!

Sarang Ravate
Senior Software Engineer

Sarang is a skilled Data Engineer with over 5 years of experience, blending his expertise in technology with a passion for design and entrepreneurship. He thrives at the intersection of these fields, driving innovation and crafting solutions that seamlessly integrate data engineering with creative thinking.