The traditional approach to data transformation, with long SQL scripts, hard-to-trace logic, and lots of manual documentation, has always felt like wrestling a monolith. dbt flipped this paradigm with features like Jinja templating, the ref() function for model dependency tracking, and built-in documentation and testing. dbt lets you write clean, reproducible, and scalable transformation logic directly within the data warehouse, treating it like code: version-controlled, modular, and testable.

By encouraging a layered and modular approach to transformations, dbt ensures your data pipelines are easier to scale, debug, and maintain. Think of it as designing your data like software: many small, composable models instead of one massive SQL query. That is why dbt data modeling has become the backbone of any modern analytics workflow that aims for clarity, reproducibility, and agility.

Whether you’re just dipping your toes into dbt or you’ve been hacking together models for months, this guide will walk you through the foundations of dbt data modeling, and more importantly, how to do it right.

Here’s a sneak peek at what’s coming up:

  • What dbt data modeling really is, and how it differs from the old-school way of doing things
  • How to set up a dbt project and understand the dbt project structure
  • dbt's four-layered architecture
  • dbt best practices
  • How to use dbt tests to catch data quality issues before they make it to production
  • Documentation, collaboration, and deployment, along with common pitfalls and how to avoid them

So, let's dive right in, shall we?

What is dbt Data Modeling?

dbt data modeling is the process of transforming raw data into clean, structured, and analytics-ready datasets using SQL governed by software engineering principles, all within your data warehouse. dbt helps you enforce a modular approach by breaking transformations into layers and chaining models together using the ref() function, on top of data modeling principles. This lets dbt understand dependencies and build a DAG (Directed Acyclic Graph) of your data models, helping you run only what’s necessary and keep your pipeline efficient.
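To make the chaining concrete, here is a minimal two-model sketch (the source, model, and column names are illustrative): the staging model reads from a declared source, the downstream model references it with ref(), and that reference is exactly the edge dbt records in the DAG.

-- models/staging/stg_orders.sql
SELECT
  id AS order_id,
  customer_id,
  amount
FROM {{ source('shop', 'orders') }}

-- models/marts/fct_customer_orders.sql
SELECT
  customer_id,
  COUNT(order_id) AS order_count,
  SUM(amount) AS lifetime_value
FROM {{ ref('stg_orders') }}
GROUP BY customer_id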

If you are still wondering why you should consider data modeling with dbt rather than old-school data transformation approaches, here's why:

  • With dbt, you can split logic across multiple models and reuse it without duplicating SQL.
  • It fits naturally into Git workflows, just like any other project files.
  • Write test cases to catch issues like nulls, duplicates, and schema changes early in the workflow.
  • It auto-generates rich, browsable documentation.
  • Write dynamic SQL and parameterize logic using Jinja templating (see the sketch after this list).
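As a quick taste of that last point, here is a minimal sketch of Jinja inside a model, pivoting payment amounts by method (the stg_payments model and the payment_method values are illustrative):

SELECT
  order_id,
  {% for method in ['bank_transfer', 'credit_card', 'gift_card'] %}
  SUM(CASE WHEN payment_method = '{{ method }}' THEN amount ELSE 0 END)
    AS {{ method }}_amount {{ "," if not loop.last }}
  {% endfor %}
FROM {{ ref('stg_payments') }}
GROUP BY order_id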

Getting Started with dbt

Let’s break down the dbt project setup step by step.

1. Installing dbt

First, you need to install dbt on your local machine. Depending on your environment, you can use pip for Python-based environments or install dbt through Homebrew if you’re on macOS.

For example, for dbt with BigQuery:

pip install dbt-core dbt-bigquery

Or for Snowflake:

pip install dbt-core dbt-snowflake

You can find the full list of supported adapters on the official dbt website.

2. Running dbt init

After you install, go to your working directory and initialize the dbt project by running the following command:

dbt init your_project_name

For example:

dbt init hevo_data_project

The above command creates the scaffold of your dbt project. The following is the basic structure of a newly initialized project (a directory sketch follows the list):

  • models/ : where your SQL models will live
  • dbt_project.yml : the project's master configuration file
  • profiles.yml : your warehouse connection details (stored separately in ~/.dbt, not inside the project folder)
  • tests/, macros/, snapshots/ : optional directories for advanced features
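For orientation, a freshly initialized project typically looks something like this; the exact folders and starter files vary slightly between dbt versions:

hevo_data_project/
├── dbt_project.yml
├── README.md
├── analyses/
├── macros/
├── models/
│   └── example/
├── seeds/
├── snapshots/
└── tests/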

What Happens When You Run dbt init

So, you’re at your terminal and you type:

dbt init hevo_data_project

After running the above command, you’ve just kicked off your very first dbt project. But what’s really happening behind the scenes? Let’s walk through it together.

A New Project Folder is Created

Running dbt init creates a folder named after your project, in this case hevo_data_project. This will be your main project directory. Inside it, dbt sets up all the essential files and folders you'll need.

Core Project Files Are Added

Here’s what you’ll see inside:

  • dbt_project.yml
    This file is like your project’s blueprint. It defines your project’s name, where your models live, and other configs.
  • models/ directory
    This folder is where all your dbt models live. You’ll find a simple starter model here (example.sql) to help you test that things are working.

You’ll Be Asked About Your Data Warehouse

As part of dbt init, dbt will prompt you to choose your data warehouse, like BigQuery, Snowflake, Redshift, Postgres, or others. Based on your selection, it generates a connection profile.

This connection info doesn’t live in the project folder. It’s stored separately in a file called profiles.yml.

You’re Guided Through Basic Setup

dbt asks you a few basic questions like:

  • What’s your target schema or database?
  • How should dbt authenticate with your warehouse?
  • Which development environment are you using?

Once done, your project is fully bootstrapped and ready for action.

3. Setting Up profiles.yml

dbt uses a separate file called profiles.yml to manage your database connection. It’s stored in your home directory, usually at ~/.dbt/profiles.yml

This is a sample Snowflake warehouse connection profile:

hevo_data_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: your_account
      user: your_user
      password: your_password
      role: your_role
      database: your_database
      warehouse: your_warehouse
      schema: analytics

Once your profile is in place, run the following command to test the connection and confirm that everything is wired up correctly.

dbt debug

The 4-Layer Structure of a dbt Project

Here’s a breakdown of the four commonly used layers in a dbt project:

1. Source Layer (sources.yml)

This is where you declare your raw data: the actual tables or views in your data warehouse. You should never use source tables directly in your marts; always route them through the staging layer for control and consistency.

Example:

version: 2

sources:
  - name: stripe
    database: raw
    schema: stripe_data
    tables:
      - name: payments

Why does it matter?

  • Sources are versioned, documented, and testable.
  • You can apply freshness checks to detect stale data.
  • Helps you separate internal models from external dependencies.

2. Staging Layer (stg_ prefix)

This is your cleaning zone, where you rename columns, fix types, and apply light filters, all while keeping the shape of the data close to the source.

Example:

SELECT
  id AS payment_id,
  customer_id,
  CAST(amount AS FLOAT) AS amount_usd,
  status
FROM {{ source('stripe', 'payments') }}

Best practices (a combined sketch follows this list):

  • Prefix with stg_ (e.g., stg_stripe__payments)
  • Use CAST, LOWER, TRIM, or SAFE_CAST to clean data
  • Rename columns to snake_case for consistency
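Putting those practices together, a slightly fuller staging model might look like this (the column names beyond the earlier example are illustrative):

-- models/staging/stg_stripe__payments.sql
SELECT
  id AS payment_id,
  customer_id,
  CAST(amount AS FLOAT) AS amount_usd,
  LOWER(TRIM(status)) AS payment_status,
  CAST(created AS TIMESTAMP) AS created_at
FROM {{ source('stripe', 'payments') }}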

3. Intermediate Layer

This is where you can write reusable joins, window functions, or aggregations that support your business logic.

Example:

SELECT
  customer_id,
  COUNT(payment_id) AS total_payments
FROM {{ ref('stg_stripe__payments') }}
GROUP BY customer_id

When to use it?

  • You have a transformation that will be reused in multiple marts
  • You’re dealing with complex joins or calculations
  • You want to isolate business logic from source structure

4. Mart Layer (fct_, dim_ prefix)

This is the curated layer, with business-ready data used for reporting and analytics. It includes dimensions (dim_) and facts (fct_). These models are used in dashboards and reports. Accuracy and naming conventions here are critical.

Examples:

  • fct_user_activity
  • dim_customer
  • fct_orders_daily

Here’s where you apply business logic like the following (a sketch follows this list):

  • Time-based metrics (e.g., rolling averages)
  • Case statements
  • Revenue aggregations
  • Joining multiple sources
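As a sketch of what a simple fact model in this layer might look like (the upstream model, columns, and status value are illustrative):

-- models/marts/fct_orders_daily.sql
SELECT
  DATE(order_created_at) AS order_date,
  COUNT(order_id) AS total_orders,
  SUM(CASE WHEN order_status = 'completed' THEN amount_usd ELSE 0 END) AS completed_revenue_usd
FROM {{ ref('stg_shop__orders') }}
GROUP BY 1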

Best Practices for Writing dbt Models

Let’s go through some practical best practices every dbt developer should follow:

  1. Use ref() Instead of Hardcoding Table Names

You should always use Jinja with ref(), like {{ ref('model_name') }}, to refer to other models.

SELECT * FROM {{ ref('stg_orders') }}
  2. One-to-One Model Responsibility

Each dbt model should focus on a single logical transformation. If you have a massive, tangled SQL model with dozens of joins and subqueries, break it into smaller models and chain them with ref().

  3. Consistent Naming Conventions

Use clear prefixes:

  • stg_ : for staging models
  • int_ : for intermediate logic
  • dim_ : for dimension tables
  • fct_ : for fact tables

And use double underscores (__) to separate source and entity.

Example:

stg_stripe__payments
fct_orders__daily
  4. Add Descriptions for Columns and Models

This improves the documentation auto-generated by dbt docs generate. Don’t skip it; it helps your team and your future self.

columns:
  - name: customer_id
    description: "Unique identifier for each customer"
  5. Materializations

Think before you choose the materialization for your dbt models (an incremental configuration sketch follows this list):

  • view: for lightweight staging models
  • table: for marts with heavy transformations
  • incremental: for large datasets that don’t need a full refresh
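A minimal configuration sketch for the incremental case, assuming a hypothetical stg_app__events model with an event_timestamp column; the config block sets the materialization, and is_incremental() limits each run to new rows:

{{ config(materialized='incremental', unique_key='event_id') }}

SELECT
  event_id,
  user_id,
  event_timestamp
FROM {{ ref('stg_app__events') }}
{% if is_incremental() %}
  -- only process rows newer than what this table already contains
  WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}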
  6. DRY with Macros & Jinja

If you have repeating logic like handling nulls or formatting timestamps, create a macro in the macros folder and use it across models.

{{ format_timestamp('order_created_at') }}
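Note that format_timestamp is not built into dbt; it stands for a macro you would define yourself, for example in macros/format_timestamp.sql. A minimal sketch (the exact casting logic depends on your warehouse):

{% macro format_timestamp(column_name) %}
    CAST({{ column_name }} AS TIMESTAMP)
{% endmacro %}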

Testing and Validating Your dbt Data Models

One of dbt’s superpowers is automated quality checking. It helps you catch issues early and makes your data transformation workflow trustworthy.

Let us go through some ways to get the most out of dbt’s testing and validation features:

1. Built-in Generic Tests

dbt comes with predefined generic tests that are easy to set up in your schema.yml files. These help ensure basic data integrity.

Common ones include:

  • not_null: Ensure a column doesn’t have nulls
  • unique: Ensure a column has unique values
  • accepted_values: Ensure a column only contains specific values (sketched after the example below)

Example:

columns:
  - name: customer_id
    tests:
      - not_null
      - unique
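The accepted_values test additionally takes the list of allowed values; a sketch for a hypothetical status column:

columns:
  - name: status
    tests:
      - accepted_values:
          values: ['pending', 'completed', 'refunded']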

Running tests is even simpler than writing them; just execute the following command:

dbt test

2. Custom Tests

For more complex checks, like referential integrity or business rules, you can write custom SQL tests in the tests/ folder.

Example: Check for future timestamps

SELECT *
FROM {{ ref('stg_orders') }}
WHERE order_date > CURRENT_DATE

3. Source Freshness Checks

You can set freshness rules on your source data to ensure it’s up to date. This is great for detecting stale upstream tables.

Example in sources.yml:

freshness:
  warn_after: {count: 24, period: hour}
  error_after: {count: 48, period: hour}
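For context, this freshness block sits under a specific source table in sources.yml, usually together with a loaded_at_field that tells dbt which timestamp column to compare against; a sketch reusing the earlier Stripe source (the _loaded_at column name is illustrative):

version: 2

sources:
  - name: stripe
    database: raw
    schema: stripe_data
    tables:
      - name: payments
        loaded_at_field: _loaded_at
        freshness:
          warn_after: {count: 24, period: hour}
          error_after: {count: 48, period: hour}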

Then run:

dbt source freshness

4. Validating Model Dependencies

There are times when you need to validate the dependencies in your dbt workflow. Run the following commands in sequence:

#step 1
dbt deps
#step 2
dbt run
#step 3
dbt test

Running these in order verifies that:

  • Your DAG is intact
  • Models run without breaking
  • Data quality is maintained at every step

Documentation & Collaboration in dbt

Hold on, there is yet another amazing built-in feature in dbt, and yes, you guessed it right: automatic documentation. Here’s how to turn your dbt project into a well-documented, collaborative powerhouse.

Descriptive Metadata in schema.yml

Add descriptions to each model and column to make your dbt-generated docs rich and readable.

Example:

models:
  - name: fct_orders
    description: "Fact table containing one row per order."
    columns:
      - name: order_id
        description: "Unique identifier for the order."

Generate Interactive Documentation

Run the following command:

dbt docs generate

Then run the following command to launch a local web interface showing your DAG (Directed Acyclic Graph), model descriptions, sources, tests, and dependencies: basically, a visual map of your data logic.

dbt docs serve

Comment Your SQL

Even though dbt encourages modular SQL, sometimes logic gets tricky. Don’t shy away from using inline comments:

-- Remove test orders
WHERE order_status != 'TEST'

Collaborate via Git

dbt projects are just SQL and YAML files, which means they can be version-controlled like any other software project and follow standard SDLC (Software Development Life Cycle) principles, as sketched after this list:

  • Review pull requests
  • Track changes over time
  • Enforce naming/test standards via code reviews
  • Tag version of dbt projects
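A typical flow, assuming a shared Git remote and CI that runs dbt on pull requests (branch, file, and tag names are illustrative), might look like this:

git checkout -b feature/fct-orders-daily
# ...edit models, then run dbt run and dbt test locally...
git add models/marts/fct_orders_daily.sql models/marts/schema.yml
git commit -m "Add fct_orders_daily mart with tests"
git push origin feature/fct-orders-daily
# open a pull request, get a review, merge to main, then tag the release
git tag v1.2.0
git push origin v1.2.0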

Deployment & Orchestration: Making Your dbt Models Production-Ready

Now that you know how to create, test, and document dbt data models, it’s time to automate the workflow. Let us discuss how to deploy dbt in production and keep it running on schedule.

dbt Core vs dbt Cloud:

  • dbt Core is the open-source CLI version. You’ll need to set up orchestration yourself using tools like Airflow, Dagster, or GitHub Actions.
  • dbt Cloud is the managed SaaS version. It handles scheduling, CI/CD, logging, and user permissions out of the box, great for teams that want to move fast.

Both use the same project structure and commands.

Scheduling Jobs

In dbt Cloud, you can schedule jobs directly from the UI with flexible triggers (e.g., hourly, daily, on success/failure).

In dbt Core, integrate with schedulers like:

  • Apache Airflow using the BashOperator or dbt plugin
  • Dagster or Prefect for more modern orchestration
  • GitHub Actions for lightweight CI/CD

Example GitHub Action to run dbt nightly (this sketch assumes a Snowflake adapter and that your CI environment provides a profiles.yml or the equivalent environment variables):

name: nightly_dbt_run
on:
  schedule:
    - cron: '0 2 * * *'  # every day at 2 AM UTC
jobs:
  dbt_run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install dbt-core dbt-snowflake  # match your warehouse adapter
      - run: dbt deps && dbt run && dbt test

Automating CI/CD with dbt

You can enable CI/CD with dbt such that:

  • Every pull request will trigger dbt run + dbt test
  • You can preview changes, catch issues early, and safely merge code into production

You can also use Slim CI in dbt to run only the models impacted by a change (and their downstream dependents), comparing against a previous production manifest passed via the --state flag, saving tons of time.

dbt build --select state:modified+ --state path/to/prod-artifacts

Common Mistakes to Avoid in dbt Data Modeling

  1. Don’t jump straight into building complex marts. Always use a staging layer (stg_ models) to clean, rename, and standardize columns first.
  2. It’s tempting to skip writing tests. Don’t: apply not_null, unique, and accepted_values tests, plus custom business logic checks, liberally.
  3. Don’t put off adding descriptions; well-maintained documentation starts with them.
  4. Views are great for development but can slow down production runs on large datasets. Materialize critical models as tables or incremental models to improve performance.
  5. Always use ref() to reference other models instead of hardcoding table names.
  6. Keep models simple and focused. If a SQL query is too long or complex, break it into smaller intermediate models.
  7. Not monitoring upstream source freshness can cause downstream data to be outdated without your knowledge. Set up freshness checks early on.

Conclusion

dbt has revolutionized the way data teams build and maintain analytical data models by bringing software engineering best practices like modularity, testing, and documentation into SQL-based transformation workflows. By structuring your dbt project into clear layers (source, staging, intermediate, and marts), you set a strong foundation for scalable and maintainable analytics. Leveraging built-in testing, thorough documentation, and automated deployment ensures your data is trustworthy and your pipelines run smoothly. Avoiding common pitfalls will save you time and headaches, making your dbt journey more effective and enjoyable.

If you’re ready to dive deeper, explore dbt source data modelling and consider how ELT tools like Hevo can complement your dbt workflows by streamlining data ingestion and orchestration.

Frequently Asked Questions (FAQs)

What is the difference between staging, intermediate, and mart models in dbt?

Staging models clean and standardize raw data, intermediate models are optional layers for reusable or complex transformations, and mart models contain business logic and are used directly in reports and dashboards.

How do I decide whether to materialize a model as a table or a view in dbt?

Use views for lightweight, fast-running models on small datasets, and tables (or incremental models) for large, performance-critical models.

How do I write and run tests for my dbt models?

Define tests in schema.yml files. Run tests using the dbt test command after models are built.

What causes slow dbt models?

It is generally caused by complex queries, large joins, or inefficient materializations.

How can I optimize dbt model speed?

By breaking queries into smaller models, indexing source tables, choosing appropriate materializations (tables/incremental), and limiting data scanned.

Raju Mandal
Senior Data Engineer

Raju is a Certified Data Engineer and Data Science & Analytics Specialist with over 8 years of experience in the technical field and 5 years in the data industry. He excels in providing end-to-end data solutions, from extraction and modeling to deploying dynamic data pipelines and dashboards. His enthusiasm for data architecture and visualization motivates him to create informative technical content that simplifies complicated concepts for data practitioners and business leaders.