Making wise business decisions and executing successful initiatives requires strong data accuracy and dependability. When decision-makers have access to accurate and trustworthy data, they can confidently move forward with well-informed strategies. On the other hand, poor analytics driven by inaccurate or unclear data can negatively impact multiple areas of the business.

Reliable data ensures that decisions are meaningful and effective, while unreliable data can lead a company in the wrong direction. That’s why it’s essential to maintain data quality throughout every stage of its lifecycle. As organizations collect and process more data every day, ensuring its quality becomes increasingly challenging.

This is where dbt (Data Build Tool) observability comes in. It helps transform raw data into clean, structured datasets while enabling visibility, testing, and monitoring across your data pipeline.

In this blog, we’ll explore what dbt observability is, why it matters, its key components, and how to implement it step by step.

What is dbt Observability?

In the context of data pipelines, observability refers to the ability to monitor and understand the internal state of a system based on its external outputs, such as metrics, logs, or traces. dbt observability involves monitoring and tracking data transformation results, dbt models, and logs to identify any failures or anomalies in the data pipeline.

As teams increasingly deploy analytics code using dbt Core and dbt Cloud, monitoring and detecting issues early becomes more challenging. This is due to the growing complexity and fragmentation of interconnected data streams across multiple systems. Continuous observability across your jobs, tests, and models is crucial in dbt. It enables teams to quickly identify and resolve issues in the transformation process.

dbt observability helps teams to:

  • Monitor the health and status of dbt models and transformations
  • Track data quality and detect issues such as anomalies, inconsistencies, or missing data
  • Identify inefficiencies or performance bottlenecks in workflows
  • Ensure that data pipelines are accurate, scalable, reliable, and trustworthy

Why is dbt Observability Important?

The need for observability in dbt becomes clear when you encounter questions that you can’t answer in a timely manner. Questions like:

  • Why isn’t my model up to date?
  • Is my data accurate?
  • Why is my model taking so long to run?
  • How can I speed up my dbt pipeline?
  • How should I materialize and provision my model?

Situations like these, especially during periods of data downtime (when your data is partial or incorrect), highlight the importance of observability. If you can’t confidently answer these questions, it often means you lack sufficient visibility into your dbt deployment.

This is where dbt observability steps in:

  • It sends alerts to model owners and stakeholders based on custom criteria, so they are notified when specific models or tests fail or when data sources become unreliable.
  • It surfaces valuable insights that help analytics engineers optimize models and identify pipeline bottlenecks.
  • It collects and acts on metadata in near real time, whether the dbt pipeline run succeeds or fails.

Here are a few key reasons why dbt observability is essential:

  • Builds trust in data: It detects broken or failed models early and flags anomalies in schema, freshness, or volume.
  • Enables faster debugging: It tracks errors using lineage and logs, helping pinpoint the root cause of failures.
  • Improves performance: It monitors long-running models, optimizes slow queries, and tracks historical build times.
  • Supports compliance and auditing: It maintains a full history of test results, changes, and execution logs, offering transparency across teams and stakeholders.

Key Components of dbt Observability

To truly benefit from dbt observability, it’s important to focus on the following key areas:

  • Run Logs: These provide visibility into the success or failure of individual transformations. Logs typically include information such as execution time, status (success or failure), and error messages when something goes wrong.
  • Tests: Users can define tests, such as checks for null values or uniqueness, for their models. These tests run automatically as part of the dbt pipeline and their results offer a clear picture of data quality and potential issues.
  • Monitoring Dashboards: Integrating dbt with monitoring and observability tools enables the creation of dashboards that display real-time metrics like success rates and execution times.
  • Automatic Alerts: Setting up alerts helps teams identify issues early. Alerts are commonly delivered through channels such as Slack or email.
  • Model Version Tracking: Observability tools can also track which versions of dbt models are running, allowing teams to correlate issues with recent codebase changes.
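To make the Tests component concrete, here’s what a couple of built-in dbt tests might look like in a schema.yml file (the orders model and its column names are placeholders):

```yaml
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique      # no duplicate order ids
          - not_null    # every row must have an id
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
```

Running dbt test executes these checks as part of the pipeline, and their pass/fail results feed directly into the run logs and dashboards described above.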

Implementing dbt Observability

Here is a straightforward, step-by-step guide to setting up dbt monitoring and observability in your data warehouse. Before you can gain insights into the behavior of dbt results over time, you first need to record the outcome of each run and test in a table. We’ll walk through the following steps to accomplish that:

  • Determine which metadata about your dbt results to collect.
  • Create a table to store the results.
  • Implement a dbt macro to parse the dbt results.
  • Create an on-run-end hook that uploads the parsed results to the table.
  • Update your dbt jobs.
  1. Internally, dbt represents your project as a graph, with each node containing metadata about a resource. Your project’s nodes include models, tests, sources, seeds, snapshots, exposures, and analyses. The run result fields describe the outcome of each execution: if you use the dbt run command to execute ten models, you will receive ten result objects. As a starting point, the following fields are a good place to begin:
  • Run result fields include status, execution time, and rows affected. 
  • dbt graph node fields include unique id, database, schema, name, and resource type.
    2. When a resource is executed in dbt, it produces a result object that holds a variety of metadata about that execution, including its timing. At the end of an invocation, dbt stores these objects in a file called run_results.json. The first step in logging the dbt results is to study the result object and construct an empty table with the fields you want to record.
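    For reference, a heavily trimmed and simplified run_results.json entry might look like this (the invocation id, project, and model names are purely illustrative):

```json
{
  "metadata": {
    "invocation_id": "example-invocation-uuid"
  },
  "results": [
    {
      "unique_id": "model.my_project.orders",
      "status": "success",
      "execution_time": 1.87,
      "adapter_response": {
        "rows_affected": 1024
      }
    }
  ],
  "elapsed_time": 2.31
}
```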

    Here’s an example model definition using an empty select query and a schema that corresponds to the fields we consider useful to store. The model’s name matters in the following steps, so note that we’ve called it dbt_results.

    {{
      config(
        materialized = 'incremental',
        transient = false,
        unique_key = 'result_id'
      )
    }}

    with empty_table as (
        select
            null as result_id,
            null as invocation_id,
            null as unique_id,
            null as database_name,
            null as schema_name,
            null as name,
            null as resource_type,
            null as status,
            cast(null as float) as execution_time_seconds,
            cast(null as int) as rows_affected,
            cast(null as timestamp) as dbt_run_at
    )

    select * from empty_table
    -- This filter ensures we never actually insert these values
    where 1 = 0
    3. The result object includes numerous fields, some of which are deeply nested. To load the data into the table designed above, we need a macro that flattens the result objects and extracts only the fields we want to store. Here’s a macro that parses the results using the fields given in the table definition above.
    {% macro parse_dbt_results(results) %}
        -- Create a list of parsed results
        {%- set parsed_results = [] %}
        -- Flatten results and add to list
        {% for run_result in results %}
            -- Convert the run result object to a simple dictionary
            {% set run_result_dict = run_result.to_dict() %}
            -- Get the underlying dbt graph node that was executed
            {% set node = run_result_dict.get('node') %}
            {% set rows_affected = run_result_dict.get('adapter_response', {}).get('rows_affected', 0) %}
            {%- if not rows_affected -%}
                {% set rows_affected = 0 %}
            {%- endif -%}
            {% set parsed_result_dict = {
                    'result_id': invocation_id ~ '.' ~ node.get('unique_id'),
                    'invocation_id': invocation_id,
                    'unique_id': node.get('unique_id'),
                    'database_name': node.get('database'),
                    'schema_name': node.get('schema'),
                    'name': node.get('name'),
                    'resource_type': node.get('resource_type'),
                    'status': run_result_dict.get('status'),
                    'execution_time_seconds': run_result_dict.get('execution_time'),
                    'rows_affected': rows_affected
                } %}
            {% do parsed_results.append(parsed_result_dict) %}
        {% endfor %}
        {{ return(parsed_results) }}
    {% endmacro %}
    4. Now that we have a table with the required schema and a macro for extracting the relevant fields, it’s time to put everything together. This macro relies on the results Jinja variable, which is only available in the context of an on-run-end hook. So we need to write a macro that is triggered as an on-run-end hook and performs the following tasks:
    • Accept the results variable as an input parameter.
    • Flatten the results using the macro above.
    • Insert them into the table we created earlier.

    This is how you can do it:

    {% macro log_dbt_results(results) %}
        -- depends_on: {{ ref('dbt_results') }}
        {%- if execute -%}
            {%- set parsed_results = parse_dbt_results(results) -%}
            {%- if parsed_results | length > 0 -%}
                {% set insert_dbt_results_query -%}
                    insert into {{ ref('dbt_results') }}
                        (
                            result_id,
                            invocation_id,
                            unique_id,
                            database_name,
                            schema_name,
                            name,
                            resource_type,
                            status,
                            execution_time_seconds,
                            rows_affected,
                            dbt_run_at
                        ) values
                        {%- for parsed_result_dict in parsed_results -%}
                            (
                                '{{ parsed_result_dict.get('result_id') }}',
                                '{{ parsed_result_dict.get('invocation_id') }}',
                                '{{ parsed_result_dict.get('unique_id') }}',
                                '{{ parsed_result_dict.get('database_name') }}',
                                '{{ parsed_result_dict.get('schema_name') }}',
                                '{{ parsed_result_dict.get('name') }}',
                                '{{ parsed_result_dict.get('resource_type') }}',
                                '{{ parsed_result_dict.get('status') }}',
                                {{ parsed_result_dict.get('execution_time_seconds') }},
                                {{ parsed_result_dict.get('rows_affected') }},
                                current_timestamp
                            ) {{- "," if not loop.last else "" -}}
                        {%- endfor -%}
                {%- endset -%}
                {%- do run_query(insert_dbt_results_query) -%}
            {%- endif -%}
        {%- endif -%}
        -- This macro is called from an on-run-end hook and therefore must return query text to run; returning an empty string does the trick
        {{ return('') }}
    {% endmacro %}

    Also, at the end, don’t forget to add this macro to the dbt_project.yml file as an on-run-end hook.

    on-run-end:
      - "{{ log_dbt_results(results) }}"
    5. Because the macro executes as a hook at the end of each run, the dbt_results table must be created before any other tables. You can either do this once before running your first job, or add it as a quick step to any pre-built jobs you already have:

    dbt run --select dbt_results

    This macro will now run at the end of every dbt command execution, automatically saving the parsed results.

    You can extend this implementation with additional attributes and models that capture all of your dbt project’s metadata in straightforward tables. That means uploading the metadata of every model, source, test, exposure, and metric in addition to the command results. Whenever you change a model or test, you can run a dbt command on each PR to refresh these tables with the latest metadata. On top of these artifact tables, you can build dashboards that display failed tests and execution results, track results with Slack notifications, and much more.
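    As a concrete sketch, once the dbt_results table is populated, a query along these lines can power such a dashboard; it is written in generic SQL, so the date arithmetic may need adjusting for your warehouse:

```sql
-- Slowest and least reliable models over the past week
select
    name,
    count(*) as total_runs,
    sum(case when status != 'success' then 1 else 0 end) as failed_runs,
    avg(execution_time_seconds) as avg_execution_time_seconds
from dbt_results
where resource_type = 'model'
  and dbt_run_at >= current_date - 7
group by name
order by avg_execution_time_seconds desc
limit 10;
```

    Models that surface at the top of this list are natural candidates for query optimization or a different materialization strategy.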

    Conclusion

    dbt observability is essential for maintaining high-quality data pipelines. It enables data teams to track performance, detect issues early, and avoid major disruptions. This blog explored why observability matters and walked through a hands-on way to implement it. With robust dbt observability in place, teams can stay on top of their data pipelines and extract maximum value from their data. Whether you’re a data scientist, analyst, or engineer, observability helps optimize models and identify bottlenecks across your workflows.

    Looking to streamline your data transformations even further? Try Hevo Transformer – our native integration with dbt Core helps you transform data faster, with better visibility and less hassle.

    Frequently Asked Questions

    1. What is dbt observability?

    It’s the ability to monitor and understand dbt runs, tests, freshness, and model performance.

    2. How do I check if a dbt model failed?

    Inspect the run_results.json file to see status, errors, and execution times.
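    If you prefer to automate the check, here is a small Python sketch; in a real project the dictionary would be loaded from target/run_results.json, but the inline sample below (with made-up node names) keeps the example self-contained:

```python
def failed_nodes(run_results: dict) -> list:
    """Return the unique_id of every node whose status is not success/pass."""
    return [
        r["unique_id"]
        for r in run_results.get("results", [])
        if r.get("status") not in ("success", "pass")
    ]

# Trimmed inline stand-in for json.load(open("target/run_results.json"))
sample = {
    "results": [
        {"unique_id": "model.my_project.orders", "status": "success"},
        {"unique_id": "model.my_project.customers", "status": "error"},
        {"unique_id": "test.my_project.unique_orders_order_id", "status": "pass"},
    ]
}

print(failed_nodes(sample))  # ['model.my_project.customers']
```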

    3. How can I identify slow models?

    Filter run_results.json for models with long execution times and optimize them.
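    Along the same lines, a short Python sketch (again using a made-up inline sample instead of reading target/run_results.json) can rank models by execution time:

```python
def slowest_models(run_results: dict, top_n: int = 5) -> list:
    """Return (unique_id, execution_time) for the slowest models, longest first."""
    timings = [
        (r["unique_id"], r.get("execution_time", 0.0))
        for r in run_results.get("results", [])
        if r["unique_id"].startswith("model.")  # ignore tests, seeds, etc.
    ]
    return sorted(timings, key=lambda t: t[1], reverse=True)[:top_n]

# Trimmed inline stand-in for the real run_results.json contents
sample = {
    "results": [
        {"unique_id": "model.my_project.orders", "execution_time": 12.4},
        {"unique_id": "model.my_project.customers", "execution_time": 3.1},
        {"unique_id": "test.my_project.unique_orders_order_id", "execution_time": 0.2},
    ]
}

print(slowest_models(sample, top_n=2))
# [('model.my_project.orders', 12.4), ('model.my_project.customers', 3.1)]
```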

    Maria Asghar
    Research Analyst

Maria is a Machine Learning and Big Data graduate passionate about data analysis and machine learning. Skilled in data preprocessing and visualization using SQL, Python, and various libraries, she excels in applying machine learning algorithms to solve business problems and improve decision-making. In her free time, Maria enjoys exploring data science trends and sharing her insights with others.