Everything You Need to Know About dbt Packages: A Comprehensive Guide

Last Modified: October 13th, 2023

Organizations are adopting advanced analytics tools and technologies to make sense of their data as the world becomes more data-driven. dbt, an open-source tool that converts raw data into analytics-ready datasets, has become popular in recent years. One of dbt's significant benefits is the ability to create and distribute reusable code, including SQL queries, macros, and models, through dbt packages.

This article will explore dbt packages, including their definition, functionality, and benefits for data teams. Regardless of your experience with dbt, this article will help you understand the value of dbt packages and how they can simplify your data transformation process.

What are dbt Packages?

Software developers often break down their code into smaller modules called libraries. These libraries assist programmers in working more efficiently by allowing them to concentrate on their specific business logic rather than having to write code that someone else has already spent time refining.

In dbt, these libraries are called packages. Many of the analytical challenges you face are common across organizations, and dbt packages let you solve them once and reuse the solution, which is what makes them so powerful and useful.

The central location for dbt packages is the dbt Package hub (hub.getdbt.com), maintained by dbt Labs. From there, you can conveniently import packages into your dbt project and keep them updated.

Types of dbt Packages

There are two types of dbt packages:

  • Internal packages: These packages are created and managed within your own dbt project. They are used to organize and modularize your project, making it easier to manage and maintain over time. Internal packages can contain any code or resources that are confidential and specific to your project.
  • External packages: These packages are created and maintained by third-party developers or other dbt users and can be shared and installed across multiple projects. They are published to a public registry, such as the dbt Package hub, and are installed and managed with the dbt deps command. External packages can contain a wide range of resources, including macros, tests, and models, and can be used to extend the functionality of dbt or solve specific data modeling challenges.

Importance of dbt Packages

  • Reusability: Packages allow you to create modular, reusable code that can be shared across multiple dbt projects. This can save you time and effort, as you don’t have to recreate the same code or logic for each project. It also promotes consistency and standardization across your organization’s data models.
  • Consistency: Packages can help you maintain consistency across your data models, allowing you to define and reuse common business logic, data transformations, and tests. This ensures your data models are consistent, accurate, and up-to-date across different projects and teams.
  • Collaboration: Packages can be created and shared by different members of your organization or the dbt community, allowing you to benefit from the collective knowledge and experience of others. This promotes collaboration and knowledge-sharing and can help your organization solve data modeling challenges more efficiently.
  • Scalability: As your organization grows and your data needs become more complex, packages can help you scale your data models more effectively. By creating reusable code and resources, you can avoid manually updating and maintaining every data model as your data sources and business requirements change.
  • Versioning: Packages can be versioned, allowing you to track changes and updates to your code over time. This lets you easily roll back to previous versions if necessary and helps prevent data modeling errors or inconsistencies. Version requirements are declared in packages.yml, as sketched below.
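
For example, a minimal packages.yml sketch that pins a package to a version range (the package and range shown here are illustrative):

packages:
  # Accept any compatible 1.x release of dbt_utils, but never 2.x
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]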

How to Install dbt Packages?

To install dbt packages, follow these steps:

  • Step 1: Create a “packages.yml” file in your dbt project.
  • Step 2: Enter the packages you need and their version number. For example,
packages:
  - package: dbt-labs/snowplow
    version: 0.7.0

  - git: "https://github.com/dbt-labs/dbt-utils.git"
    revision: 0.9.2

  - local: /opt/dbt/redshift
  • Step 3: Run the “dbt deps” command.

All your packages get installed in the dbt_packages directory by default.
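
If you need them somewhere else, you can override this location in your dbt_project.yml (a minimal sketch; the directory name is just an example):

# dbt_project.yml
# Install packages into a custom directory instead of dbt_packages/
packages-install-path: "custom_dbt_packages"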

There are many dbt packages available for you to choose from. We will discuss a few of them:

  • dbt-utils: It's hard to avoid this package, since it's the most widely used one and a dependency of many other dbt packages. It contains generic tests and macros that apply across dbt projects, such as equal_rowcount, at_least_one, and not_empty_string, as well as SQL generators that make SQL much simpler to write, like date_spine, group_by, pivot, and unpivot. A short usage sketch follows.
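
As an illustration, here is a minimal sketch of the date_spine generator in a model (the model name is hypothetical):

-- models/all_days.sql
-- Generates one row per day between the two dates
{{ dbt_utils.date_spine(
    datepart="day",
    start_date="cast('2020-01-01' as date)",
    end_date="cast('2021-01-01' as date)"
) }}

The generic tests, such as equal_rowcount, are referenced the same way from a model's .yml file, with the package name as a prefix (for example, dbt_utils.equal_rowcount with a compare_model argument).
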
  • codegen: Among its other functions, codegen can generate the .yml entries for all of your sources using:
{{ codegen.generate_source('<schema>', database_name='<database>') }}

Or, make the .yml file for a given model:

{{ codegen.generate_model_yaml(model_name='<model>') }}

Paste these macros into a scratchpad in your IDE and compile or build them. The output is your complete .yml with proper indentation.
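
Alternatively, you can invoke these macros from the command line with dbt run-operation; a minimal sketch (the schema and database names are placeholders):

# Print source .yml entries for every table in a schema
dbt run-operation generate_source --args '{"schema_name": "<schema>", "database_name": "<database>"}'

generate_model_yaml can be called the same way, passing the model name in --args.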

  • audit_helper: This includes macros like:
    • compare_relations: This macro generates SQL that validates two relations row by row. It is largely based on the dbt-utils equality test (a usage sketch follows this list).
    • compare_queries: Very similar to compare_relations, except that it takes two select statements instead of relations.
    • compare_column_values: This macro returns a query that compares a column between two queries and counts how many records match exactly.
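
For instance, here is a minimal compare_relations sketch, assuming a hypothetical fct_orders model being migrated from a legacy schema:

{# The legacy table to compare against #}
{% set old_relation = adapter.get_relation(
      database="analytics",
      schema="old_schema",
      identifier="fct_orders"
) %}

{# The dbt-built version of the same table #}
{% set dbt_relation = ref('fct_orders') %}

{{ audit_helper.compare_relations(
    a_relation=old_relation,
    b_relation=dbt_relation,
    primary_key="order_id"
) }}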

Here are some lesser-known dbt packages you can use:

  • dbt-osmosis: Working effectively with dbt models means managing YAML files, compiling Jinja into SQL, and comparing your models. Even if you are new to the process, you will soon run into challenges around naming conventions and replicating model setups from one folder to another. dbt-osmosis simplifies these tasks. It includes a workbench where you can paste your model code and compile it into SQL, offers comparisons using the compiled SQL, lets you define naming conventions, and even supports real-time querying using Jinja code.
  • dbt-expectations: This adds extra tests to your dbt project. You can check the shape of a table, identify missing values, test ranges and sets, and even combine multiple columns into a single test (a sketch follows this list).
  • dbt-hubspot-source: Data models for Hubspot built using dbt.
  • elementary-data: Open-source data observability for analytics engineers.
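
For example, a minimal dbt-expectations sketch in a model's .yml file (the model and column names are hypothetical):

models:
  - name: stg_orders
    columns:
      - name: order_total
        tests:
          # Fail if any order total falls outside the expected range
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 100000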

Advanced Package Configuration

Here are a few advanced package configurations you might need:

  • Updating: When you modify the version or revision of a package in the packages.yml file, the change does not automatically take effect in your dbt project. You need to run the dbt deps command to update the package. A full refresh of the models in the updated package may also be necessary.
  • Uninstallation: Deleting a package from the packages.yml file will not automatically remove it from the dbt_packages/ directory in your dbt project. To completely uninstall a package, choose one of the following options (see the commands sketched below):
    • Delete the package's directory from the dbt_packages/ directory.
    • Use the dbt clean command to delete all packages and compiled models from the dbt_packages/ directory, then run dbt deps to reinstall the packages your project still needs.
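
For example (a minimal sketch; dbt clean deletes the directories listed under clean-targets in your dbt_project.yml, which typically include dbt_packages/):

# Reinstall packages after editing packages.yml
dbt deps

# Remove installed packages and compiled artifacts, then reinstall
dbt clean
dbt deps
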
  • Configuring Packages: You can configure the models and seeds in a package from the dbt_project.yml file like this:
vars:
  snowplow:
    'snowplow:timezone': 'America/New_York'
    'snowplow:page_ping_frequency': 10
    'snowplow:events': "{{ ref('sp_base_events') }}"
    'snowplow:context:web_page': "{{ ref('sp_base_web_page_context') }}"
    'snowplow:context:performance_timing': false
    'snowplow:context:useragent': false
    'snowplow:pass_through_columns': []

models:
  snowplow:
    +schema: snowplow

seeds:
  snowplow:
    +schema: snowplow_seeds

Any configurations set in the dbt_project.yml file of your dbt project will take precedence over the configurations defined in a package.

Conclusion

There are numerous helpful packages available, and the list above is far from exhaustive. These packages have significantly improved the efficiency and resilience of dbt projects. Because they are open source, you can contribute to a package when it lacks a feature, or create a new one for the community to use. Whether you're a data engineer, data analyst, or data scientist, dbt packages can help you streamline your data pipeline and make the most of your data assets.

If you only rarely need to replicate data, your engineering team can easily handle it. However, for frequent transfers of massive volumes of data from multiple sources, your engineering team would need to constantly monitor and fix any data leaks. Alternatively, you can simplify the task with a cloud-based ELT solution like Hevo Data, which automates the data integration process for you and runs your dbt projects to transform the data in your data warehouse. At this time, dbt Core™ on Hevo is in BETA. Please reach out to Hevo Support or your account executive to enable it for your team.

Visit our Website to Explore Hevo

Offering 150+ plug-and-play integrations and saving countless hours of manual data cleaning and standardization, Hevo Data also offers built-in pre-load data transformations that you can set up in minutes via a simple drag-and-drop interface or your own custom Python scripts.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and simplify your data integration process. Check out the pricing details to understand which plan fulfills all your business needs.

Share your experience of learning about Data Build Tool (dbt)! Let us know in the comments section below!

Author

Sharon is a data science enthusiast with a passion for data, software architecture, and writing technical content. She has experience writing articles on diverse topics related to data integration and infrastructure.
