As data volumes continue to grow, organizations look for ways to make sense of them, and data warehouses are at the center of that effort. BigQuery is a popular cloud-based data warehouse that enables powerful analytics and querying at scale. However, many businesses struggle to effectively clean, standardize, and transform their raw data in BigQuery. This is where dbt (Data Build Tool) comes in. dbt is an open-source tool that allows you to build, test, and maintain data transformations in your data warehouse. 

In this article, we’ll explore how to set up the dbt BigQuery integration and transform data with dbt and BigQuery. We’ll also provide best practices for optimizing performance to ensure your data warehouse runs smoothly and efficiently. 

dbt with BigQuery: 3 Major Features

By setting up the dbt BigQuery connection, you can make your BigQuery data management tasks more effective and efficient by leveraging dbt's salient features. To name a few:

  • dbt Snapshot: dbt allows you to take incremental snapshots of BigQuery tables (a minimal sketch follows this list). You can also monitor and check the freshness of your BigQuery tables and views declared in YAML files using dbt source snapshot-freshness.
  • dbt Docs: Complete metadata documentation of your BigQuery tables and views.
  • dbt Models: Model your BigQuery data using SQL files.
  • dbt Exposure: Effectively manage metadata of data consumers.
  • dbt Analyses: Easily share BigQuery queries.    
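
For instance, a dbt snapshot is declared in a .sql file under the snapshots/ directory and built with the dbt snapshot command. The block below is only a minimal sketch using the timestamp strategy; it assumes the production.users source and its updated column, which are declared later in this article, and the target_schema name is a placeholder.

{% snapshot users_snapshot %}
{{
  config(
    target_schema='snapshots',
    unique_key='id',
    strategy='timestamp',
    updated_at='updated'
  )
}}
SELECT * FROM {{ source('production', 'users') }}
{% endsnapshot %}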

However, the most important features of your dbt BigQuery connection are the following three:

dbt Source

To handle existing BigQuery tables and views, you can use dbt sources. You declare dbt sources in YAML files and refer to them with the {{ source() }} macro. 

The dbt BigQuery connection lets you run quality checks even on existing BigQuery data. Unlike Airflow's check operators, which require you to write SQL queries, dbt ships with pre-built tests named unique, not_null, accepted_values, and relationships. You also have the option to create custom tests using SQL and third-party packages such as fishtown-analytics/dbt-utils, which is very convenient. In the sample code given below, unique and not_null tests are performed on the id column.

Using dbt's freshness check, you can monitor the data freshness of the users table. For instance, if the data loader fails to update the data completely, the data freshness job scheduled for this check will raise an alert.

version: 2
sources:
  - name: production
    tables:
      - name: users
        loaded_at_field: updated
        freshness:
          warn_after:
            count: 12
            period: hour
          error_after:
            count: 24
            period: hour
        description: |
          ## Overview
          The table contains user data.
        labels:
          owner: "customer_data_team"
          data_source: "MySQL"
        columns: 
          - name: id
            description: "user ID"
            tests:
              - unique
              - not_null
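
With the source declared above, both checks can be run from the command line. As a rough sketch (assuming the project and profile are already configured), the first command runs the unique and not_null tests on the id column, and the second evaluates the warn_after and error_after thresholds against the updated column:

$ dbt test --models source:production.users
$ dbt source snapshot-freshness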

dbt Models

While creating dbt BigQuery models, you often need intermediate tables, such as a table of user attributes. For instance, you may need denormalized user information, such as when each user made their first purchase. Building this directly would require writing complex queries that are often difficult to maintain.

This is where dbt saves the day! dbt allows you to take control of complicated dependencies by splitting the logic into intermediate tables. 

With dbt's Jinja2 template engine, you can use Jinja2 features as well as the macros provided by dbt. The config macro lets you control a dbt model's behavior. For instance, materialized sets the strategy used to persist the dbt model in the warehouse, whereas {{ source('product', 'transactions') }} is a macro that refers to the transactions table as a dbt source.

-- first_purchase.sql
{{
  config(
    materialized="table",
    alias="first_purchases",
    owner="data_analytics_team",
    persist_docs={"relation": true, "columns": true},
    tags=["daily"],
  )
}}

SELECT
  buyer_id AS user_id
  , MIN(created_at) AS first_purchase
FROM {{ source('product', 'transactions') }}
WHERE
  status = "done"
GROUP BY 1

You can also test intermediate tables and views with dbt via YAML files, for instance, by checking that the user_id column of the first_purchase table is unique.

version: 2
models:
  - name: first_purchase
    labels:
      owner: data_analytics_team
      interval: daily
      status: experimental
    description: |
      ## Overview
      The table contains the first purchase timestamp by user.
    columns:
      - name: user_id
        description: "user ID"
        tests:
          - unique
          - relationships:
              to: source('product', 'users')
              field: id
      - name: first_purchase
        description: "first purchase timestamp"
        tests:
          - not_null

Finally, after creating multiple intermediate tables associated with user attributes, you can join them together and save the result to an intermediate table named user_attributes. Here, we use the {{ ref() }} macro to refer to intermediate tables modeled by dbt.

-- user_attributes.sql
{{
  config(
    materialized="table",
    alias="user_attributes",
    persist_docs={"relation": true, "columns": true},
    tags=["daily", "only_prod"],
  )
}}
SELECT
  u.*
  , first_purchase.* EXCEPT (user_id)
  , first_listing.* EXCEPT (user_id)
FROM {{ source('product', 'users') }} AS u
LEFT OUTER JOIN {{ ref('first_purchase') }} AS first_purchase
  ON u.id = first_purchase.user_id
LEFT OUTER JOIN {{ ref('first_listing') }} AS first_listing
  ON u.id = first_listing.user_id
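
Under the hood, dbt compiles the {{ source() }} and {{ ref() }} macros into fully qualified BigQuery table references. Purely as an illustration, assuming a hypothetical GCP project my-gcp-project, a source dataset product, and a target dataset analytics, the compiled SQL written to the target/compiled directory would look roughly like this:

SELECT
  u.*
  , first_purchase.* EXCEPT (user_id)
  , first_listing.* EXCEPT (user_id)
FROM `my-gcp-project`.`product`.`users` AS u
LEFT OUTER JOIN `my-gcp-project`.`analytics`.`first_purchase` AS first_purchase
  ON u.id = first_purchase.user_id
LEFT OUTER JOIN `my-gcp-project`.`analytics`.`first_listing` AS first_listing
  ON u.id = first_listing.user_id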

dbt CLI

dbt Core, the open-source version of dbt, offers a command-line interface. Some common commands you can execute for your dbt BigQuery models are:

  • dbt run: Executes all dbt models defined in SQL files.
  • dbt test: Executes all tests defined in YAML files.
  • dbt source snapshot-freshness: Checks the data freshness rules defined in YAML files.

dbt also completely automates running queries in the right order by interpreting the dependencies expressed with the {{ source() }} and {{ ref() }} macros. For example, if model B depends on model A, dbt automatically runs model A and then model B. This holds as long as the dependency graph is acyclic; dbt raises an error if it detects a circular dependency.

As shown in the data lineage made by dbt, the {{ source() }} and {{ ref() }} macros automatically manage the complicated dependencies. 

dbt BigQuery - Data Lineage

With the easy model selection syntax, you can also select and run only specific models. You can pass conditions to the --models option, such as adding the + prefix to build all the intermediate tables required to create user_attributes. You can also declare the same condition in a selectors.yml YAML file, as sketched after the command below.

$ dbt run --models +models/user_attributes.sql
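
The same selection can also be expressed as a named selector in selectors.yml and run with dbt run --selector user_attributes_and_upstream. The selector name below is only an example:

# selectors.yml
selectors:
  - name: user_attributes_and_upstream
    definition:
      method: path
      value: models/user_attributes.sql
      parents: true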

How to set up dbt BigQuery Connection?

dbt comes in two flavors: dbt Core, an open-source command-line tool, and dbt Cloud, a hosted offering with more functionality and greater ease of use. 

To install dbt Core for BigQuery, follow these steps:

  • Step 1: Open ‘Cloud Shell’ from the GCP dashboard and run the following command:
pip install dbt-bigquery
  • Step 2: Create a service account for dbt to use and allow that service account to create BigQuery tables and run queries. For this step, you can check out the detailed stepwise guide for connecting dbt to BigQuery.
  • Step 3: Copy a profile and your service account key file into ~/.dbt (see the sample profiles.yml after this list).
  • Step 4: Create a new dbt project with dbt init.
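
The following is a minimal sketch of what ~/.dbt/profiles.yml could contain for the service-account method. The profile name matches the example project used later in this article, while the project ID, dataset name, and key file path are placeholders you should replace:

london_bike_analysis:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: your-gcp-project-id           # placeholder GCP project ID
      dataset: dbt_london_bike               # dataset dbt writes models into
      keyfile: /home/your_user/.dbt/service-account.json
      threads: 4
      timeout_seconds: 300                   # default BigQuery job timeout
      priority: interactive
      retries: 1
      location: US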

If you opt for dbt Cloud, follow these steps to set up the dbt BigQuery Connection:

  • Step 1: Create a dbt account if not done already. Log in, click on the hamburger icon, and click on Projects.
  • Step 2: Click the New Project button to create a new one or choose from the samples.
dbt BigQuery - New Project Button
  • Step 3: Go to GCP and create a service account with the IAM roles BigQuery Job User and BigQuery User using the API Credentials Wizard (see the gcloud sketch after this list). Download the JSON file containing the BigQuery credentials.
  • Step 4: On your dbt Cloud page, when asked to select a data platform, click on BigQuery. Click the Upload a Service Account JSON file option to read the BigQuery settings from the JSON file you downloaded. Next, click the Test button to check that the dbt BigQuery connection works.
  • Step 5: Provide the name of your dataset.
  • Step 6: You can now connect a Git repository of your choice or use a dbt Cloud-managed repository. This completes your dbt BigQuery Connection.
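
If you prefer the command line over the API Credentials Wizard, the service account from Step 3 could also be created with gcloud, roughly as shown below. The project ID and the service account name dbt-runner are placeholders:

$ gcloud iam service-accounts create dbt-runner --project=your-gcp-project-id
$ gcloud projects add-iam-policy-binding your-gcp-project-id \
    --member="serviceAccount:dbt-runner@your-gcp-project-id.iam.gserviceaccount.com" \
    --role="roles/bigquery.jobUser"
$ gcloud projects add-iam-policy-binding your-gcp-project-id \
    --member="serviceAccount:dbt-runner@your-gcp-project-id.iam.gserviceaccount.com" \
    --role="roles/bigquery.user"
$ gcloud iam service-accounts keys create dbt-credentials.json \
    --iam-account=dbt-runner@your-gcp-project-id.iam.gserviceaccount.com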

Build dbt BigQuery Data Pipelines

As an example, this dbt BigQuery article considers BigQuery’s public datasets on London bicycle hires and stations. Follow the simple steps given below to transform your data using dbt:

  • Step 1: For instance, suppose you need to analyze bike ride activity for the year 2017. You can extract the specific fields of the cycle_hire table for 2017 using the following code. First, you need to create an SQL file, i.e., raw_bike_hires.sql, in the models directory and then paste this code into it:
SELECT 
    rental_id
    , duration as duration_seconds
    , duration / 60 as duration_minutes
    , bike_id
    , start_date
    , start_station_id
    , start_station_name
    , end_date
    , end_station_id
    , end_station_name
FROM `bigquery-public-data.london_bicycles.cycle_hire` 
WHERE EXTRACT(year from start_date) = 2017
  • Step 2: Similarly, you can extract specific data for bike stations by creating an SQL file, i.e., raw_bike_stations.sql, and pasting the following code:
SELECT 
    id
    , name as station_name
    , bikes_count
    , docks_count
    , install_date
    , removal_date
FROM `bigquery-public-data.london_bicycles.cycle_stations`
WHERE install_date < '2017-01-01' and (removal_date < '2018-01-01' or removal_date is null)
  • Step 3: For the bike ride analysis, we need a few additional fields, such as total minutes, average duration, total bikes hired, and whether the bike was hired during peak hours. Taking data from the raw_bike_hires model created in the first step, you can use the following code to add these new fields. Paste it into a newly created SQL file, i.e., cleaned_bike_rides.sql:
SELECT
    SUM(duration_minutes) as total_minutes
    , COUNT(rental_id) as total_bike_hires
    , ROUND(SUM(duration_minutes) / COUNT(rental_id), 2) AS average_duration
    , EXTRACT(month from start_date) as month
    , CASE
        WHEN EXTRACT(HOUR from TIMESTAMP(start_date)) >= 6 AND EXTRACT(HOUR from TIMESTAMP(start_date)) <= 10 THEN 'Morning Peak'
        WHEN EXTRACT(HOUR from TIMESTAMP(start_date)) >= 16 AND EXTRACT(HOUR from TIMESTAMP(start_date)) <= 19 THEN 'Evening Peak'
        ELSE 'Off-Peak'
      END AS start_peak_travel
    , IF(start_station_id = end_station_id, True, False) as same_station_flag
    , start_station_id
    , start_station_name
    , end_station_id
    , end_station_name
FROM {{ ref('raw_bike_hires') }}
GROUP BY 4,5,6,7,8,9,10
ORDER BY total_minutes DESC
  • Step 4: You can now join the tables into a final table with data aggregated by journeys with the same stations, month, and time of day. Create a new SQL file, i.e., 2017_rides_by_month.sql, and paste the following code:
WITH stations AS (
    SELECT *
    FROM {{ ref('raw_bike_stations') }}
),

rides AS (
    SELECT *
    FROM {{ ref('cleaned_bike_rides') }}
),

start_stat_join AS (
    SELECT rides.*
    , stations.bikes_count as start_station_bikes_count
    , stations.docks_count as start_station_docks_count
    , stations.install_date as start_station_install_date
    FROM rides
    LEFT JOIN stations
    ON rides.start_station_id = stations.id
)

SELECT 
    total_minutes 
    , total_bike_hires 
    , average_duration 
    , month 
    , start_peak_travel
    , same_station_flag
    , start_station_id
    , start_station_name
    , start_station_bikes_count 
    , start_station_docks_count 
    , start_station_install_date 
    , end_station_id
    , end_station_name
    , stations.bikes_count as end_station_bikes_count
    , stations.docks_count as end_station_docks_count
    , stations.install_date as end_station_install_date
FROM start_stat_join
LEFT JOIN stations
ON start_stat_join.end_station_id = stations.id
  • Step 5: Finally, you have to update the settings in your dbt project configuration file, i.e., the dbt_project.yml file in your dbt project folder. This file tells dbt how to operate your dbt project. You can overwrite your YAML file with the following code:
name: 'london_bike_analysis'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'london_bike_analysis'

# These configurations specify where dbt should look for different types of files.
source-paths: ["models"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
data-paths: ["data"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"  # directory which will store compiled SQL files
clean-targets:         # directories to be removed by `dbt clean`
    - "target"
    - "dbt_modules"

models:
  london_bike_analysis:
    +materialized: table
  • Step 6: You can now run the dbt run command, and it will create all the models. Your BigQuery dataset will then contain the following tables:
dbt BigQuery - New Datasets
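
With the models in place, you can also add tests for them, just as shown earlier for sources. The schema.yml below is only a hypothetical example; the columns and tests you choose will depend on your analysis:

# models/schema.yml
version: 2
models:
  - name: raw_bike_hires
    columns:
      - name: rental_id
        tests:
          - unique
          - not_null
  - name: 2017_rides_by_month
    columns:
      - name: month
        tests:
          - not_null

Running dbt test after dbt run then validates the freshly built tables.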

dbt BigQuery Best Practices 

To get your dbt BigQuery connection to work optimally, you can go through the following recommendations:

  • Set up priorities for your BigQuery jobs. This allows for better server resource allocation, especially under heavy loads. 
  • When working with larger datasets, you can extend the default timeout period of 300 seconds so that long-running queries have time to finish.
  • dbt also allows you to set the number of times a query should be retried before returning an error, as well as the delay between retries. As a good practice, a query should be retried three to five times before returning an error. 
  • Try configuring your .sql files to be leaner on CPU usage, or simply address performance issues through scheduling.
  • Avoid using self-joins to prevent multiplying the number of output rows. 
  • You can use the yourDataset.yourTable notation to point a query at a specific source and pre-filter rows before they feed your downstream models.
  • Using GROUP BY offers a more flexible dbt implementation. By pre-aggregating, you can avoid cross joins and Cartesian products.
  • Hooks like on-run-start let you check whether a required table exists and have dbt create it automatically if it doesn't. on-run-end, on the other hand, is perfect for granting permissions on schemas (see the sketch after this list).
  • Seeds are handy for performing specific tasks, like excluding certain IP addresses or email addresses in your query. A seed is simply a CSV file that dbt loads as a table. For example, you can create a list of excluded addresses and add the CSV file to your dbt project.
  • Use Jinja to quickly run expressions, statements, and comments, define macros, and use references on the fly.
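
As a rough illustration of the hook idea above, both hooks go into dbt_project.yml as lists of SQL statements. The audit table, the granted role, and the grantee below are all hypothetical:

# dbt_project.yml (excerpt)
on-run-start:
  # Create a hypothetical audit table before any model runs
  - "CREATE TABLE IF NOT EXISTS {{ target.schema }}.dbt_audit_log (model STRING, run_at TIMESTAMP)"
on-run-end:
  # Grant read access on the target dataset once the run finishes
  - 'GRANT `roles/bigquery.dataViewer` ON SCHEMA {{ target.schema }} TO "user:analyst@example.com"'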

Conclusion

With this article, you know how to get started with the dbt BigQuery connection and build data models for business requirements. You have also learned to use the dbt BigQuery integration optimally by following best practices. Using dbt on top of your data warehouse makes the data transformation process more effective and effortless. However, you still need to pull data from all your sources and load it into your data warehouse. 

For cases when you rarely need to replicate data, your engineering team can easily do it. For frequent and massive data transfers from multiple sources, your engineering team would need to constantly monitor and fix any data leaks. Or you can simply hop onto a smooth ride with cloud-based ELT solutions like Hevo Data, which automates the data integration process for you and runs your dbt projects to transform data present in BigQuery. At this time, dbt Core™ on Hevo is in BETA. Please get in touch with Hevo Support or your account executive to enable it for your team.


Offering 150+ plug-and-play integrations and saving countless hours of manual data cleaning and standardizing, Hevo Data also offers in-built pre-load data transformations that get it done in minutes via a simple drag-and-drop interface or your custom Python scripts. 

Want to take Hevo for a spin? Sign Up for a 14-day free trial and simplify your data integration process. Check out the pricing details to understand which plan fulfills all your business needs.

Share your experience learning about dbt BigQuery integration and effectively transforming data using dbt with BigQuery! Let us know in the comments section below!
