Keeping track of modifications in source data is essential to building a centralized, single source of truth. However, updating and adding data to a target table is not straightforward: it often requires big data tools to scan billions of records to track changes and transform the data. Consequently, organizations use dbt to speed up identifying changes and transforming the data. One of the best features of dbt is that it transforms data directly inside the data warehouse. In this article, you will learn about the dbt incremental model on BigQuery and the transformations it enables.
Businesses can transform data and make it accessible to downstream processes by using various data-handling tools in the market. However, dbt makes data transformation simple and fast. dbt is an open-source command line tool often used by data engineers to transform data in warehouses effectively.
With dbt, you can choose how dbt models are materialized. Materialization is simply the strategy dbt uses to persist a model in the data warehouse, and a dbt model is a SQL SELECT statement that defines the transformation. dbt offers four materializations, which can be set per model as shown in the sketch after this list:
- Table: When materialization is set to table, dbt runs the transformation once and persists the result as a table. As a result, downstream processes see stale data until the dbt model is rebuilt.
- View: When materialization is set to view, the model is stored as a view, so the transformation is re-run every time the view is queried. As a result, downstream processes always see near real-time data.
- Incremental: When materialization is set to incremental, dbt inserts or updates only the new or changed rows in the target table on each run. As a result, downstream processes get up-to-date data without rebuilding the whole table.
- Ephemeral: When materialization is set to ephemeral, the model is not built in the warehouse; instead, its code is interpolated into dependent models as a common table expression. Since this can get complex, it is advised to keep ephemeral models lightweight.
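As a quick illustration, the materialization can be set in a config() block at the top of a model's .sql file; the model name and the upstream stg_customers reference below are placeholders, not part of any real project.

-- models/my_model.sql (placeholder): pick table, view, incremental, or ephemeral
{{ config(materialized='table') }}

select * from {{ ref('stg_customers') }}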
Incremental Models
Incremental models allow dbt to insert or update records in a target table without leaving the data warehouse. They are usually adopted when working with event-style data. With incremental models, businesses can reduce build time by transforming only new or updated data, so only the latest records are processed. Depending on the complexity and volume of the data, an incremental update or append can complete in a few seconds or minutes.
Suppose a company has a source database of 1 billion records. Without an incremental strategy, any change in the source data would require hours to append to or update in the target table, because the entire dataset is scanned. With an incremental model, the company can identify and transform only the changed rows, saving computational time and resources.
Building an Incremental Model using dbt and BigQuery
Businesses often use dbt with incremental models to append or update data in tables. This reduces the cost of scanning hundreds of GB or TB of data. To avoid scanning such vast volumes, even for a slight change in information, they can use dbt incremental models in BigQuery.
Setting up BigQuery for dbt
- Navigate to the BigQuery Console and create a BigQuery account (assuming you are setting up BigQuery for the first time).
- Click on New Project on the top right to create a project.
- Provide the project name and other details like Organization and Location, if necessary. Click Create.
Load Data in BigQuery
- Go to the BigQuery Console and copy the query shown after this list into the Query Editor to fetch data from the public datasets supported by BigQuery.
- Find the project in the picker and click on the three dots to select create dataset.
- Provide a valid Dataset ID to reference the database objects. You will create one dataset for jaffle_shop and one for stripe; for now, use jaffle_shop as the Dataset ID.
- Click on Create Dataset.
- Now, repeat steps 1 to 4 to create a second dataset with stripe as the Dataset ID.
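The verification query from step 1 could look like the sketch below; it assumes the public dbt-tutorial project used in dbt's official BigQuery quickstart, and each statement can be run separately in the Query Editor.

select * from `dbt-tutorial.jaffle_shop.customers`;
select * from `dbt-tutorial.jaffle_shop.orders`;
select * from `dbt-tutorial.stripe.payment`;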
Create a Starter Project
- Create a GitHub repository and name it dbt-tutorial.
- Go to the command line and create the jaffle_shop project with the command: dbt init jaffle_shop.
- With any IDE, open the project directory. It includes .sql and .yml files created as a result of the previous step.
- Update the dbt_project.yml file by changing the values of name and profile to "jaffle_shop," and the key under models: to jaffle_shop, as in the sketch below.
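A minimal sketch of the relevant dbt_project.yml entries after this change (the other keys generated by dbt init are left out):

name: 'jaffle_shop'
profile: 'jaffle_shop'

models:
  jaffle_shop:
    +materialized: view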
Connect to BigQuery
The profiles.yml file contains all the details dbt needs to connect to a data warehouse like Google BigQuery.
- Create a file in the ~/.dbt/ directory named profiles.yml (a sketch of its contents is shown after these steps).
- Move the BigQuery keyfile into the directory.
- Debug to check the connection with:
$ dbt debug
> Connection test: OK connection ok
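For reference, a minimal profiles.yml for a BigQuery service-account connection might look like the sketch below; the GCP project, dataset, and keyfile path are placeholders for your own values.

jaffle_shop:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: your-gcp-project-id                      # placeholder
      dataset: jaffle_shop
      keyfile: /Users/you/.dbt/bigquery-keyfile.json    # placeholder
      threads: 1
      location: US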
You can read our article for a more detailed guide to connecting dbt to BigQuery.
Commit Changes to the GitHub Repository
- Execute a run command with dbt run and commit the changes.
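As a sketch (the commit message and branch name are arbitrary), the sequence would be:

$ dbt run
$ git add .
$ git commit -m "Add jaffle_shop starter project"
$ git push origin main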
Build the First Model
- Open the project in any IDE.
- In the models directory, create a new SQL file named models/customers.sql with the model shown below.
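A sketch of the model, along the lines of dbt's jaffle_shop tutorial; the source tables assume the public dbt-tutorial project queried earlier.

with customers as (
    select
        id as customer_id,
        first_name,
        last_name
    from `dbt-tutorial`.jaffle_shop.customers
),

orders as (
    select
        id as order_id,
        user_id as customer_id,
        order_date,
        status
    from `dbt-tutorial`.jaffle_shop.orders
),

customer_orders as (
    -- one row per customer with a summary of their orders
    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders
    from orders
    group by 1
)

select
    customers.customer_id,
    customers.first_name,
    customers.last_name,
    customer_orders.first_order_date,
    customer_orders.most_recent_order_date,
    coalesce(customer_orders.number_of_orders, 0) as number_of_orders
from customers
left join customer_orders using (customer_id)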
As dbt materialization defaults to view, the model above is built as a view. However, to obtain a dbt BigQuery incremental transformation, you need to change the materialization to incremental while building the model in the .sql file.
Like the view materialization, a dbt BigQuery incremental transformation is a SELECT statement. But for the SELECT statement to work as an incremental transformation, you must add the following config block before the SQL statement.
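A sketch of that block; the unique_key column is an assumption based on your model's primary key:

{{
  config(
    materialized='incremental',
    unique_key='customer_id'
  )
}}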
While building incremental transformations, you will mainly use two constructs: is_incremental() and {{ this }}.
- is_incremental(): This macro returns true when the model is configured as incremental, the target table already exists, and the run is not a full refresh. It is mainly used to wrap a filter so that only records added or updated since the last dbt run are transformed. Pairing it with a unique_key lets dbt identify existing rows and avoid duplicates.
- {{ this }}: This refers to the model's existing target table. It is typically used inside an is_incremental() block to query the target table, for example to find the timestamp of the most recently loaded record, which is useful when appending new rows.
The code below illustrates the general syntax for a dbt incremental transformation in BigQuery.
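A sketch of the overall pattern; the stg_orders source and the updated_at column are placeholders for your own model.

{{
  config(
    materialized='incremental',
    unique_key='order_id'
  )
}}

select
    order_id,
    customer_id,
    status,
    updated_at
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up rows that changed since the last run of this model
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}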
Type 1: Incremental Model
- Data is present in the data warehouse
- You have event-style data
A dbt incremental model on BigQuery can be used to calculate daily active users (DAU) from event-stream data: the DAU for days processed in previous runs is already in the target table, and only the days with new events since the last run need to be recomputed, as in the sketch below.
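A minimal sketch, assuming a hypothetical stg_events model with user_id and event_timestamp columns:

-- models/fct_daily_active_users.sql (hypothetical)
{{ config(materialized='incremental', unique_key='activity_date') }}

select
    date(event_timestamp) as activity_date,
    count(distinct user_id) as daily_active_users
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- recompute only the days touched since the last run; unique_key merges the partial day
  where date(event_timestamp) >= (select max(activity_date) from {{ this }})
{% endif %}

group by 1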
Type 2: Incremental Model
- Your data source doesn’t store metadata, but you want to start building it
- You would like to track the history of the downstream model
Suppose you want to calculate the sales of your e-commerce company. You run a query to get the income at the end of the month. After a few days, the income changes due to product returns, so you have to get the updated income for January. Now you have two versions of the January income. Say you then fix a bug in your query, and the updated January income differs from the previous two. At this point, you have three different versions of the January income. Which figure would you provide to the decision-maker?
To handle this, you could snapshot the final model (fct_income) so that end-users can filter to the version of the January income they want to see. However, dbt snapshots are generally recommended on source data rather than on a final model, since snapshotting a complex model is harder to maintain. An alternative is to create an incremental model downstream of the final model that captures every version of the January income.
The configuration for capturing the incremental history (let’s call it int_income_history) would look like this:
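A minimal sketch, assuming fct_income exposes an income_month and a total_income column:

-- models/int_income_history.sql (sketch)
{{ config(materialized='incremental') }}

-- with no is_incremental() filter, every dbt run appends the current figures,
-- so each version of a month's income is preserved with its capture time
select
    income_month,
    total_income,
    current_timestamp() as captured_at
from {{ ref('fct_income') }}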
The model can be further updated with additional configurations:
- on_schema_change: handles schema changes when columns are added to or removed from the model's query
- schema: builds the historical table in a dedicated schema so role-based permissions can be applied to it
- full_refresh: set to false to help prevent accidental loss of the historical data during a --full-refresh run
The final configuration block for int_income_history would be as follows:
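A sketch of the combined block; the append_new_columns and history values are example choices, not requirements.

{{
  config(
    materialized='incremental',
    on_schema_change='append_new_columns',
    schema='history',
    full_refresh=false
  )
}}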
Ultimately, you can create a fct_income_history model on top of int_income_history to surface the correct version. The new DAG would then flow fct_income → int_income_history → fct_income_history.
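A sketch of such a model, reusing the hypothetical columns from int_income_history above:

-- models/fct_income_history.sql (sketch)
select
    income_month,
    total_income,
    captured_at,
    row_number() over (
        partition by income_month
        order by captured_at desc
    ) as version_rank   -- 1 = the most recently captured version of each month
from {{ ref('int_income_history') }}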
The output of fct_income_history would then show each captured version of the January income along with the timestamp at which it was recorded, so end-users can trace how the figure changed and pick the version they need.
Rebuild the Table
If you want to rebuild the data because the underlying transformation logic has changed, use the --full-refresh flag on the command line. This drops the existing target table and rebuilds it from scratch.
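For example (the second command limits the rebuild to a single model):

$ dbt run --full-refresh
$ dbt run --full-refresh --select int_income_history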
Conclusion
The dbt incremental model on BigQuery enables data engineers and database administrators to transform data without leaving the data warehouse. As a result, it is one of the simplest ways to transform data using SQL queries. dbt may not be straightforward to adopt, as it requires technical skills to implement, but it helps you optimize resources and speed up the transformation process. To simplify adoption, you can use dbt Cloud instead of the command-line tool.
For cases when you rarely need to replicate data, your engineering team can easily do it. However, for frequent transfers of massive data volumes from multiple sources, your engineering team would need to constantly monitor and fix any data leaks. Or you can simply hop onto a smooth ride with cloud-based ELT solutions like Hevo Data, which automates the data integration process for you and runs your dbt projects to transform data in your data warehouse. At this time, dbt Core™ on Hevo is in BETA; please reach out to Hevo Support or your account executive to enable it for your team.
Visit our Website to Explore Hevo
Offering 150+ plug-and-play integrations and saving countless hours of manual data cleaning & standardizing, Hevo Data also offers in-built pre-load data transformations that get it done in minutes via a simple drag-n-drop interface or your custom python scripts.
Want to take Hevo for a spin? Sign Up for a 14-day free trial and simplify your data integration process. Check out the pricing details to understand which plan fulfills all your business needs.
Share your experience of learning about Data Build Tool(dbt)! Let us know in the comments section below!
Manjiri is a proficient technical writer and a data science enthusiast. She holds an M.Tech degree and leverages the knowledge acquired through that to write insightful content on AI, ML, and data engineering concepts. She enjoys breaking down the complex topics of data integration and other challenges in data engineering to help data professionals solve their everyday problems.