dbt (data build tool) is an open-source command-line tool that enables data analysts and engineers to transform data in their warehouse more effectively using SQL and software engineering best practices. It moves transformation logic out of complicated ETL pipelines and into simple SQL-based workflows inside modern cloud data warehouses such as Redshift, Snowflake, and BigQuery. At its core, dbt compiles model code into raw SQL and runs the compiled SQL against a configured data warehouse.
In modern data engineering, reproducibility, cost optimization, and efficiency are essential when managing data transformation workflows. dbt has become a cornerstone tool that helps data analysts and data engineers test, transform, and document data using code. One of its underutilized yet powerful features is the Clone feature. dbt clone facilitates the creation of database clones by leveraging the native cloning capabilities of supported data platforms like Snowflake, BigQuery, and Databricks.
What is a Clone?
dbt clone is a command that creates replicas of selected models from a defined state into a target schema. It efficiently leverages the underlying database’s cloning capabilities to duplicate tables, schemas, and even entire datasets almost instantly, with minimal storage overhead. This allows for quick and cost-effective replication of data warehouse objects, which is useful for deployments and for refreshing development environments.
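As a minimal sketch, assuming production run artifacts (manifest.json) have been saved to a local prod-run-artifacts directory, cloning a model and its descendants into the current target might look like this (the artifact path and model name are illustrative placeholders):

```bash
# Clone my_model and everything downstream of it, as recorded in the
# state captured in prod-run-artifacts, into the active target's schema.
dbt clone --state prod-run-artifacts --select my_model+
```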
Key Characteristics of a Clone
The main characteristics of dbt clone include the following:
- Version Dependency: Requires dbt v1.6+ and works only with warehouses that support zero-copy cloning.
- State Comparison Integration: Works with artifacts to detect and clone only the necessary models from the reference environment.
- Non-Destructive Operation: Preserves existing relations by default, with an option for full refresh to force recreation.
- BI Tool Compatibility: Builds physical database objects, enabling validation in downstream tools.
- Cross-Environment Cloning: Clones database models from a particular state to the target schema, supporting duplication across development and deployment environments.
- CI/CD Optimization: Enables testing against a complete copy of production-like data without requiring full rebuilds of models.
- Materialization Flexibility: Uses the database’s native clone functionality when available and falls back to creating views that point to the source relations when native cloning is not supported.
- Execution Control: Does not support full refresh in the same way dbt run does (see the sketch after this list).
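A sketch of the state-comparison and refresh behavior described above, assuming the same illustrative prod-run-artifacts path:

```bash
# State comparison: clone only the models that differ from the
# production manifest, plus their descendants.
dbt clone --state prod-run-artifacts --select state:modified+

# Non-destructive by default: pre-existing relations are preserved.
# Pass --full-refresh to force recreation of existing clones.
dbt clone --state prod-run-artifacts --full-refresh
```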
Difference Between dbt clone and dbt run
dbt clone and dbt run differ in purpose, mechanism, and performance within a dbt project, as summarized below; a command-level sketch follows the table.
| Feature | dbt clone | dbt run |
| --- | --- | --- |
| Primary Purpose | Create copies of existing models | Transform data by executing model SQL |
| Data Processing | No data processing occurs | Data is processed according to model logic |
| Underlying Mechanism | Leverages database cloning features | Executes SQL queries against the database |
| Performance | Depends on the database cloning implementation | Depends on model complexity and data volume |
| Typical Use Case | Environment creation, testing, CI/CD | Regular data transformation pipelines |
| Storage Impact | Minimal for zero-copy clones | Full storage for materialized results |
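At the command line, the contrast looks like this (the model name and artifact path are placeholders):

```bash
# dbt run executes the model's compiled SQL, processing data in the warehouse.
dbt run --select customers

# dbt clone copies the already-built relation recorded in the state
# artifacts; no model SQL runs and, on zero-copy platforms, no data moves.
dbt clone --state prod-run-artifacts --select customers
```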
Benefits of Cloning
dbt clone offers several benefits, mainly centered on the following:
- Migration Flexibility: clone streamlines database migration by enabling compatibility testing in cloned environments.
- Modular Workflows: It allows developers to source production data for unmodified upstream models instead of recreating them.
- Rapid Onboarding: It allows new team members to clone production schemas quickly, rather than waiting for full model rebuilds.
- Cost Effective: clone leverages zero-copy cloning to create models without replicating data, reducing storage costs and speeding up environment setup.
- Safe Development: It prevents data drift in production environments by allowing modifications to be tested in an isolated clone.
- Collaboration and Validation: clone allows business stakeholders, data engineers, and analysts to work in a shared environment, promoting collaboration. It also supports centralized access control and detailed documentation, helping ensure compliance with industry standards.
Use Cases of Clone
1. CI/CD Pipeline Testing
Purpose: Enable efficient testing of incremental models inside CI/CD pipelines.
How it works: dbt clone restores a production-like state of the data after a merge, so incremental model changes can be tested in the pipeline without the cost of a full refresh.
2. Incremental Models
Purpose: Testing incremental models without a full refresh.
How it works: dbt clone is used to simulate the state of the data after merging, permitting incremental model changes to be tested without the cost of a full refresh, as sketched below.
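A minimal CI sketch of this pattern, assuming production artifacts are available at the illustrative prod-run-artifacts path:

```bash
# Step 1: clone the current production version of each modified
# incremental model, so the CI schema starts from realistic data.
dbt clone --state prod-run-artifacts \
  --select state:modified+,config.materialized:incremental

# Step 2: run and test the modified models incrementally on top of
# the clones, instead of paying for a full refresh.
dbt build --state prod-run-artifacts --select state:modified+
```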
3. Staging Warehouses
Purpose: Generating separate environments for testing staging datasets in BI tools.
How it works: A duplicate of the production dataset is made available in the downstream BI tool, allowing safe experimentation without affecting production data.
4. Development and Testing
Purpose: Provides a safe sandbox for developers to test their dbt models without disrupting live data.
How it works: By creating a replica of production schemas or tables, it enables developers to experiment with code changes and data transformations without risking the integrity of the main data warehouse.
5. Blue/Green Deployments
Purpose: Facilitates seamless transitions between development environments and production without downtime or data loss.
How it works: A replica of the production environment is created for testing and validating new code changes. Once validated, the replicated environment is swapped with production, completing the blue/green deployment.
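One way this might look on Snowflake, assuming a staging target defined in profiles.yml that writes to a schema separate from production (names are placeholders):

```bash
# Build the "green" environment from production state, then validate it.
dbt clone --state prod-run-artifacts --target staging
dbt build --target staging

# If all tests pass, promote by swapping schemas in the warehouse, e.g.
# in Snowflake SQL: ALTER SCHEMA analytics_staging SWAP WITH analytics;
```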
Limitations of Clone
- Cost Considerations: While clone can be used to build efficient CI/CD pipelines, it may also incur overhead in compute and model building, potentially leading to higher costs in cloud warehouses.
- Orchestration and Scheduling: Scheduling and orchestration need to be handled by external tools, as clone does not provide built-in capabilities for either.
- Batch Processing Focus: clone is primarily designed for batch processing, making it less suitable for real-time data pipelines where instant data availability is critical.
- Warehouse Support: clone relies on zero-copy cloning, a feature not supported by all data warehouses.
Considerations of Clone
- Cost Optimization: dbt clone can help optimize costs by enabling systematic CI/CD pipelines and reducing the need for full-refresh builds.
- Integration with Other Tools: Integrating dbt clone with tools for monitoring, scheduling, and orchestration ensures a smooth and reliable data workflow.
- Testing Incremental Models: dbt clone supports testing incremental models by duplicating them as the first step in a CI job, avoiding expensive full-refresh builds in warehouses that support zero-copy cloning.
- CI/CD Pipelines: dbt clone is a valuable tool for building CI/CD pipelines when combined with testing and documentation steps, ensuring that data changes are safe to deploy.
- Blue/Green Deployments: It can facilitate blue/green deployments by creating a staging dataset that is tested and promoted to production only if all validations pass, reducing both downtime and deployment risk.
Best Practices for Using dbt Clone Effectively
Practices like these help keep cloning efficient while avoiding common pitfalls. Teams that apply them typically improve production stability, reduce development cycle times, and lower testing costs.
The practices include the following:
- Strategic clone naming conventions
- Lifecycle management
- Cost control measures
- Development workflow optimization
- CI/CD execution
- Security and governance
- Performance optimization
- Validation and lineage
- Platform-specific optimizations
Real World Examples of dbt Clone
1. Development Sandboxing at a FinTech Company
Challenge: Analysts needed to test new fraud detection logic without affecting production models.
Solution: Created per-developer clones of the core models, each following a naming convention, then modified the clones with experimental logic while production remained untouched.
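A hypothetical sketch of that setup, assuming a fraud_detection tag on the relevant models and a dev target whose schema is set per developer in profiles.yml:

```bash
# Clone the tagged models (and descendants) from production state into
# the developer's own schema, e.g. dbt_<username>.
dbt clone --state prod-run-artifacts --select tag:fraud_detection+ --target dev
```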
2. CI/CD Pipeline for Healthcare
Challenge: Isolated test environments were required for 20+ pull requests.
Solution: Duplicated production models using state comparison, ran tests against the cloned subsets, and auto-expired the duplicates after a set period.
3. Retail Chain’s Promotion Analysis
Challenge: Needed to compare holiday promotion scenarios side by side.
Solution: Created three clones of the sales models, implemented different discount logic in each, and ran comparative analyses.
4. Data Recovery at SaaS Startup
Challenge: Restore accurate financial reporting after an accidental full refresh was run.
Solution: Identified the last good state from the artifacts produced by dbt docs generate, used Snowflake time-travel cloning to recover it, validated data integrity, and updated production references.
5. Manufacturing Quality Analysis
Challenge: Comparing defect rates across factory locations.
Solution: Cloned the production quality models, applied plant-specific thresholds to each, and created comparative reports.
Common Clone Issues
- Clone generation failures
- Data discrepancy problems
- Production degradation
- CI/CD pipeline failures
- Storage problems
- Dependency issues
- Time travel restriction
Future Enhancements for Clone
These enhancements could elevate clone from a tactical feature to a strategic data management capability. Potential improvements include:
- Simplified User Experience: Provide more informative output about the cloning process, such as the number of models cloned, any errors encountered, and relevant metadata. Also, introduce tools or workflows to easily track and manage clones.
- Enhanced CI/CD Integration: The ability to automatically duplicate source data during the build process could streamline development workflows, especially for large projects.
- Advanced Clone Types: Enable filtering during cloning to selectively clone only specific or sensitive data.
- Intelligent Clone Management: Support for automatically selecting the optimal cloning method and incorporating time-to-live (TTL) settings directly within dbt.
- Flexible Clone Standardization: Leverage Snowflake’s zero-copy cloning capabilities to create more structured and cost-effective development environments.
Conclusion
The clone feature has emerged as a transformative capability for modern data teams, fundamentally changing how organizations approach development, testing, and environment management. It leverages zero-copy cloning when the data platform supports it and falls back to creating views when it does not. By exposing native cloning features through a standardized interface, dbt clone provides three primary advantages: faster development cycles, improved performance, and enhanced data reliability.
Take it a step further with Hevo Transformer — our powerful dbt Core integration that lets you transform data directly in your warehouse with zero setup. Automate, manage, and scale your transformations effortlessly. Learn more about Hevo Transformer.
FAQs
1. How is a clone different from a deferral?
Clone generates real models in the warehouse, which matters when you need exact copies of objects for testing or for use by downstream tools outside dbt. Deferral, by contrast, only rewrites references inside a model’s SQL, and is generally less expensive because nothing new is built.
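Side by side, with placeholder names and artifact paths:

```bash
# Deferral: rewrites refs to point at production relations; only the
# selected model is built, nothing else is materialized.
dbt run --select my_model --defer --state prod-run-artifacts

# Clone: creates a real (zero-copy) object in the target schema that a
# BI tool or ad-hoc query can reference directly.
dbt clone --select my_model --state prod-run-artifacts
```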
2. What is the aim of cloning sources?
The main aim is to ensure that both environments (CI and production) use the same replica of the input data. This is mostly relevant for CI jobs, confirming that the code changes under test do not break the transformation process.
3. What does the clone do?
dbt clone duplicates database objects, such as tables and views, from a source schema to a target schema on platforms like BigQuery, Snowflake, or Databricks.