dbt (data build tool) is an open-source command-line tool that enables data analysts and engineers to transform data in their warehouse more effectively using SQL and software engineering best practices. It moves transformation logic out of complicated ETL pipelines and into simple SQL-based workflows inside modern cloud data warehouses such as Redshift, Snowflake, and BigQuery. dbt compiles model code into raw SQL and runs the compiled SQL against the configured data warehouse.

In modern data engineering, reproducibility, cost optimization, and efficiency are essential when managing data transformation workflows. dbt has become a cornerstone tool that helps data analysts and data engineers test, transform, and document data using code. One of its underutilized yet powerful features is the clone command. dbt clone facilitates the creation of database clones by leveraging the native cloning capabilities of supported data platforms like Snowflake, BigQuery, and Databricks.

What is a Clone?

dbt clone is a command or feature that helps create a replica of selected models from a defined state to a target schema. It efficiently leverages the underlying database’s cloning capabilities to duplicate tables, schemas, and even datasets instantly, with minimal storage overhead. This allows for quick and cost-effective replication of data warehouse objects, useful for deployments and updating development environments.
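As a minimal sketch, assuming a manifest.json from a recent production run has been saved to a prod-run-artifacts/ folder (the path and model name below are placeholders):

    # clone the models recorded in the production manifest into the current target schema
    dbt clone --state prod-run-artifacts

    # clone a single model and everything upstream of it (dim_customers is a hypothetical model name)
    dbt clone --select +dim_customers --state prod-run-artifacts

Because zero-copy clones only copy metadata, commands like these typically complete in seconds regardless of table size.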

Key Characteristics of a Clone

The main characteristics of clone include the following:

  • Version Dependency: Requires dbt v1.6 or later; native zero-copy cloning applies only on warehouses that support it.
  • State Comparison Integration: Works with state artifacts (the manifest from a previous run) to detect and clone only the necessary models from the reference environment.
  • Non-Destructive Operation: Preserves existing relations by default, with an option for full refresh to force recreation.
  • BI Tool Compatibility: Builds physical database objects, enabling validation in downstream tools.
  • Cross-Environment Cloning: Clones database models from a particular state to the target schema, supporting duplication across development and deployment environments.
  • CI/CD Optimization: Enables testing against a complete copy of production-like data without requiring full rebuilds of models.
  • Materialization Flexibility: Uses the database’s native clone functionality when available and falls back to creating views that point at the source relations when native cloning is not supported.
  • Execution Control: The --full-refresh flag overwrites existing clones in the target schema; it does not re-run transformation logic the way dbt run does (see the command sketch after this list).
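For example, state comparison and the full-refresh option can be combined as in this sketch (the artifact path is a placeholder for a saved production manifest):

    # clone only the models that changed relative to the production manifest, plus their children
    dbt clone --select state:modified+ --state prod-run-artifacts

    # force recreation of clones that already exist in the target schema
    dbt clone --select state:modified+ --state prod-run-artifacts --full-refresh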

Difference Between dbt clone and dbt run

The main differences between dbt clone and dbt run lie in their purpose, mechanism, and performance within a dbt project.

| Feature | dbt clone | dbt run |
| --- | --- | --- |
| Primary Purpose | Create copies of existing models | Transform data by executing model SQL |
| Data Processing | No data processing occurs | Data is processed according to model logic |
| Underlying Mechanism | Leverages database cloning features | Executes SQL queries against the database |
| Performance | Depends on the database cloning implementation | Depends on model complexity and data volume |
| Typical Use Case | Environment creation, testing, CI/CD | Regular data transformation pipelines |
| Storage Impact | Minimal for zero-copy clones | Full storage for materialized results |
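The contrast is visible in the commands themselves; in this sketch, dim_customers is a hypothetical model and prod-run-artifacts/ a saved production manifest:

    # dbt run executes the model's SQL and materializes fresh results (data is processed)
    dbt run --select dim_customers

    # dbt clone copies the already-built relation from the referenced state (no data processing)
    dbt clone --select dim_customers --state prod-run-artifacts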

Benefits of Cloning

dbt clone has many benefits, mainly centered on the following:

  • Migration Flexibility: dbt clone streamlines database migration by enabling compatibility testing in cloned environments.
  • Modular Workflows: It allows developers to source production data for unmodified upstream models instead of recreating them.
  • Rapid Onboarding: It allows new team members to clone production schemas quickly, rather than waiting for full model rebuilds.
  • Cost Effectiveness: dbt clone leverages zero-copy cloning to create models without replicating data, reducing storage costs and speeding up environment setup.
  • Safe Development: It prevents data drift in production environments by allowing modifications to be tested in an isolated clone.
  • Collaboration and Validation: dbt clone allows business stakeholders, data engineers, and analysts to work in a shared environment, promoting collaboration. It also supports centralized access control and detailed documentation, helping ensure compliance with industry standards.

Use Cases of Clone

1. CI/CD Pipeline Testing

Purpose: Enable efficient testing of model changes inside CI/CD pipelines.

How it works: dbt clone lets CI jobs test model changes against cloned, production-like data without the cost of a full refresh, making it easy to reproduce the state of the data after a merge.
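A CI job might look roughly like this sketch, assuming a ci target in profiles.yml that points at a disposable schema and a saved production manifest:

    # 1. recreate the production environment in the pull request's CI schema
    dbt clone --state prod-run-artifacts --target ci

    # 2. build and test only the models changed in the pull request
    dbt build --select state:modified+ --state prod-run-artifacts --target ci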

2. Incremental Models

Purpose: Testing incremental models without a full refresh.

How it works: dbt clone duplicates the existing incremental models to simulate the state of the data after merging, so incremental changes can be tested without the cost of a full refresh.
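One possible sketch, assuming the warehouse supports zero-copy cloning and the production manifest is available locally:

    # clone the existing incremental models so the CI run can build on top of them
    dbt clone --select state:modified+,config.materialized:incremental --state prod-run-artifacts

    # run the modified models incrementally, deferring unmodified parents to production
    dbt build --select state:modified+ --defer --state prod-run-artifacts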

3. Staging Warehouses

Purpose: Generating separate environments for testing staging datasets in BI tools.

How it works: A duplicate of the production dataset is made available in the downstream BI tool, allowing safe experimentation without affecting production data.
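As a sketch, assuming a staging target is defined in profiles.yml and the BI tool reads from that target's schema:

    # materialize clones of the production models in the schema the BI tool points at
    dbt clone --state prod-run-artifacts --target staging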

4. Development and Testing

Purpose: Provides a safe sandbox for developers to test their dbt models without disrupting live data.

How it works: By creating a replica of production schemas or tables, it lets developers experiment with code changes and data transformations without risking the integrity of the main data warehouse.
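For instance, a developer could seed a personal sandbox with just the part of the project they are working on (fct_orders and the dev target are assumed names):

    # clone fct_orders and all of its upstream models into the developer's dev schema
    dbt clone --select +fct_orders --state prod-run-artifacts --target dev

    # iterate on the model locally without touching production data
    dbt run --select fct_orders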

5. Blue/Green Deployments

Purpose: Facilitates smooth transitions between development environments and production without downtime or data loss.

How it works: New code changes are tested and validated in a replica of the production environment. Once validation passes, the replicated environment is swapped with production, completing the blue/green deployment.
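A rough outline of the flow, assuming a staging target for the green environment and a warehouse (such as Snowflake) that supports schema swapping:

    # 1. start the green environment from a clone of current production objects
    dbt clone --state prod-run-artifacts --target staging

    # 2. build and test the release candidate on top of the clone
    dbt build --target staging

    # 3. once validation passes, swap the staging schema with production using the
    #    warehouse's native swap (e.g. ALTER SCHEMA ... SWAP WITH ... on Snowflake),
    #    typically via a dbt macro or an orchestration step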

Limitations of Clone

  • Cost Considerations: Although clone helps build efficient CI/CD pipelines, it still incurs overhead in resources and model building, which can lead to higher costs in cloud warehouses.
  • Orchestration and Scheduling: Orchestration and scheduling need to be handled by external tools, as clone does not provide built-in advanced scheduling or orchestration capabilities.
  • Batch Processing Focus: clone is primarily designed for batch processing, making it less suitable for real-time data pipelines where instant data availability is critical.
  • Warehouse Support: clone relies on zero-copy cloning, a feature not supported by all data warehouses; on unsupported platforms it falls back to creating views rather than true clones.

Considerations of Clone

  • Cost Optimization: dbt clone can help optimize costs by enabling systematic CI/CD pipelines and reducing the need for full refresh builds.
  • Integration with Other Tools: Integrating dbt clone with other tools for monitoring, scheduling, and orchestration ensures a smooth and reliable data workflow.
  • Testing Incremental Models: dbt clone supports testing incremental models by duplicating them as the first step in a CI job, avoiding expensive full refresh builds in warehouses that support zero-copy cloning.
  • CI/CD Pipelines: dbt clone is a valuable tool for building CI/CD pipelines when combined with testing and documentation steps, ensuring that data changes are safe to deploy.
  • Blue/Green Deployments: It can facilitate blue/green deployments by creating a staging dataset that is tested and promoted to production only if all validations pass, reducing both downtime and deployment risk.

Best Practices for Using dbt Clone Effectively

Practices like these help keep cloning efficient while preventing common pitfalls. Companies that apply them typically improve production stability, reduce development cycle times, and lower testing costs.

The practices include the following:

  • Strategic clone naming conventions (see the sketch after this list)
  • Lifecycle management
  • Cost control measures
  • Development workflow optimization
  • CI/CD execution
  • Security and governance
  • Performance optimization
  • Validation and lineage
  • Platform-specific optimization
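As a small sketch of naming and lifecycle conventions, assuming a ci target whose schema is read from an environment variable via env_var() in profiles.yml (the variable and schema name are hypothetical):

    # give each set of clones a predictable, per-pull-request schema name
    export DBT_CI_SCHEMA="dbt_ci_pr_1234"

    # the 'ci' target resolves its schema from DBT_CI_SCHEMA
    dbt clone --state prod-run-artifacts --target ci

    # lifecycle management / cost control: drop the schema when the pull request closes,
    # typically from the CI system or a scheduled cleanup job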

Real World Examples of dbt Clone

1. Development Sandboxing at a FinTech Company

Challenge: Analysts needed to test new fraud detection logic without affecting production models.

Solution: Created per-developer clones of the core models with a consistent naming convention, then modified the clones with experimental logic while production remained untouched.

2. CI/CD Pipeline for Healthcare

Challenge: Isolated, production-like test environments were required for 20+ pull requests.

Solution: Duplicated production models using state comparison, ran tests against the cloned subsets, and auto-expired the clones after a set period.

3. Retail Chain’s Promotion Analysis

Challenge: Needed to compare holiday promotion scenarios side by side.

Solution: Created three clones of the sales models, implemented different discount logic in each, and ran comparative analyses.

4. Data Recovery at SaaS Startup

Challenge: Recover the financial reports after an accidental full refresh was run.

Solution: Identified the last good state from the dbt artifacts (dbt docs generate), used Snowflake Time Travel cloning to restore the data, validated data integrity, and updated the production references.

5. Manufacturing Quality Analysis

Challenge: Comparing defect rates across factory locations

Solution: Cloned the production quality models, applied plant-specific thresholds in each clone, and created comparative reports.

Common Clone Issues

  • Clone generation failures
  • Data discrepancy problems
  • Production degradation
  • CI/CD pipeline failures
  • Storage problems
  • Dependency issues
  • Time travel restrictions

Future Enhancements for Clone

These enhancements could elevate clone from a tactical feature to a strategic data management capability. Potential improvements include:

  • Simplified User Experience: Provide more informative output about the cloning process, such as the number of models cloned, any errors encountered, and relevant metadata. Also, introduce tools or workflows to easily track and manage clones.
  • Enhanced CI/CD Integration: The ability to automatically duplicate source data during the build process could streamline development workflows, especially for large projects.
  • Advanced Clone Types: Enable filtering during cloning to selectively clone only specific or sensitive data.
  • Intelligent Clone Management: Support for automatically selecting the optimal cloning method and incorporating time-to-live (TTL) settings directly within dbt.
  • Flexible Clone Standardization: Leverage Snowflake’s zero-copy cloning capabilities to create more structured and cost-effective development environments.

Conclusion

The clone feature has emerged as a transformative capability for modern data teams, fundamentally changing how organizations approach development, testing, and environment management. It leverages zero-copy cloning when supported by the data platform or falls back to creating views when it’s not. By exposing native data cloning features through a standardized interface, dbt clone provides three primary advantages: faster development cycles, improved performance, and enhanced data reliability.

Take it a step further with Hevo Transformer — our powerful dbt Core integration that lets you transform data directly in your warehouse with zero setup. Automate, manage, and scale your transformations effortlessly. Learn more about Hevo Transformer.

FAQs

1. How is a clone different from a deferral?

dbt clone creates real database objects in the warehouse, which is important when you need exact copies of objects for testing or for validation in downstream tools outside of dbt. Deferral, by contrast, only rewrites references inside a model’s compiled SQL and is usually less expensive to run.
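The difference is easiest to see in the commands; the model and path names below are placeholders:

    # deferral: refs to unmodified parents are rewritten to point at production; no extra objects are created
    dbt run --select my_changed_model --defer --state prod-run-artifacts

    # clone: real objects are created in the target schema, so downstream tools can query them directly
    dbt clone --select my_changed_model --state prod-run-artifacts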

2. What is the aim of cloning sources?

The main aim is to ensure that both environments (CI and production) read from the same input data. This is most relevant for CI jobs, to confirm that code changes do not break the transformation process.

3. What does the clone do?

dbt clone duplicates database objects such as tables and views from a source schema to a target schema on platforms like BigQuery, Snowflake, or Databricks.

Asimiyu Musa
Data Engineering Expert

Asimiyu Musa is a certified Data Engineer and accomplished Technical Writer with over six years of extensive experience in data engineering and business process development. Throughout his career, Asimiyu has demonstrated expertise in building, deploying, and optimizing end-to-end data solutions.