A large retail company onboarded three new data engineers. Getting up to speed with the warehouse should have taken hours; it took days. Why? Each new hire had to manually recreate the same set of reference tables just to run models. One engineer accidentally used an outdated file and broke the staging environment. Another spent hours comparing lookup tables across environments to figure out which one was the source of truth.
Then someone suggested, “Why not use dbt seed?”
Within a day, they centralized static data, versioned it, and made it available to every environment. No more errors. No more guessing.
That’s the hidden power of dbt seed. It quietly removes chaos from your analytics workflow by managing the data no one talks about but everyone depends on.
If you’ve ever found yourself hunting for the latest currency conversion file, a product-category map, or ISO country codes, this blog is for you. We’ll walk through what dbt seed is, why it matters, and how to use it effectively, even if you’re just getting started with dbt.
What is dbt seed?
dbt seed is a simple feature that lets you load CSV files into your data warehouse as tables.
These tables are typically reference tables, static data, or configuration mappings, and the CSV files live in your dbt project’s seeds/ directory (data/ was the default before dbt 1.0). When you run dbt seed, dbt loads those CSV files as database tables, making them immediately available for use in models, tests, or macros.
So instead of managing reference data in some dusty corner of an Excel sheet or Google Drive, you keep it version-controlled, tested, and documented, right inside dbt.
# seeds/countries.csv
country_code,country_name
US,United States
DE,Germany
FR,France
With a simple command:
dbt seed
You now have a countries table in your data warehouse.
Why Use dbt seed?
You might wonder: “Why not just upload the data manually once and forget about it?”
Here’s why dbt seed is better.
1. Version Control for Static Data
You track your models in Git. Why not your lookup tables?
Using dbt seed, your reference data lives in the same Git repo as your models. That means:
- Changes are visible in pull requests
- You can roll back if something breaks
- Your teammates see exactly what data changed, and why
A financial services firm may use dbt seed to manage currency conversion rates and mapping tables. With Git as the source of truth, their audit process becomes faster, and change tracking improves transparency across teams.
2. Portability Across Environments
Manual uploads don’t work when you have dev, staging, and production environments. You can’t guarantee consistency.
With dbt seed, you ensure every environment has the same seed data. The same currencies.csv file gets deployed with every run, no matter the environment.
dbt Labs themselves recommend dbt seed as a way to ensure reproducibility across deployments. Whether you’re using dbt Cloud or dbt Core, seeds are environment-agnostic.
3. Faster Onboarding and Collaboration
When you onboard a new data analyst or data scientist, they often get blocked by missing reference data.
With dbt seed, all necessary static data comes preloaded as soon as they clone the repo and run dbt seed.
A logistics startup can reduce onboarding time from two weeks to three days by using dbt seed to load critical mappings for geographies, shipment types, and product categories.
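Under a typical setup (the repo URL is a placeholder and the commands assume a standard dbt Core project), that onboarding flow is just a handful of commands:

```shell
# Clone the analytics project (hypothetical URL)
git clone git@github.com:example-co/analytics-dbt.git
cd analytics-dbt

# Install package dependencies declared in packages.yml
dbt deps

# Load every seed CSV into your development schema
dbt seed

# Build the models that reference the seeded tables
dbt run
```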
4. Ideal for Small, Static Datasets
Not every table belongs in a complex ETL pipeline.
Seed files are perfect for:
- Country or region codes
- Categorical mappings (e.g., product tiers, customer types)
- Feature flags
- Historical reference rates
- Time zones, ISO standards
They don’t need complex orchestration. They just need to exist and be reliable.
dbt recommends seed files for datasets under 1MB–5MB in size. If your seed file is getting bigger than that, consider using a dedicated ingestion pipeline or cloud storage solution.
How does dbt seed work?
Let’s walk through using dbt seed in a real dbt project.
1. Create Your Seed File
Inside your dbt project, add a seeds/ folder if it doesn’t exist (versions before dbt 1.0 used data/ by default).
Add a CSV file:
# seeds/customer_tiers.csv
tier_id,tier_name,min_spend
1,Bronze,0
2,Silver,100
3,Gold,500
4,Platinum,1000
2. Run the Seed Command
From the terminal:
dbt seed
This creates a table in your warehouse (e.g., analytics.customer_tiers).
3. Use the Table in Your Models
-- models/enriched_customers.sql
select
    c.customer_id,
    c.total_spend,
    t.tier_name
from {{ ref('raw_customers') }} c
left join {{ ref('customer_tiers') }} t
    -- note: this matches every tier at or below the customer's spend;
    -- keep only the highest matching tier per customer downstream
    on c.total_spend >= t.min_spend
Just like that, you’ve turned a static CSV into a reliable input for your transformations.
Advanced Features
1. Seeding with Column Types
You can specify column types to control how dbt loads your CSV data. Use dbt_project.yml:
seeds:
  my_project:
    customer_tiers:
      +column_types:
        tier_id: integer
        tier_name: string
        min_spend: float
This prevents type mismatches and ensures your data loads cleanly.
2. Handling Updates and Schema Changes
dbt seed does not load data incrementally: each run truncates the existing table and re-inserts the full CSV contents, so the warehouse table always mirrors the file in your repo.
If you change the seed’s schema (adding, removing, or renaming columns), a plain truncate-and-insert can fail with a column mismatch. Rebuild the table from scratch instead:
dbt seed --full-refresh
This drops and recreates the table, which is the safe path whenever the CSV’s structure, not just its rows, has changed.
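For day-to-day work you rarely need to rebuild everything. dbt’s node selection flags (available in dbt Core 1.0+) apply to seeds too; a quick sketch, using the customer_tiers seed from above:

```shell
# Load a single seed instead of all of them
dbt seed --select customer_tiers

# Rebuild that one seed from scratch after a schema change
dbt seed --select customer_tiers --full-refresh
```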
When Not to Use dbt seed?
While dbt seed can be a lifesaver for managing small, static datasets, it’s not the right tool for everything. Misusing it can lead to performance issues, bloated repositories, or unnecessary complexity. Here are the situations where you should avoid using dbt seed, and what to do instead.
1. When the Dataset Is Too Large
If your CSV is larger than 5–10MB or has tens of thousands of rows, it’s too big for dbt seed.
Why?
- Large seed files slow down dbt seed runs.
- Version control systems like Git aren’t designed for huge CSVs.
- Team members may struggle with merge conflicts and slow pulls.
Ingest the data through a proper ELT pipeline (like Hevo or a custom Python/SQL script). Then transform it in dbt as a source, not a seed.
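Once the data lands via a pipeline, you would declare it in a sources YAML file and reference it with source() instead of ref(). A minimal sketch, assuming the pipeline loads into a schema called raw (all names here are illustrative):

```yaml
# models/sources.yml
version: 2

sources:
  - name: raw
    schema: raw          # the schema your ingestion pipeline writes to
    tables:
      - name: campaign_performance
```

Models then read it with {{ source('raw', 'campaign_performance') }} rather than {{ ref('campaign_performance') }}, keeping the large table out of Git entirely.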
For instance, a marketing analytics team initially seeded a 200k-row campaign_performance.csv file. GitHub choked, dbt slowed to a crawl, and model runs broke across team members. They eventually migrated the data to a BigQuery external table and referenced it via source(), which resolved the issue.
2. When the Data Is Updated Regularly
If your static table gets new rows daily, weekly, or even monthly from an external system or user input, dbt seed is not your best friend.
Why?
- You have to manually update the CSV.
- No built-in automation or connectors.
- It breaks the idea of “static” in static data.
Automate ingestion using your data pipeline or workflow orchestration tool (like Airflow or Dagster), and treat it as a data source.
An e-commerce team tracked shipping zones. Initially, it was seeded as shipping_zones.csv. But because zones changed frequently due to new fulfillment centers, updating the file became a bottleneck. They switched to pulling updates from their internal logistics system via an ETL tool.
3. When Multiple Teams or Systems Need to Write to It
Seed files are read-only once loaded into your data warehouse; they’re not meant to be updated by applications, external systems, or other teams.
Why?
- You can’t insert/update/delete rows from your app.
- There’s no feedback loop from downstream systems.
- You risk version drift if someone makes local changes.
Store shared reference data in a dedicated database or central API, then sync it into the warehouse through a proper ingestion pipeline.
4. When You Need Granular Access Controls or Privacy
CSV-based seeds are flat files. They don’t offer row-level security, encryption, or fine-grained access controls.
Why?
- Any team member with Git access can see the full file.
- You can’t mask sensitive columns like PII.
Use your data platform’s built-in access control (e.g., Snowflake’s row access policies, BigQuery’s IAM) and load sensitive datasets using governed ingestion tools.
5. When Business Users Want to Edit the Data
If your reference data comes from the business team and changes often, like product categorizations, region mappings, or A/B test flags, CSV files in Git aren’t a user-friendly workflow.
Why?
- Business users don’t want to make pull requests.
- Risk of formatting errors.
- Slows down iteration cycles.
Build an internal tool or use a low-code solution (like Retool or Google Sheets + Apps Script) to let non-technical users update data, and then sync that data to your warehouse programmatically.
You can even schedule a job to export the updated table back to Git, if versioning is critical.
Best Practices of dbt seed
File Structure and Organization
- Use the default seeds/ directory (or configure seed-paths in dbt_project.yml) to store all CSV seed files. For clarity, organize seeds into subfolders by domain or use case (e.g. seeds/sales/country_codes.csv), and use clear, consistent snake_case file names without spaces.
- Configure seed settings in YAML. For example, create a seeds/schema.yml or seeds/properties.yml file where you list each seed’s name and any config (such as column types, delimiters, or quoting). This ensures consistent table schemas (e.g. using column_types to preserve leading zeros in ZIP codes). You can also set global configs (like target schema) under the seeds: key in dbt_project.yml.
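Putting both bullets together, a project-level seed config might look like this (the project, folder, and column names are illustrative):

```yaml
# dbt_project.yml
seeds:
  my_project:
    +schema: reference          # land all seeds in a dedicated schema
    sales:
      zip_codes:
        +column_types:
          zip_code: varchar(5)  # preserve leading zeros
```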
Data Validation and Testing
- Document and test seeds in YAML. Treat seeds like models: add schema YAML entries for each seed with descriptions and column tests. For example, a seeds/schema.yml may list a seed name, description, and column tests (e.g. unique, not_null, accepted_values) to enforce data quality. Running dbt test will then validate seeded tables just like any model.
- Apply generic or custom tests. Use dbt’s built-in tests (unique, not_null, relationships, accepted_values, etc.) on seed columns. Consider community test packages like dbt-expectations (dbt’s Great Expectations-inspired package) to add rich assertions (row counts, pattern checks, etc.) without writing raw SQL. The key is to catch stale or invalid reference data early.
- Validate seed data at load time (if needed). If seeds come from external sources, consider pre-load checks. For example, use external schema-validation tools or Python-based checks before committing the CSVs, to prevent obvious schema issues. Automated CI steps should include dbt test after dbt seed to ensure integrity.
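The pre-load check mentioned above can be as small as a standalone script run in CI before dbt seed. A minimal sketch in Python using only the standard library (the expected column list is an assumption about your project):

```python
import csv
import io

def validate_seed(csv_text, expected_columns):
    """Check a seed CSV's header and row widths before it is committed.

    Returns a list of human-readable problems; an empty list means clean.
    """
    problems = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    header = rows[0]
    if header != expected_columns:
        problems.append(f"header mismatch: expected {expected_columns}, got {header}")
    # Data rows start on physical line 2 of the file
    for i, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(f"line {i}: expected {len(header)} fields, got {len(row)}")
    return problems

# A well-formed seed passes; a ragged one is flagged
good = "tier_id,tier_name,min_spend\n1,Bronze,0\n2,Silver,100\n"
bad = "tier_id,tier_name,min_spend\n1,Bronze\n"
print(validate_seed(good, ["tier_id", "tier_name", "min_spend"]))  # []
print(validate_seed(bad, ["tier_id", "tier_name", "min_spend"]))
```

Wiring this into CI before dbt seed, and running dbt test after it, catches both malformed files and stale values before they reach the warehouse.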
Final Thoughts
dbt seed might seem like a small feature, but when used right, it can drive huge clarity and consistency in your analytics workflow. Whether you’re loading mapping tables, reference lists, or small static datasets, seeds let you treat data like code — versioned, tested, documented, and deployed with confidence.
But don’t stretch it beyond its purpose. If your seed files start acting like real-time data sources, it’s time to shift to a more scalable ingestion strategy.
Use dbt seed for what it does best: empowering your team with clean, trustworthy reference data, managed right inside your analytics repo.
Need to scale beyond static seed files? Hevo Transformer integrates seamlessly with dbt Core, letting you transform and manage large-scale, real-time data pipelines with the same level of control and reliability.
Frequently Asked Questions
1. What is dbt seed used for?
dbt seed is used to load small, static reference data (like lookup tables or mappings) from CSV files into your data warehouse. It allows teams to manage and version control static datasets just like any other dbt model.
2. When should I use dbt seed?
Use dbt seed for reference data that doesn’t change frequently. Good examples include country codes, state abbreviations, and product categories. Avoid using it for large or frequently changing datasets, as it’s not optimized for those use cases.
3. Can I use dbt seed for large datasets?
No, dbt seed is best for small datasets. Large datasets can significantly slow down your pipeline, and dbt recommends using other ETL tools to handle big data efficiently.
4. How do I update seed data?
To update a seed, simply modify the CSV file and re-run dbt seed. If you’ve changed the schema (e.g. added or removed columns), use the --full-refresh flag to rebuild the table.
5. Can I test the data in my seed files?
Yes! Use dbt’s testing framework. You can define tests for seed data in the schema.yml file, just like you would for regular dbt models. This helps ensure that your reference data remains valid.