Data analysts and engineers often struggle with writing and managing complex SQL queries.
Challenges like debugging scripts with 200+ lines of messy code, dealing with code duplication, and understanding each other’s SQL logic are a major headache for SQL developers.
These queries grow more complicated as data management needs expand. Engineers spend hours deciphering queries, subqueries, joins, and nested logic that quickly turns into technical debt.
To tackle this, teams need a better approach to writing SQL, and modular SQL is that approach.
It breaks SQL into understandable, reusable blocks of code using Common Table Expressions (CTEs), views, and functions. These logical blocks can be modified, debugged, and maintained independently without disturbing the rest of the code.
Data teams spend less time debugging and untangling messy SQL, which improves data transformation and collaboration throughout the data pipeline.
This article covers the fundamentals of how to write modular SQL using practical examples, tools, and best practices. Learn how to use CTEs to break large queries into manageable code and how to create reusable logic with views and functions.
Understanding Modular SQL
Nearly 30% of SQL execution time is wasted on managing inefficient joins and indexing.
Modular SQL mitigates this by separating logic into reusable, testable blocks that are easier to optimize.
What is modular SQL?
Modular SQL is an organized approach to writing queries in a modern data architecture.
It extracts logical, reusable components from lengthy scripts and organizes them using CTEs, views, user-defined functions (UDFs), and external transformation tools.
These elements let SQL developers encapsulate distinct pieces of logic into manageable blocks of code, which can then serve as building blocks for more complex queries.
Using modular SQL not only improves readability and maintenance of code but also simplifies testing, debugging, and reusability across data pipelines.
Why does it matter for data teams?
Traditional SQL queries get the job done, but as they grow longer they often become disorganized and hard to follow.
This increases the risk of logical errors and makes maintenance time-consuming. Modular SQL overcomes these difficulties with a more organized and scalable approach:
1. Enhanced readability
When developers write modular SQL, the code becomes easier to understand, debug, and modify, because each query follows a clear, step-by-step structure.
2. Reusability
Data teams can create CTEs, functions, and views that can be reused in subsequent queries, thus minimizing the need to copy and paste the same logic across multiple queries and also eliminating redundancy.
3. Team collaboration
Developers mostly struggle to understand or modify a colleague’s SQL code.
Implementing modular SQL lets team members work on separate SQL modules independently, without stepping on each other’s code.
4. Simplified testing
Developers may test and verify each block independently, which helps to detect errors early and provides consistent results.
In simple terms, modular SQL is necessary for maintaining clean, adaptable, and ready-to-use SQL code blocks, especially in collaborative data teams.
The 3-level overview
Modular SQL can be divided into three levels, each of which focuses on readability, reusability, and scalability of queries.
At Level 1, developers use Common Table Expressions (CTEs) to break complicated SQL into logical, in-query sections, making the code easier to comprehend and test.
At Level 2, developers create persistent, reusable logic at the database level with views and user-defined functions.
At Level 3, data engineers use modern transformation tools like Hevo to organize and structure SQL transformations within automated data pipelines.
Now, let’s understand all three levels in detail:
Level 1: CTEs for Query Modularity
CTEs are foundational tools for implementing modular SQL, enabling developers to streamline complex queries. The CTE fundamentals include:
- Common Table Expressions are temporary result sets defined at the start of a query using the WITH clause.
- Unlike permanent views, CTEs exist just during the query’s execution and are only accessible within that query.
- They behave similarly to subqueries, but offer an easier, more readable way to manage SQL by breaking complicated logic into smaller blocks.
Here are three fundamental implementation patterns that address distinct data processing scenarios with enhanced readability and maintenance:
1. Simplifying complex queries with CTEs
CTEs divide complex SQL queries into logical, manageable building blocks, so you don’t have to write a single query with multiple nested subqueries.
The workflow:
- Each CTE is a defined step in the data pipeline. The pipeline filters data in one CTE and then uses its result in the subsequent CTE.
- Each CTE has a descriptive name to make the transformation purpose clearer. This helps in understanding the transformation logic instantly.
- You can test each CTE during development and reference it multiple times to avoid duplicated logic.
For example:
You want to analyze customer buying behavior by segmenting them based on the orders they’ve placed, but only counting non-returned orders. Here’s how you can do it:
WITH import_orders AS (
    SELECT * FROM orders
),
aggregate_orders AS (
    SELECT
        customer_id,
        COUNT(order_id) AS count_orders
    FROM import_orders
    WHERE status NOT IN ('returned', 'return pending')
    GROUP BY customer_id
),
segment_users AS (
    SELECT
        customer_id,
        count_orders,
        CASE
            WHEN count_orders >= 3 THEN 'super_buyer'
            WHEN count_orders = 2 THEN 'regular_buyer'
            ELSE 'single_buyer'
        END AS buyer_type
    FROM aggregate_orders
)
SELECT * FROM segment_users;
Explanation:
- The first CTE (import_orders) simply loads the orders.
- The second CTE (aggregate_orders) filters out returned orders and counts orders per customer.
- The third CTE (segment_users) segments customers based on their order count.
- The final SELECT pulls the results from the last CTE.
2. Recursive CTEs for hierarchical data
Recursive CTEs handle the extraction and navigation of hierarchical data such as file systems, category trees, and organizational charts.
The workflow:
A recursive CTE has two distinct parts:
- Anchor member: This is the topmost or starting point of the hierarchy.
- Recursive member: The CTE repeatedly joins the base table through the recursive logic to navigate through the hierarchy.
For example:
Suppose you have an employees table that records each employee along with a reference to their manager (for example, employee_id, name, and manager_id columns).
To traverse the management structure for every employee, you can use a recursive CTE:
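A minimal sketch, assuming the table shape above and PostgreSQL-style WITH RECURSIVE syntax (some engines, such as SQL Server, use a plain WITH for recursive CTEs):

WITH RECURSIVE employee_hierarchy AS (
    -- Anchor member: employees with no manager sit at the top of the hierarchy
    SELECT
        employee_id,
        name,
        manager_id,
        1 AS hierarchy
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive member: attach each employee to their manager's row, one level deeper
    SELECT
        e.employee_id,
        e.name,
        e.manager_id,
        eh.hierarchy + 1 AS hierarchy
    FROM employees e
    JOIN employee_hierarchy eh
        ON e.manager_id = eh.employee_id
)
SELECT employee_id, name, manager_id, hierarchy
FROM employee_hierarchy
ORDER BY hierarchy, employee_id;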
The query returns the full management structure, with the hierarchy column indicating each employee’s depth in the hierarchy.
Note: A recursive CTE can run indefinitely if its recursive member never stops producing rows. Always make sure the recursion terminates, for example by relying on a clean hierarchy or by capping the recursion depth.
3. Referencing result sets across multiple operations
CTEs define a result set once and reference it multiple times within the same query. This prevents query duplication and keeps SQL queries organized and easy to navigate.
The workflow:
- Multiple use: The CTE written with the WITH clause at the beginning acts like a temporary table that can be used in subsequent queries.
- No repetition: Use the CTE’s name wherever calculations and filters depend on the same underlying data, instead of writing the same subquery logic again.
For example,
You want to calculate the total and average sales in an e-commerce database for each product category. Here’s how to do it:
-- Define a CTE to calculate total and average sales for each category
WITH CategorySales AS (
    SELECT Category, SUM(SalesAmount) AS TotalSales, AVG(SalesAmount) AS AverageSales
    FROM Products
    GROUP BY Category
)
-- Select category, total sales, and average sales from the CTE
SELECT Category, TotalSales, AverageSales
FROM CategorySales
WHERE TotalSales > 5000;
As we can see, the CategorySales CTE is defined once at the top and then referenced in the final SELECT.
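If you need the same result set more than once, simply repeat the CTE’s name. A minimal sketch, reusing the CategorySales CTE above to compare each category against the overall average:

WITH CategorySales AS (
    SELECT Category, SUM(SalesAmount) AS TotalSales, AVG(SalesAmount) AS AverageSales
    FROM Products
    GROUP BY Category
)
SELECT cs.Category, cs.TotalSales, bench.OverallAvg
FROM CategorySales AS cs
CROSS JOIN (
    -- Second reference to the same CTE: an overall benchmark without repeating the aggregation
    SELECT AVG(TotalSales) AS OverallAvg FROM CategorySales
) AS bench
WHERE cs.TotalSales > bench.OverallAvg;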
Best Practices for CTEs
Deploying CTEs goes beyond breaking down complex queries. The best practices for CTEs include:
- Focused CTEs: Don’t overload CTEs with unnecessary logic. Split complex logic into multiple CTEs to accomplish a single, well-defined task.
- Descriptive names: Assign self-documenting names to CTEs, such as SalesSummary or ProductSales, to make queries easy to understand.
- Defined nesting: While CTEs can be layered, excessive chaining reduces readability. Aim for 3–5 CTEs per query unless the complexity genuinely requires more.
- Lean CTEs: Select only the columns needed for downstream logic or the final result. Extra columns add clutter and can hurt performance, especially with large datasets.
Case Study
For better illustration, let’s take a real-world example from Airbnb listings.
Before: Nested query
The traditional approach analyzes Airbnb listings by city and neighbourhood through deeply nested subqueries, which makes the logic hard to follow:
SELECT *
FROM
(
    SELECT DISTINCT
        neighbourhood,
        COUNT(DISTINCT id) AS num_apartments,
        COUNT(DISTINCT host_id) AS num_hosts,
        ROUND(CAST(COUNT(DISTINCT id) AS FLOAT) / COUNT(DISTINCT host_id), 2) AS mean_apartments_by_host,
        ROUND(AVG(price), 2) AS AVG_price,
        ROUND(AVG(review_cleanliness), 2) AS AVG_cleanliness,
        ROUND(AVG(review_location), 2) AS AVG_location
    FROM bcn_airbnb_dataset
    GROUP BY 1
    ORDER BY 2 DESC
) AS T1;
Issues:
- It is difficult to extend or debug the query.
- The addition of new metrics and comparisons increases complexity.
- Multiple recalculations of the city average.
After: Modular query with CTEs
Each logical component is isolated in its own CTE, which improves readability and clarity and compartmentalizes the query.
SELECT
    Structure.neighbourhood,
    T1.num_apartments,
    100 * T1.num_apartments / T2.num_apartments AS comp_apartments,
    T1.num_hosts,
    100 * T1.num_hosts / T2.num_hosts AS comp_hosts,
    T1.mean_apartments_by_host,
    T1.AVG_price,
    CASE
        WHEN T1.AVG_price > T2.AVG_price THEN 'Above'
        WHEN T1.AVG_price = T2.AVG_price THEN 'Equals'
        WHEN T1.AVG_price < T2.AVG_price THEN 'Below'
    END AS Price_comparison_to_city,
    T1.AVG_cleanliness,
    CASE
        WHEN T1.AVG_cleanliness > T2.AVG_cleanliness THEN 'Above'
        WHEN T1.AVG_cleanliness = T2.AVG_cleanliness THEN 'Equals'
        WHEN T1.AVG_cleanliness < T2.AVG_cleanliness THEN 'Below'
    END AS Clean_comparison_to_city,
    T1.AVG_location,
    CASE
        WHEN T1.AVG_location > T2.AVG_location THEN 'Above'
        WHEN T1.AVG_location = T2.AVG_location THEN 'Equals'
        WHEN T1.AVG_location < T2.AVG_location THEN 'Below'
    END AS Location_comparison_to_city
FROM Neighbourhood_List AS Structure
LEFT JOIN Data_by_neighbourhood_BCN AS T1
    ON Structure.neighbourhood = T1.neighbourhood
CROSS JOIN Data_by_city_BCN AS T2;  -- single-row CTE of city-wide aggregates (name assumed; see the sketch below)
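The CTEs referenced above would be defined once in a WITH clause at the top of the query. A minimal sketch of what those definitions might look like, assuming the same bcn_airbnb_dataset table (Data_by_city_BCN is an assumed name for the city-level aggregate CTE):

WITH Neighbourhood_List AS (
    -- One row per neighbourhood, used as the join backbone
    SELECT DISTINCT neighbourhood
    FROM bcn_airbnb_dataset
),
Data_by_neighbourhood_BCN AS (
    -- Per-neighbourhood metrics (the same logic as the original nested subquery)
    SELECT
        neighbourhood,
        COUNT(DISTINCT id) AS num_apartments,
        COUNT(DISTINCT host_id) AS num_hosts,
        ROUND(CAST(COUNT(DISTINCT id) AS FLOAT) / COUNT(DISTINCT host_id), 2) AS mean_apartments_by_host,
        ROUND(AVG(price), 2) AS AVG_price,
        ROUND(AVG(review_cleanliness), 2) AS AVG_cleanliness,
        ROUND(AVG(review_location), 2) AS AVG_location
    FROM bcn_airbnb_dataset
    GROUP BY neighbourhood
),
Data_by_city_BCN AS (
    -- City-wide metrics, calculated once instead of being recomputed per row
    SELECT
        COUNT(DISTINCT id) AS num_apartments,
        COUNT(DISTINCT host_id) AS num_hosts,
        ROUND(AVG(price), 2) AS AVG_price,
        ROUND(AVG(review_cleanliness), 2) AS AVG_cleanliness,
        ROUND(AVG(review_location), 2) AS AVG_location
    FROM bcn_airbnb_dataset
)
-- ...followed by the SELECT shown above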
Result:
- Each CTE addresses a distinct part of the logic.
- New metrics can be added by modifying the relevant CTE in a straightforward way.
- The final query reads like a set of logical steps.
Level 2: Views and Functions
A view is a virtual table defined by a stored SQL query. It can be built from one or more tables, depending on the query used to define it.
You can use it in the following scenarios:
- Centralized logic: When the same query logic is used across various dashboards, reports, and APIs, creating a view ensures consistency and prevents duplication. Any updates to the logic only need to be made in one place.
- Data access control: Views expose only the relevant data, which blocks access to sensitive rows and columns. This enables selective access without revealing the underlying data. For instance, you can create a view that omits personal data from an employee table.
- Schema simplification: Views provide a streamlined version of the business-focused schema for end users to eliminate the complexity of the original database structure. This is well-suited for large-scale databases with numerous tables.
Here’s how you can use it:
Creating views
You can create views using the CREATE VIEW statement from a single table, multiple tables, or another view. The basic syntax looks like this:
CREATE VIEW view_name AS
SELECT column1, column2, ...
FROM table_name
WHERE [condition];
Note: Ensure that you have the relevant system privileges (for example, CREATE VIEW) before running the statement.
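For instance, here is a quick sketch of the access-control scenario mentioned earlier, assuming a hypothetical employees table that also contains sensitive salary and ssn columns:

-- Expose only non-sensitive columns to report users
CREATE VIEW employee_directory AS
SELECT employee_id, name, department, email
FROM employees;

-- Consumers query the view instead of the base table
SELECT name, department
FROM employee_directory
WHERE department = 'Finance';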
Updating the view
For a view to be updatable, it must meet these conditions:
- The FROM clause must not reference multiple tables.
- The WHERE clause must not contain subqueries.
- The SELECT list must not contain set functions or set operators, summary (aggregate) functions, an ORDER BY clause, or the DISTINCT keyword.
- All NOT NULL columns from the base table must appear in the view for an INSERT through the view to work.
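The update example below assumes a simple updatable view; a sketch of how it might be defined over a hypothetical CUSTOMERS table:

-- Hypothetical base table: CUSTOMERS(ID, NAME, AGE, ADDRESS, SALARY)
CREATE VIEW CUSTOMERS_VIEW AS
SELECT ID, NAME, AGE
FROM CUSTOMERS;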
The following code block shows how the age of the customer named Ramesh is updated through the view:
UPDATE CUSTOMERS_VIEW
SET AGE = 35
WHERE name = 'Ramesh';
Deleting rows in the view
Deleting rows through a view follows the same rules as updating it. Here’s how you can use the DELETE command to remove specific rows:
DELETE library_books
FROM library_books
JOIN students
    ON students.student_id = library_books.lib_id
WHERE lib_id = 1001;

SELECT * FROM library_books;
The above code block deletes the library_books row matched to student ID 1001 and then returns the remaining rows.
UDFs for Complex Logic
UDFs or User-Defined Functions in SQL are custom routines designed to encapsulate complex and frequently used logic within the database. They are well-suited for modularizing code, maintaining consistency, and ensuring reliability across SQL queries and APIs.
In general, there are three types of UDFs:
1. Scalar UDFs
Scalar UDFs return a single value and operate on a single row. These UDFs are ideal for capturing complex logic and validation checks that can clutter your queries.
2. Table-valued UDFs
Table-valued UDFs or TVFs return a table in their result. TVFs are suitable for returning filtered datasets depending on input parameters, such as dynamic report generation or extraction of a filtered list based on complex criteria.
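A minimal sketch of an inline table-valued function in T-SQL syntax (the orders table and its columns are assumptions):

CREATE FUNCTION fn_getActiveOrders (@customer_id INT)
RETURNS TABLE
AS
RETURN (
    -- Only non-returned orders for the given customer
    SELECT order_id, order_date, status
    FROM orders
    WHERE customer_id = @customer_id
      AND status NOT IN ('returned', 'return pending')
);

The function can then be queried like a table, for example: SELECT * FROM dbo.fn_getActiveOrders(1001);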
3. Aggregate UDFs
These functions perform aggregations on data, similar to built-in aggregate functions like SUM or AVG. For instance, you can create a custom UDF to perform specialized aggregations, such as calculating a geometric mean or a weighted average.
Using UDFs for complex logic
Here’s how you can use UDFs:
1. Encapsulation of complex logic
UDFs help you to wrap intricate calculations into a single, reusable function. This is useful when the logic involves multiple steps or can change over time.
For example,
Suppose you want to calculate a user’s score from the sum of votes in a user_votes
table. Instead of doing repetitive calculations, you can just create a UDF like this:
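A sketch of what such a function might look like in T-SQL syntax (the user_votes table and its vote_value column are assumptions):

CREATE FUNCTION fn_getUserScore (@user_id INT)
RETURNS INT
AS
BEGIN
    DECLARE @score INT;

    -- Sum all votes recorded for this user
    SELECT @score = SUM(vote_value)
    FROM user_votes
    WHERE user_id = @user_id;

    -- Treat users with no votes as a score of 0
    RETURN ISNULL(@score, 0);
END;
-- Note: SQL Server requires a schema prefix when calling scalar UDFs, e.g. dbo.fn_getUserScore(user_id).

You can then call it from any query: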
SELECT user_id, fn_getUserScore(user_id) FROM users
Now, if the logic changes, you only need to update the UDF, not every query that uses it.
2. Code reusability
By defining a UDF once, you can use it across various queries and applications multiple times. This avoids code duplication and streamlines maintenance.
For example,
You want to format phone numbers in reports without repeating the formatting logic. Create a UDF:
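A sketch of one possible implementation in T-SQL syntax (the 10-digit formatting rule is an assumption):

CREATE FUNCTION formatPhone (@phone VARCHAR(32))
RETURNS VARCHAR(32)
AS
BEGIN
    DECLARE @digits VARCHAR(32) = '';
    DECLARE @i INT = 1;

    -- Keep only the digit characters from the input
    WHILE @i <= LEN(@phone)
    BEGIN
        IF SUBSTRING(@phone, @i, 1) LIKE '[0-9]'
            SET @digits = @digits + SUBSTRING(@phone, @i, 1);
        SET @i = @i + 1;
    END;

    -- Fall back to the raw value if it is not a 10-digit number
    IF LEN(@digits) <> 10
        RETURN @phone;

    RETURN '(' + LEFT(@digits, 3) + ') ' + SUBSTRING(@digits, 4, 3) + '-' + RIGHT(@digits, 4);
END;

Then call it wherever formatted numbers are needed: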
SELECT name, formatPhone(phone) AS formatted_phone FROM customers;
You can use this UDF for queries requiring formatted phone numbers without any code duplication.
3. Optimizing performance
UDFs can improve performance by centralizing logic and reducing the need for repeated calculations or data transfers between the application and the database.
For example,
Create a UDF to extract keywords from a text column:
SELECT id, extractKeywords(description) AS keywords FROM documents;
The database processes the logic internally, which reduces data transfer and leverages native SQL Server data types for efficiency.
4. Extending SQL’s capabilities
As discussed earlier, you can create custom logic that goes beyond SQL’s built-in functions. This is effective for custom aggregations and advanced data transformations.
For example,
You need a custom aggregation, like calculating a geometric mean, which isn’t available natively. The UDF will look like this:
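A sketch matching the usage below, computing the geometric mean of two values per row in T-SQL syntax (a true custom aggregate over many rows would need engine-specific features such as CREATE AGGREGATE):

CREATE FUNCTION geometricMean (@value1 FLOAT, @value2 FLOAT)
RETURNS FLOAT
AS
BEGIN
    -- Geometric mean of two values: the square root of their product
    RETURN SQRT(@value1 * @value2);
END;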
SELECT geometricMean(value1, value2) AS geo_mean FROM data;
Level 3: Modern Transformation Tools
Modern transformation tools are software solutions designed to streamline the process of converting data from one structure or format to another. These tools are crucial for businesses to organize workflows and stay competitive.
Beyond traditional SQL
Traditional SQL solutions excel at filtering, querying, and standard data transformation within relational databases. However, these solutions fall short with diverse sources, large-scale datasets, and real-time processing needs. Organizations demand tools that go beyond classic SQL and manual ETL scripts.
Common challenges with traditional SQL are:
- Lack of native support for testing or versioning.
- Hard to modularize, with limited reusability.
- Not ideal for cross-platform data (APIs, files, streams, or cloud).
- Inefficient in managing complex logic or deeply nested queries.
- Poor collaboration within teams.
Modern tooling benefits
The modern data stack includes key features to overcome the above challenges:
- Modularity
- Collaboration
- Cloud-native
- Automated workflows
The key benefits of modern tools include:
Modularity
Modern tools like dbt encourage modular SQL development, robust documentation, and version control. This helps data teams develop reusable SQL, for example with dbt macros that can be shared across environments, as shown in the sketch below.
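A dbt macro wraps a reusable SQL snippet that any model can call; a minimal sketch (the macro, model, and source names are assumptions):

-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}

-- models/stg_payments.sql: the macro expands into plain SQL at compile time
select
    payment_id,
    {{ cents_to_dollars('amount_cents') }} as amount_dollars
from {{ source('raw', 'payments') }}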
Orchestration
Advanced tools include built-in orchestration that runs transformations in the correct order, streamlines complex processes, automates error handling, and minimizes manual intervention.
AI integration
Features like automated schema mapping, support for large language models (LLMs), and AI-driven pipeline creation and management increase the tool’s adaptability. The tools also facilitate advanced analytics, machine learning, and automated data processing, which enables businesses to build smarter and faster workflows.
Flexibility
Modern tools offer pre-built connectors and wide compatibility, making it easy to pull data from multiple sources such as APIs, databases, and cloud-based services. Seamless integration promotes scalability and lets workflows handle large datasets.
User-friendly UI
Platforms like Hevo provide an intuitive no-code interface that allows users and analysts to build and manage data pipelines without any technical expertise. This enhances the self-service factor, empowering teams to work faster and minimize reliance on specialized engineering resources.
Best Practices & Implementation of Modular SQL
Here are the three golden rules to implement modular SQL:
1. Consistent naming
Always use descriptive, clear, and standardized names for CTEs, views, and functions. Consistent naming makes your SQL code self-documenting, easy to navigate, and minimizes data-related risks during collaboration.
2. Prioritizing readability
Organize SQL so that each transformation step is logically separated. Use whitespace deliberately, and break complex queries into smaller, focused modules to make the code easy to read, debug, and maintain.
3. Debugging and optimization
Design modular SQL so that key components are easy to test, audit, and optimize. Isolate logic in reusable modules, and use tools that support dependency tracking and result validation to ensure correctness and performance.
How to Write Modular SQL: Step by Step
Implementing modular SQL with modern tools requires a structured, layered approach.
Here’s a step-by-step guide:
Step 1: Transfer and prepare SQL
Transfer your existing SQL scripts into your modular SQL tool’s project directory. Check if the imported code runs seamlessly in your preferred data warehouse. If necessary, adjust SQL syntax to align with your target platform requirements.
Step 2: Configure data sources
For every raw table, declare a source mapping. For instance, in dbt, use the source() macro to explicitly define and document your data origins.
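A minimal sketch of the pattern (the source, table, and model names are assumptions): the source is declared once in a .yml file and then referenced from models:

-- models/staging/stg_orders.sql
-- Assumes a source declared in a .yml file, for example:
--   sources:
--     - name: raw
--       tables:
--         - name: orders
select
    order_id,
    customer_id,
    status
from {{ source('raw', 'orders') }}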
Step 3: Refactor into modular layers
Create staging models to clean, standardize, and transform data as a consistent foundation for downstream logic. Move repeated joins, aggregations, and reusable logic into intermediate models. Reserve final models for analytics, building on the upstream staging and intermediate models.
Step 4: Modularize transformations
Create a CTE for every source or input table to make transformations explicit. Use macros for repeatable SQL snippets to further enhance modularity and reduce duplication.
Step 5: Document
Apply style guides, including lowercase keywords and consistent indentation. Set configuration options, tags, and hooks as required for your workflow. Document model purposes, dependencies, and business logic, both in the code itself and in standard documentation formats.
Step 6: Test and optimize
Compare results from your new modular models with the original queries to ensure accuracy. Use built-in testing tools to validate data integrity, freshness, and model performance.
Move heavy filters and aggregations to intermediate layers and keep final models lean. Rename CTEs and models for future readability.
Common pitfalls to avoid
Although modular SQL offers significant benefits, some common pitfalls can degrade performance and clarity:
- Overusing CTEs: Excess chained CTEs hurt performance and readability.
- Skipping isolated testing: An error in one module can impact downstream logic.
- Lack of documentation: Undocumented logic is hard to operate and maintain.
- No version control: Not tracking changes in shared modules breaks pipelines.
- Over-abstraction: Overusing modularization makes tracing logic more difficult.
- Poor maintenance: Unchecked modules can fail over time.
Streamline Modular SQL With Hevo
Hevo Transformer is a transformation engine built on top of dbt within the Hevo Data platform. Users can perform complex data transformations directly in their pipeline without extensive coding or a manual workflow.
The transformation engine is tightly integrated with Hevo’s data ingestion and pipeline automation capabilities, which offer a seamless end-to-end ETL experience.
Here’s how it works:
- Unified data pipeline: The platform automates the process of extracting data from diverse sources and loading it into your preferred data warehouse, like Snowflake, BigQuery, and Redshift.
- Transformation: After loading the data into the warehouse, Hevo Transformer cleans, aggregates, joins, and organizes datasets using a SQL-based workflow powered by dbt Core.
- Automation: The transformer automates the building, testing, and deployment of transformation workflows. Moreover, it supports automated scheduling, integration with data warehouses, and error-handling to eliminate manual efforts.
Future-Proof Your SQL
Modular SQL isn’t just a best practice; it’s a technique that results in cleaner, more maintainable, and scalable data workflows. When you break down complex logic, readability and performance improve, debugging gets easier, and queries become reusable across multiple projects.
Whether you’re using CTEs, views, or dbt tools, leveraging modular SQL helps you adapt to evolving data needs. Hevo Data offers seamless transformation capabilities to structure modular SQL designs and automate schema upgrades.
To get a closer look, sign up for Hevo’s 14-day free trial now.