In modern data analytics, organizations utilize data warehouses to store vast amounts of historical data, ensuring data analysts can easily access critical information. The historical data is vital for deriving actionable insights and driving effective decision-making. However, while most attributes or dimensions of data in a warehouse remain static, certain dimensions like customer addresses, product specifications, or employee designations evolve over time.
The evolving dimensions are also known as slowly changing dimensions (SCDs) in data science. SCDs help businesses retain their original records while also tracking changes over time. As a result, SCDs maintain the integrity of data warehouses and ensure that the insights derived from the stored data take into account the data’s slowly evolving nature. Let’s first understand what dimensions and measures are in data science.
Dimensions and Measures in Data Science
In data science, dimensions and measures are significant elements of data analysis.
Dimensions comprise qualitative values, such as names, dates, or geographical data. You can use dimensions to categorize, segment, and reveal essential details in your data. For instance, sales data can include different dimensions, such as Region, Product Type, or Time Period. The dimensions affect the level of detail in examining the data.
On the other hand, measures are the numeric or quantitative values you can measure. These metrics, within the scope of the dimensions, can be evaluated, compared, or aggregated for data analysis. For example, with sales data, some associated measures include Total Sales, Profit Margin, or Number of Units Sold.
Dimensions and measures in data science form the basis of data visualization and analysis. While dimensions provide a structured view of the data, measures provide statistical information for data-driven decision-making.
What are Slowly Changing Dimensions in Data Science?
Slowly changing dimensions in data science refers to a data warehousing concept. SCDs are dimensions that change over time but at a slow pace. These dimensions typically store and manage both historical and current data in a data warehouse.
SCDs are considered critical in tracking the history of dimension records. These dimensions are important for tracking changes, especially attributes that change infrequently or slowly over time but not at predictable intervals. Examples of SCDs include changing data about an entity, such as the pricing of a product, customer address, or warehouse location.
Slowly changing dimensions in data science are important for tracking changes in data over time for accurate analysis. Data practitioners can use SCDs for employee performance analysis, sales analysis, and inventory management.
Types of Slowly Changing Dimensions
There are five types of slowly changing dimensions in data science.
SCD Type 0: Retain Original
Type 0, also known as Retain Original type, refers to dimensions or attributes that never change and will not be updated in the data warehouse. The original dimension value is always retained, and no tracking of historical data takes place. Type 0 is applicable to most date dimension attributes.
Examples of SCD Type 0 include:
- Date of birth
- An employee’s start date with a company
- Social security number
SCD Type 1: Overwrite
Type 1, or Overwrite type, refers to an instance where old data is overwritten with new data. The old data is lost as it is not stored anywhere, and the latest snapshot of a record is maintained in the data warehouse without any historical records. With this SCD type, you cannot keep track of changes over time.
This is the default type of dimension you create; you don’t need to specify additional information to create an SCD Type 1.
The benefits of SCD Type 1 include requiring less storage space and being easy to maintain. However, the disadvantage of the Type 1 method is that historical data is lost. Also, it doesn’t support the analysis and reporting of historical data.
This type is commonly used to correct errors in a dimension, updating values that were irrelevant or wrong. No history is retained.
Here are some SCD Type 1 examples:
- Customer address data: The old address is replaced with the new address, and the history of the old address is not required.
- Inventory data: When the stock of an item is updated, the new stock level replaces the old stock level.
- Employee salary data: When an employee’s salary is updated, the new salary replaces the old salary, and the history of the old salary is not maintained.
SCD Type 2: Add New Row
SCD Type 2, also known as the historical tracking method, helps maintain the history of changes in a dimension table.
Every time there is a change in the source system, a new row is added to the dimension table with a unique identifier. The resulting table will retain the prior history, allowing full history tracking.
There are two ways to handle SCD Type 2 dimensions:
- Adding a flag column signifying the record that is currently active. The flag column uses a simple indicator (such as 0/1 or Y/N) to denote whether a specific record is currently active or a valid version.
- Adding one or two timestamp columns to signify when a new record was created or made active and when it was made ineffective.
The benefit of SCD Type 2 is that no data is lost as the history of a record is maintained. This helps with accurately tracking changes over time. However, it can be complex to implement and requires more storage space. Adding new rows to the table adds to the size, eventually making the table unmanageable.
This method is suitable for situations where you require historical data for reporting and analysis. But, if there is a possibility that the structure of the data will change (such as new columns being added to the table), this type is less likely to be used.
Examples of SCD Type 2 include:
- Sales data: You can track your sales of a particular item or profit over time to help you analyze your sales.
- Customer or order data: You can keep track of previous orders of a person and recommend items based on that.
- Customer address: When a customer relocates, a new address entry with a new timestamp is added to the data warehouse. The previous address is retained with its respective timestamp.
SCD Type 3: Add New Attribute
Similar to SCD Type 2, you can also track record changes with SCD Type 3. However, instead of adding a new row for every change, a new column is added to track the previous value. But, this method preserves the most recent history, as it is limited to the number of columns designated for storing historical data.
The SCD Type 3 stores two versions of values for the selected level attributes. Each record will store the current and previous values of the selected attribute. Whenever the value of the attribute changes, the current value gets stored as the old value, and the new value becomes the current value.
You can only track one change in a record rather than multiple changes over time. This method isn’t scalable if you want to preserve complete history, as it allows only to keep the latest version of the history.
The benefit of this SCD type is it requires less storage space. Also, it enables fast queries since there’s limited history to scan. However, it doesn’t provide a complete history of changes since it only tracks limited changes. For columns that require full history, it might not be suitable for tracking changes.
An example of SCD Type 3 is in product pricing or financial reporting, where the previous and current values of a service or product are important.
SCD Type 4: Add History Table
This slowly changing dimension involves maintaining the records in two different tables—a current record table and a historical record table. Changes are tracked in a separate history table. While the main dimension table stays current, the history table stores past data.
With SCD Type 4, a record will be added to the history table for each change in the source system. It’s useful for keeping track of records that have many changes over time. It allows to retain both the original and updated data while still maintaining a small footprint for the main table. The method resembles how change data capture techniques and database audit tables function.
An example of SCD Type 4 is when a product’s price changes. The old price and the effective date are added to a separate history table, while the current price is stored in the main dimension table.
SCD Type 6: Combined Approach
The SCD Type 6 is a hybrid approach of Type 1, Type 2, and Type 3 (1 + 2 + 3 = 6). This includes columns for both historical and current data and a column to track the current version of the record.
Both historical and current data can be stored in the same row, with the current version being easily accessible. This SCD type is useful when a business wants to view both current and historical data in the same report.
An example of SCD Type 6 is when an employee’s role and department might change. While the role change can be presented in a new row (Type 2), the department can be overwritten (Type 1). A column can indicate the previous role (Type 3).
Limitations of Slowly Changing Dimensions in Data Science
While SCDs are an essential part of data warehousing, allowing historical data to be preserved along with changes over time, there are some associated limitations, such as:
- Storage Space: A dataset with a vast number of input features can rapidly increase storage requirements. This is especially true for SCD Type 2, which involves adding a new row for each change.
- Maintenance: With an increase in the number of attributes or dimensions, tracking becomes complex. This can result in an increased possibility of errors.
- Performance: As the data volume increases to track historical changes, it may sometimes deteriorate the performance of the machine learning model.
- Scalability: With growing data and accumulated changes, some SCD types might not scale efficiently. This affects storage, computation, and visualization.
To help overcome these limitations, you can use dimension-reduction techniques.
So, what is dimension reduction in data science?
Dimension reduction is a machine learning or statistical technique that reduces the number of features in a dataset while retaining as much important information as possible. Here, dimensions refer to the number of input features, variables, or columns of a given dataset.
- Slowly changing dimensions in data science involve the addition of new rows, columns, or attributes to capture changes, inherently adding to the dimensionality of the dataset. If you intend to build predictive models that use this historical data, multiple dimensions may result in overfitting. However, with dimension reduction, such datasets become more manageable and help avoid overfitting of models.
- With slowly changing dimensions in data science, every change captured results in an increase in data storage requirements. Additionally, processing datasets with increasing dimensions can be computationally expensive. However, dimension reduction can help overcome these challenges while ensuring that crucial information is retained.
- Most slowly changing dimensions in data science involve redundancy. For example, in SCD Type 2, a new row is added for every change in a dimension. Eventually, this results in multiple rows for the same entity and increased dimensionality. Dimension reduction helps identify and manage such redundancy.
How to Implement Slowly Changing Dimensions in a Data Warehouse?
Ideally, slowly changing dimensions are best considered right at the start of your database creation. If you’re just beginning to build your data infrastructure or working for a startup, this is easier to implement.
For existing databases, start by assessing the data that currently exists in your database. Document the different types of dimensions you find and how they relate to one another. If there are dimensions of type 0 that shouldn’t be, start with those. And if required, add historical tracking as soon as possible.
Next, determine whether you would prefer types 2, 3, or 4 and if they are suitable for your business. Some questions to evaluate are:
- How often do these dimensions change?
- Would you use a timestamp column or a flag column?
- Should you add a historical table that can store a lot of records?
Such a process will involve analytics engineers, data engineers, and data analysts.
Another important thing to consider is how you want to handle previous records that weren’t tracked over time. While you can forgo the historical data and implement SCD for future operations, it would be best to put together old records and create your own version of a timestamp or flag column.
Slowly changing dimensions in data science accommodate the slowly evolving nature of datasets and help preserve historical records. With constant changes in business scenarios, maintaining historical data is vital for effective data analysis.
The five types of slowly changing dimensions in data science provide different mechanisms to track and store changes in evolving data. You can implement the appropriate slowly changing dimension type based on your business needs. This will help you perform accurate analytics for critical insights and improved business growth.
Despite the benefits associated with the different SCD types, as the number of dimensions increases in SCDs, storage, computation, and visualization challenges arise. Dimension reduction in data science can help manage massive datasets better by reducing the number of dimensions while still retaining important information.
In case you want to integrate data into your desired Database/destination, then Hevo Data is the right choice for you! It will help simplify the ETL and management process of both the data sources and the data destinations.
Visit our Website to Explore Hevo
Offering 150+ plug-and-play integrations and saving countless hours of manual data cleaning & standardizing, Hevo Data also offers in-built pre-load data transformations that get it done in minutes via a simple drag-and-drop interface or your custom Python scripts.
Want to take Hevo Data for a ride? SIGN UP for a 14-day free trial and experience the feature-rich Hevo suite first hand. Check out the pricing details to understand which plan fulfills all your business needs