What is Data Warehousing?
Data warehousing is the process of collating data from multiple sources in an organization and store it in one place for further analysis, reporting and business decision making. Typically, organizations will have a transactional database that contains information on all day to day activities. Organizations will also have other data sources – third party or internal operations related. Data from all these sources are collated and stored in a data warehouse through an ELT or ETL process. The data model of the warehouse is designed such that, it is possible to combine data from all these sources and make business decisions based on them.
In this blog, we will discuss 6 most important factors and data warehouse best practices to consider when building your first data warehouse:
- Impact of Data Sources
- The Choice of Data Warehouse
- ETL vs ELT
- Data Warehouse Architecture Considerations
- ETL Tool Considerations
- Bonus – Other Important Factors
Data Warehouse Best Practices: Impact of Data Sources
Kind of data sources and their format determines a lot of decisions in a data warehouse architecture. Some of the best practices related to source data while implementing a data warehousing solution are as follows.
- Detailed discovery of data source, data types and its formats should be undertaken before the warehouse architecture design phase. This will help in avoiding surprises while developing the extract and transformation logic.
- Data sources will also be a factor in choosing the ETL framework. Irrespective of whether the ETL framework is custom-built or bought from a third party, the extent of its interfacing ability with the data sources will determine the success of the implementation.
Data Warehouse Best Practices: The Choice of Data Warehouse
One of the most primary questions to be answered while designing a data warehouse system is whether to use a cloud-based data warehouse or build and maintain an on-premise system. There are multiple alternatives for data warehouses that can be used as a service, based on a pay-as-you-use model. Likewise, there are many open sources and paid data warehouse systems that organizations can deploy on their infrastructure.
On-Premise Data Warehouse
An on-premise data warehouse means the customer deploys one of the available data warehouse systems – either open-source or paid systems on his/her own infrastructure.
There are advantages and disadvantages to such a strategy.
Advantages of using an on-premise setup
- The biggest advantage here is that you have complete control of your data. In an enterprise with strict data security policies, an on-premise system is the best choice.
- The data is close to where it will be used and latency of getting the data from cloud services or the hassle of logging to a cloud system can be annoying at times. Cloud services with multiple regions support to solve this problem to an extent, but nothing beats the flexibility of having all your systems in the internal network.
- An on-premise data warehouse may offer easier interfaces to data sources if most of your data sources are inside the internal network and the organization uses very little third-party cloud data.
Disadvantages of using an on-premise setup
- Building and maintaining an on-premise system requires significant effort on the development front.
- Scaling can be a pain because even if you require higher capacity only for a small amount of time, the infrastructure cost of new hardware has to be borne by the company.
- Scaling down at zero cost is not an option in an on-premise setup.
Cloud Data Warehouse
In a cloud-based data warehouse service, the customer does not need to worry about deploying and maintaining a data warehouse at all. The data warehouse is built and maintained by the provider and all the functionalities required to operate the data warehouse are provided as web APIs. Examples for such services are AWS Redshift, Microsoft Azure SQL Data warehouse, Google BigQuery, Snowflake, etc.
Such a strategy has its share of pros and cons.
Advantages of using a cloud data warehouse:
- Scaling in a cloud data warehouse is very easy. The provider manages the scaling seamlessly and the customer only has to pay for the actual storage and processing capacity that he uses.
- Scaling down is also easy and the moment instances are stopped, billing will stop for those instances providing great flexibility for organizations with budget constraints.
- The customer is spared of all activities related to building, updating and maintaining a highly available and reliable data warehouse.
Disadvantages of using a cloud data warehouse
- The biggest downside is the organization’s data will be located inside the service provider’s infrastructure leading to data security concerns for high-security industries.
- There can be latency issues since the data is not present in the internal network of the organization. To an extent, this is mitigated by the multi-region support offered by cloud services where they ensure data is stored in preferred geographical regions.
The decision to choose whether an on-premise data warehouse or cloud-based service is best-taken upfront. For organizations with high processing volumes throughout the day, it may be worthwhile considering an on-premise system since the obvious advantages of seamless scaling up and down may not be applicable to them.
Data Warehouse Best Practices: ETL vs ELT
The movement of data from different sources to data warehouse and the related transformation is done through an extract-transform-load or an extract-load-transform workflow. Whether to choose ETL vs ELT is an important decision in the data warehouse design. In an ETL flow, the data is transformed before loading and the expectation is that no further transformation is needed for reporting and analyzing. ETL has been the de facto standard traditionally until the cloud-based database services with high-speed processing capability came in. This meant, the data warehouse need not have completely transformed data and data could be transformed later when the need comes. This way of data warehousing has the below advantages.
- The transformation logic need not be known while designing the data flow structure.
- Only the data that is required needs to be transformed, as opposed to the ETL flow where all data is transformed before being loaded to the data warehouse.
- ELT is a better way to handle unstructured data since what to do with the data is not usually known beforehand in case of unstructured data.
As a best practice, the decision of whether to use ETL or ELT needs to be done before the data warehouse is selected. An ELT system needs a data warehouse with a very high processing ability.
Data Warehouse Best Practices: Architecture Consideration
Designing a high-performance data warehouse architecture is a tough job and there are so many factors that need to be considered. Given below are some of the best practices.
- Deciding the data model as easily as possible – Ideally, the data model should be decided during the design phase itself. The first ETL job should be written only after finalizing this.
- At this day and age, it is better to use architectures that are based on massively parallel processing. Using a single instance-based data warehousing system will prove difficult to scale. Even if the use case currently does not need massive processing abilities, it makes sense to do this since you could end up stuck in a non-scalable system in the future.
- If the use case includes a real-time component, it is better to use the industry-standard lambda architecture where there is a separate real-time layer augmented by a batch layer.
- ELT is preferred when compared to ETL in modern architectures unless there is a complete understanding of the complete ETL job specification and there is no possibility of new kinds of data coming into the system.
Data Warehouse Best Practices: Choosing the ETL tool – Build vs Buy
Once the choice of data warehouse and the ETL vs ELT decision is made, the next big decision is about the ETL tool which will actually execute the data mapping jobs. An ETL tool takes care of the execution and scheduling of all the mapping jobs. The business and transformation logic can be specified either in terms of SQL or custom domain-specific languages designed as part of the tool. The alternatives available for ETL tools are as follows
- Completely custom-built tools – This means the organization exploits open source frameworks and languages to implement a custom ETL framework which will execute jobs according to the configuration and business logic provided. This is an expensive option but has the advantage that the tool can be built to have the best interfacing ability with the internal data sources.
- Completely managed ETL services – Data warehouse providers like AWS and Microsoft offer ETL tools as well as a service. An example is the AWS glue or AWS data pipeline. Such services relieve the customer of the design, development and maintenance activities and allow them to focus only on the business logic. A limitation is that these tools may have limited abilities to interface with internal data sources that are custom ones or not commonly used.
- Fully Managed Data Integration Platform like Hevo: Hevo Data’s code-free platform can help you move from 100s of different data sources into any warehouse in mins. Hevo automatically takes care of handling everything from Schema changes to data flow errors, making data integration a zero maintenance affair for users. You can explore a 14-day free trial with Hevo and experience a hassle-free data load to your warehouse.
Other Data Warehouse Best Practices
Other than the major decisions listed above, there is a multitude of other factors that decide the success of a data warehouse implementation. Some of the more critical ones are as follows.
- Metadata management – Documenting the metadata related to all the source tables, staging tables, and derived tables are very critical in deriving actionable insights from your data. It is possible to design the ETL tool such that even the data lineage is captured. Some of the widely popular ETL tools also do a good job of tracking data lineage.
- Logging – Logging is another aspect that is often overlooked. Having a centralized repository where logs can be visualized and analyzed can go a long way in fast debugging and creating a robust ETL process.
- Joining data – Most ETL tools have the ability to join data in extraction and transformation phases. It is worthwhile to take a long hard look at whether you want to perform expensive joins in your ETL tool or let the database handle that. In most cases, databases are better optimized to handle joins.
- Keeping the transaction database separate – The transaction database needs to be kept separate from the extract jobs and it is always best to execute these on a staging or a replica table such that the performance of the primary operational database is unaffected.
- Monitoring/alerts – Monitoring the health of the ETL/ELT process and having alerts configured is important in ensuring reliability.
- Point of time recovery – Even with the best of monitoring, logging, and fault tolerance, these complex systems do go wrong. Having the ability to recover the system to previous states should also be considered during the data warehouse process design.
The above sections detail the best practices in terms of the three most important factors that affect the success of a warehousing process – The data sources, the ETL tool and the actual data warehouse that will be used.
Are there any other factors that you want us to touch upon? Let us know in the comments!