The Reign of Modern Data Stack: History, Components, and Use Cases
About the Author Can Goktug Ozdem is the founder of Datrick. He is a data engineer with over nine years of experience in the field. He is a big fan of remote work and is passionate about bringing insights through data while traveling to different parts of the world.
We’ve been dealing with data for as long as human history goes. However, the way we do it has evolved immensely, especially in the past few decades. The Modern Data Stack has become an integral part of organizations regardless of their size, and the need for data engineers and analysts is on the rise.
So, let’s explore the origins and essence of the Modern Data Stack to help you better understand its components.
Dominance of the On-Premise Data Stack till the Last Decade
In essence, a data stack is a set of tools and technologies that an organization uses to compile, process, store, and transform data. On its own, raw data holds no value. Thus, the role of the data stack is to allow an organization to store all data in one place and provide a means for data engineers and analysts to process, polish, and analyze this data, ensuring that it is useful to the organization.
As technology evolves, so does the data stack.
While the origins of the term “data warehouse” date back to the 1960s, it was in the 1980s when IBM researchers developed what we now know as the business data warehouse. It was, though, far from what data warehouses look like now. The modern cloud also is a relatively new concept that did not exist prior to the early 2000s.
Back then, data basically belonged to IT. Teams and employees outside IT had to ask IT teams to perform their work because managing data and data infrastructure weren’t within their workflow or responsibilities. Sales, marketing, and similar executives worked with data through systems such as Crystal Reports and ERPs, but data tooling required completely different skill sets that they neither possessed nor were required to possess.
For an organization to be data-centric, it needed to acquire legacy servers and hardware, which could be hosted and managed on its own premises and with its own infrastructure.
As time progressed, between approximately 2000 and 2010, technology kept evolving and becoming more common and accepted. Mass digitalization led to the demand for more data. Moreover, those who were not in IT developed a growing interest and need to gain access and control over the data. A great example is marketing teams. They thrive on data, yet IT was unable to supply it quickly enough.
On-Premises Data Stack Shortcomings and the Innovations They Led To
Speed wasn’t the only issue. The on-premises, or traditional, data stack (TDS) comes with several major challenges:
- Costs: Purchasing, setting up, and managing hardware and servers for the on-premises data stack requires significant capital. As infrastructure ages and depreciates, it needs to be replaced.
- Performance: Producing analytics that the company can use is complicated. It requires the investment of significant time and resources.
- Scalability and flexibility: Scaling a traditional data stack requires investing in additional infrastructure, often without the ability to adequately match infrastructure capability to the organization’s needs.
- Security: The security practices of the traditional data stack and any incidents that arise are primarily the responsibility of the organization.
- Troubleshooting: Organizations have to rely on in-house specialists to troubleshoot issues and provide solutions.
These drawbacks have prompted organizations to innovate to meet the need for faster and more agile data services. As a result, we witnessed several innovations.
In 2006, Amazon launched Amazon Web Services (AWS), offering a variety of services to other companies. They included the ability to connect to virtual computers and use remote storage.
A year later, IBM, Google, and various US universities worked together to develop server farm projects that worked with large data sets and required fast processors. Universities quickly realized that this enabled research to be performed much faster and cheaper.
As word spread, change and the rise of the Modern Data Stack were inevitable.
The Rise of SaaS and Modern Data Stack
Because of the limitations of the traditional data stack, businesses needed a more flexible solution that provided them with more control over their data. A pivotal point in time for that was October 2012. It’s when Amazon launched Redshift, a Data Warehouse (DWH) cloud solution.
This paved the way for the Modern Data Stack (MDS) and a range of solutions facilitating it, such as Google BigQuery and Snowflake.
The Modern Data Stack is a combination of tools for data integration that are easy to deploy and can be tailored to address very specific business use cases. As opposed to the legacy on-premises data stack, it uses the cloud rather than hardware to store data.
Let’s take a look at the evolution of the Modern Data Stack over three timeframes:
1. Cambrian Explosion I (2012 – 2016)
While many of the current data integration tool providers were launched at about the same time as Redshift or earlier, Redshift changed the landscape of data integration and was the catalyst that made things take off.
The reason for that was internal architectural differences. Amazon Redshift is designed for massively parallel processing (MPP) and online analytic processing (OLAP) as opposed to online transaction processing (OLTP).
An OLTP system primarily focuses on recording the inserts, updates, and deletes that make up each transaction. OLTP queries are short and simple, so they require little processing time and space, and the data they touch is updated frequently. OLTP databases use normalized tables (3NF). Data integrity requires special attention because a failed transaction can compromise it.
A great example of the OLTP system is an ATM transaction.
OLAP transactions allow complex queries that extract multidimensional data and store historical data that has been input by OLTP. Thus, OLAP transactions are long, less frequent than OLTP, and require more time and space. If a transaction fails, data integrity remains intact, and the user can simply fire the query again. As a result, data tables do not need to be normalized.
Some examples of OLAP transactions include budgeting reports, financial reports, marketing metrics, and sales reports.
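To make the contrast concrete, here is a minimal sketch in Python. It uses an in-memory SQLite database purely as a stand-in (a real OLAP workload would run on a warehouse such as Redshift), and the tables, amounts, and function names are invented for illustration. The withdrawal mimics the ATM example: a short transaction that touches a couple of rows and rolls back on failure. The report is a read-only aggregate over the history those transactions leave behind.

```python
import sqlite3

# In-memory SQLite is only a stand-in for both kinds of system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute(
    "CREATE TABLE withdrawals ("
    "id INTEGER PRIMARY KEY, account_id INTEGER, amount INTEGER, day TEXT)"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500), (2, 300)])
conn.commit()


def withdraw(conn, account_id, amount, day):
    """OLTP-style: a short transaction that touches only a couple of rows."""
    with conn:  # commits on success, rolls back if an exception is raised
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (account_id,)
        ).fetchone()
        if balance < amount:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, account_id),
        )
        conn.execute(
            "INSERT INTO withdrawals (account_id, amount, day) VALUES (?, ?, ?)",
            (account_id, amount, day),
        )


def withdrawals_per_day(conn):
    """OLAP-style: a read-only aggregate over the accumulated history."""
    return conn.execute(
        "SELECT day, COUNT(*), SUM(amount) FROM withdrawals GROUP BY day ORDER BY day"
    ).fetchall()


withdraw(conn, 1, 100, "2024-01-01")
withdraw(conn, 2, 50, "2024-01-01")
withdraw(conn, 1, 200, "2024-01-02")
report = withdrawals_per_day(conn)
print(report)
```

If the report query fails, nothing has been written, so it can simply be run again. A failed withdrawal, by contrast, has to be rolled back so that the balance and the withdrawal log stay consistent.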
Having been designed for OLAP, Redshift is able to process large-volume data sets 10-1000 times faster than OLTP applications.
While Redshift wasn’t the first MPP database, it was the first MPP database that was cloud-native. This enabled massive price reductions, making Redshift the go-to solution and one of the fastest-growing AWS services for a period of time.
With these two critical factors, Redshift solved two of the most pressing problems – speed and price. Seemingly overnight, data processing became not only fast but affordable to virtually anyone, making previously popular tools legacy software.
2. Deployment (2016 – 2020)
Between 2012 and 2016, there was a spike in innovation spurred by Redshift. Companies were catching up and releasing their solutions. Yet, in the years following that, the rate of innovation slowed down.
This is the normal ebb and flow of the innovation cycle. The new solution begins to mature, become more reliable, and cover more use cases as customers increasingly deploy it.
During this time period, BigQuery released support for standard SQL, and Snowflake’s solutions matured. This led to both companies growing in popularity.
3. Cambrian Explosion II (From 2020 Onwards)
After significant innovation that followed the launch of Redshift and the subsequent maturation period, it’s safe to say that we’re headed toward another wave of innovation. Companies that currently provide Modern Data Stack services have solidified their offerings and their clients have accepted and adopted them. Now, these solutions can act as a foundation for new innovations to be built upon.
So, what can we expect in the coming years?
One such area is governance. It’s responsible for a wide range of use cases which include discovering data assets and accessing lineage information; however, it’s relatively immature. Without adequate governance, organizations experience more chaos, which, in turn, leads to the loss of trust. Therefore, for governance and its providers to be more accepted and trusted, it needs to evolve.
Real-time data access is another area that could see innovations in the near future. While real-time data visibility isn’t a necessity for most use cases, there would be significant opportunities for new use cases if it were an option.
Completing the data feedback loop is yet another possibility for potential additional use cases. Nowadays, data goes from data sources into the modern data stack, where it’s analyzed. Once that’s done, for data to drive action, someone needs to pick it up and proactively work on it. If, however, it was possible to feed the data directly into relevant operational systems, this could unlock many new possibilities.
Finally, I would like to mention verticalized analytical experiences. Making both horizontal tooling and verticalized tooling easily accessible could make data analytics more streamlined and faster.
The Need for the Modern Data Stack
The emergence of cloud data warehouse solutions introduced the following improvements:
1. Higher Data Processing Speed
Cloud data warehouse solutions significantly decreased the time necessary to process SQL queries. Until then, slow data processing was one of the main challenges preventing organizations from using their big data. Redshift eliminated this key obstacle, giving businesses an opportunity to better exploit their data.
2. Improved Connectivity
Cloud-based data warehouses handle a wider range of formats and data sources than legacy on-premises servers, making it considerably easier to connect data sources to a warehouse in the cloud.
3. Unlimited User Access
On-premises data warehouses are managed by a company’s team, and end-users get restricted or indirect access. This is deliberate because it reduces the number of SQL requests made, thus saving server resources. By contrast, cloud data warehouses use virtual servers. This allows several simultaneous SQL queries to target the same database, making it accessible to all end-users.
4. Increased Flexibility and Scalability
Because cloud data warehouse solutions do not require companies to invest in their own infrastructure, they are significantly more lightweight, flexible, and affordable.
Instead of needing to spend time and resources to purchase and install servers, companies can simply plug and play and pay as they go, for usage only. As a result, cloud data warehouses are not only available to large enterprises, but also to small and medium-sized businesses and startups.
5. Reduced Costs and Errors
To generate reports using the Traditional Data Stack, companies needed manual labor. Data analysts and engineers had to manually generate reports, clean them, and then transfer the data to Excel. The process was not only time-consuming but also prone to human error. Often, it required data analysts and engineers to focus on generating reports rather than performing their regular duties.
Cloud data warehouse solutions allow companies to generate actionable data insights within a fraction of the time as well as save resources by working with leaner data teams.
Components of the Modern Data Stack
The Modern Data Stack is a combination of tools for data integration that are easy to deploy and can be tailored to address very specific business use cases. It’s how a company’s data is managed and leveraged, or, more precisely, how you can collect, transform, store, model, visualize, and activate data.
There are numerous tools that can help with each of these and it’s up to each company to decide which best suit its goals, size, budget, and other resources. So, let’s review some of the most important ones by category.
1. Data Integration
Data integration is the first step in any data strategy. It’s the process of transporting data from a wide range of sources to a data storage location where the organization will assess, use, and analyze this data. The storage destination is typically a data warehouse, although it can also be a database, data lake, data mart, or simply a document store.
Sources can be virtually anything that is relevant to the company, such as but not limited to:
- SaaS like CRMs,
- HTTP Clients,
- Event streams,
- Spreadsheets, or
- Any information acquired through the Internet.
Some of the most popular tools in this category are:
- Hevo: An end-to-end data pipeline platform that facilitates easy pulling of data from all your sources to the warehouse. It can also help businesses run transformations for analytics as well as deliver operational intelligence to various business tools. Hevo integrates with over 150 data sources and over 15 destinations.
- Airbyte: An open-source data integration platform that focuses on data portability, data security, and data accuracy.
- Fivetran: An automated data integration provider with ready-to-use connectors that automatically adapt as APIs and schemas change, thus continuously synchronizing data from source applications to the preferred organization’s destination.
- Stitch: A data integration platform that enables the moving of data from more than 130 data sources into a data warehouse and provides analysis-ready data with no coding skills required.
2. Data Storage and Querying
In the Modern Data Stack, data storage is usually done on a cloud data warehouse or data lake or, in other words, the destination where the ingestion tool is going to send your data. Meanwhile, data querying refers to retrieving data from a data warehouse or a database.
The key players in this category are:
- Snowflake: Currently, one of the most popular Modern Data Stack storage platforms that promises to unify data warehouses, data lakes, and siloed data.
- BigQuery: Cloud data warehouse from Google that is serverless and cost-effective. It provides multi-cloud data warehouse services that enable businesses to turn big data into actionable business insights.
- Redshift: Amazon’s cloud data warehouse that uses SQL, AWS-designed hardware, and machine learning to analyze both structured and semi-structured data from databases, data warehouses, and data lakes.
- Databricks: Data lakehouse, architecture, and AI company that allows congregating data, analytics, and AI on a single platform.
- Firebolt: A newcomer among cloud data warehouses that boasts speed and elasticity at scale.
3. Data Transformation
Data transformation in the Modern Data Stack refers to the process of converting, cleaning, and structuring data from the format of its source, for example, an Excel spreadsheet, an XML file, or a CSV document, into the format required by the destination system.
The most popular tool in this category is:
- dbt: A data transformation tool that helps data analysts and engineers transform, test, and document data in the cloud data warehouse.
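As a rough illustration of converting, cleaning, and structuring, the Python sketch below turns a small raw CSV export into typed records. Everything in it is an assumption made for the example: the column names, the rule that rows without an amount are dropped, and the choice to store money as integer cents. Tools like dbt typically express this kind of logic in SQL inside the warehouse rather than in Python, so treat this only as a picture of the idea.

```python
import csv
import io
from decimal import Decimal

# A made-up raw export with stray whitespace, mixed casing, and a missing value.
RAW_CSV = """order_id,customer,amount,order_date
1001, Alice ,19.90,2024-01-05
1002,bob,,2024-01-06
1003,CAROL,42,2024-01-07
"""


def transform(raw_text):
    """Clean raw CSV rows into the typed records a destination table might expect."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        amount = row["amount"].strip()
        if not amount:
            continue  # assumed business rule: skip rows that have no amount
        cleaned.append(
            {
                "order_id": int(row["order_id"]),
                "customer": row["customer"].strip().title(),
                # Decimal avoids floating-point rounding when converting to cents.
                "amount_cents": int(Decimal(amount) * 100),
                "order_date": row["order_date"].strip(),
            }
        )
    return cleaned


records = transform(RAW_CSV)
for record in records:
    print(record)
```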
4. Data Visualization
In the Modern Data Stack, data visualization is the process of representing data through visual contexts, such as graphs, charts, animations, maps, and similar.
The most commonly chosen tools in this category are:
- Looker: Business Intelligence (BI) software and big data analytics platform that enables you to easily explore, analyze, and share real-time business analytics in an interactive and dynamic way.
- Power BI: Microsoft’s data visualization cloud service that allows users to view dashboards and reports as interactive visuals.
- Tableau: Business analytics platform that allows users to create intuitive and visual analytics that make data interactive, easy to understand, and engaging.
5. Data Governance and Monitoring
Data governance and monitoring in the Modern Data Stack is the process of managing the quality, availability, usability, integrity, and security of data based on internal data standards and controlling data usage.
Some of the top-performing companies in this category are:
- Monte Carlo: A data reliability company that describes itself as the industry’s first end-to-end data observability platform. Its mission is to reduce data downtime in order to accelerate the adoption of data globally and realize the full potential of data.
- Alation: A data governance platform that aims to build trust through visibility in data as well as enable data democratization in order to balance risk mitigation and allow key people in your organization to access data.
- Datafold: A data reliability platform that focuses on automating analytical data quality management so that companies can extract more value from it.
Data Use Cases
As for the practical application of the Modern Data Stack and the tools that support it, there are three primary use cases. They are important to consider so that your organization can select the best tools for your specific business needs.
1. Collaboration among Data Analysts and Scientists
Various teams and departments in your organization need data to effectively perform their duties and responsibilities.
The marketing and sales teams need to gain a thorough view of the metrics to pivot marketing strategies accordingly. Customer support teams require data to effectively respond to customer support requests and leave them happy. Meanwhile, business development executives need to gain a complete view of the organization to make the best investment decisions.
Ensuring that your data is easily accessible and understood not only helps non-IT staff to effectively interpret it, but also enables collaboration between different departments and teams.
2. Business Intelligence
Business Intelligence is the process of combining a variety of software and systems such as business analytics, data mining, data tools, data infrastructure, and data visualization with best practices to enable organizations to make better decisions.
Comprehensive data allows companies to eliminate inefficiencies, adapt to market changes, respond to changing customer behaviors, navigate supply fluctuations, and drive change.
For this to be possible, however, data needs to be easily accessible and easily understood by decision-makers. This is exactly what the Modern Data Stack helps organizations to accomplish.
3. Reverse ETL
ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform) are both data integration processes that allow businesses to extract data from third-party sources and save it into a target destination such as a data warehouse. The difference is the order of the steps: ETL transforms the data before loading it, while ELT loads the raw data first and transforms it inside the destination. An example would be extracting data from Salesforce, HubSpot, or Klaviyo and loading it into your destination of choice, such as Snowflake or BigQuery.
In reverse ETL, the company’s data gets extracted from the data warehouse and transformed in order to meet the requirements of a third-party system, where it will be subsequently loaded for further action. In this case, you would extract data from Snowflake or BigQuery and load it back to Salesforce, HubSpot, or Klaviyo.
Regardless of whether you used ETL or ELT to load data into your data warehouse, the reverse method is called reverse ETL. That’s because data warehouses are unable to load data directly into third-party applications. The data first needs to undergo transformation to meet the format that the third-party system requires.
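The extract, transform, and load steps of reverse ETL can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular reverse ETL product works: SQLite stands in for the warehouse, the `customer_metrics` table and its columns are invented, the payload shape is a hypothetical CRM contact format, and `send` is a placeholder for whatever API client the destination would actually need.

```python
import json
import sqlite3

# SQLite stands in for the warehouse; table and column names are made up.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customer_metrics (email TEXT, lifetime_value REAL, open_tickets INTEGER)"
)
warehouse.executemany(
    "INSERT INTO customer_metrics VALUES (?, ?, ?)",
    [("ana@example.com", 1250.0, 0), ("li@example.com", 80.5, 3)],
)


def extract(conn):
    """Extract: read already-modeled rows out of the warehouse."""
    return conn.execute(
        "SELECT email, lifetime_value, open_tickets FROM customer_metrics ORDER BY email"
    ).fetchall()


def to_crm_payload(row):
    """Transform: reshape a warehouse row into the hypothetical CRM contact format."""
    email, lifetime_value, open_tickets = row
    return {
        "email": email,
        "properties": {
            "ltv_tier": "high" if lifetime_value >= 1000 else "standard",
            "needs_follow_up": open_tickets > 0,
        },
    }


def load(payloads, send):
    """Load: pass each serialized payload to `send`, a placeholder for a real API call."""
    for payload in payloads:
        send(json.dumps(payload))


sent = []  # collects what would have been sent to the third-party system
load([to_crm_payload(row) for row in extract(warehouse)], sent.append)
print(sent)
```

Passing `send` in as a parameter keeps the sketch runnable without network access; in practice it would wrap the destination’s own client and handle authentication, batching, and retries.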
There are various ways in which you can benefit from reverse ETL. Some of them include:
- Combining the information on products, sales, and support so that marketing teams can use HubSpot for more personalized campaigns.
- Synchronizing internal support channels so that customer service through Zendesk is improved.
- Uploading customer data to Salesforce to improve the sales process.
Control Your Data Rather Than Let It Control You
While the Modern Data Stack and its numerous tools give businesses greater visibility into, usability of, and autonomy over their data, it’s important to ensure that data doesn’t consume your organization. Choosing the tools that best suit your business’s goals and needs, and then managing them, can still be a challenge.
If you would like to ensure that your data is in good hands while you focus on growing and scaling your business, the Datrick team of data engineers and analysts can help you select the best repertoire of tools, deploy, and manage them. Reach out to us for a free consultation.