What Is Data Mesh and How It Helps Clean Your Data Mess

Big Data, Data Mesh • August 13th, 2022

What is Data Mesh
Data mesh is a new approach to big data management. It enables organizations to get value from data at scale, despite the messiness and organizational complexity. Data mesh uses a decentralized architecture to delegate data responsibility at the domain level and deliver high-quality transformed data as a product—all while ensuring data quality and governance.

Nearly all businesses in the era of self-service identify themselves as data-first. Still, not all of them give their data architecture the democratization and scalability it requires.

This demand for big data democratization and scalability comes from the fact that your current data architecture, which usually consists of a data warehouse or a data lake, may not be able to sustain a growing number of new data sources and diverse use cases. To stay flexible in large, complicated environments, you need a new approach—data mesh—to get value from data fast and cost-effectively. 

At its core, data mesh is about aligning people, processes, and organizations, not just technology. This is because we believe that the magic lies not in improving your current technology stack, but in the people and processes already in your company. They build your products, core values, and, ultimately, an effective data-driven company. 

But just as good things don’t come easily, data mesh isn’t a quick fix that delivers results right away. It is a multiyear transformation that includes changes to data culture, data organizational structure, data roles, data architecture, and technology. In this piece, we explore the objectives of data mesh, and how it empowers your teams to handle fast-growing business complexities, make data operations efficient at scale, and drive a profitable data-driven culture.

The What and Why of Data Mesh

What is data mesh? Our definition

We think of data mesh as a disciplined approach to accessing and using crucial data in your organization. The goal here is to deliver value at scale, sustainably.

Zhamak Dehghani, the director of emerging technologies at Thoughtworks, introduced the concept of data mesh for developing and managing big data platforms in her widely read 2019 blog article. She calls it a “socio-technical paradigm” where businesses co-design the social and technical architecture elements simultaneously in an attempt to simplify user and technology interactions.

This new approach doesn’t rely on any revolutionary technology in and of itself; instead, it uses a cutting-edge topological system that places data and domain experts at the core of business operations. Data mesh essentially focuses on sharing, accessing, and managing analytical data in complex and large-scale environments inside and outside your organization.

It carries two core ideas:

  • Decentralized Domain Ownership of Data: Data mesh enables decentralized domain “ownership” of data. By ownership, we mean total accountability from start to finish. Decentralization decomposes the data logically and enables domain experts to manage the data independently. 
  • Product Thinking to Analytics Data: Domain-oriented data is viewed directly as a discoverable, understandable, trustworthy, and valuable product for all users.

A Simple Analogy: Auto Inventory Management 

To put things into perspective, consider a car manufacturing company that procures multiple components from different manufacturers. Here, all parts arrive at a central warehouse, similar to a single source of truth in a big data environment. 

In this company, a central inventory manager supervises quality procurement and distribution to respective departments, akin to centralized data teams who ensure quality and accurate data delivery from data producers to data consumers.

On a small scale, these tasks, from procurement to distribution, seem manageable. The central inventory manager can receive assistance from more personnel in case demand increases. But as production levels mushroom and more and more components get added, the set of tasks becomes fiendishly tricky for this small team.

To seamlessly handle large-scale distribution, the central inventory manager has to use a distributed approach that decentralizes his responsibilities so that components can be categorized and assembled correctly from the start.

[Figure: Data Mesh Analogy: Auto Inventory Management]

Making this process more efficient means distributing components according to their function, size, and weight into function-specific warehouses (just like treating data as a product), each with its own inventory heads (domain experts in the data environment). 

In this framework, the central inventory manager manages the high-level components, while individual department-specific inventory managers manage the low-level components to ensure that they are not stuck in isolation, hampering the manufacturing output.

Why Do You Need a Data Mesh?

Many companies follow a centralized model where a single data team maintains and uses a data lake or data warehouse to serve the analytical requirements of operational teams. These teams ensure that data is of the desired quality and provided promptly to all users.

However, central data teams are typically disconnected from both individual use cases and the specifics of data creation in the source systems. Even when the members of such a team are motivated to close this gap, they usually lack the domain expertise needed to solve problems directly.

Instead, they have to exhort data-producing teams to carry out this work. This creates a gap between data providers and data consumers, typically causing unneeded conflict, misunderstandings, and negative experiences.

Central data teams, regardless of whether they manage a data warehouse, a data lake, or both, often end up in a middleman role and quickly turn into a roadblock as the number of data sources multiplies and, at the same time, the number and complexity of business use cases increase.

Data mesh helps you derive value from big data and retain agility in complex and large environments through decentralization.

On the podcast Coding Over Cocktails, hosted by Toro Cloud, Zhamak Dehghani described data mesh, the new decentralized scheme, as follows:

“The data mesh, at heart, tries to solve the problem of scale. If you fast forward life 15 or 20 years down the track in the future and imagine that everything that we do is having augmented intelligence, using data and the data that feeds those models can come from anywhere, any source on the planet – then it just doesn’t make sense to have centralized solutions.”

The ever-changing business landscape in terms of the diversity of data usage demands a reduction in operational costs and data-driven optimizations at scale. These aspirations act as prompts for big data organizations to adopt a data mesh architecture. 

[Figure: Reasons Driving The Data Mesh Approach]

If you can identify with one or more of the following, data mesh may be of assistance to you:

  • Your data consumers are having a hard time identifying or accessing data.
  • Your central data team is dealing with a long backlog of issues, such as maintaining data models and catalogs, deciding on access rights, and calculating the costs of back-and-forth data transmission.
  • Your current tech stack with a data lake/warehouse fails to scale as data sources and consumers grow, making it incapable of materializing data-driven value.
  • Your centralized data management platform asks for adjustments to the whole data pipeline for faster querying and is currently unable to respond at scale.
  • The time and effort required to follow data governance standards are becoming a bottleneck for your data processing and analysis teams. They are struggling to gather vital business knowledge, which is undermining their competitive advantage.

If you face any of these challenges only to a moderate degree, or if the hassles they create are minor, data mesh is unlikely to be a wise investment. There is no point in using a cannon to kill a fly.

But if several of the issues listed above intersect, or if your company has to scale up a complicated socio-technical architecture to accommodate anticipated tech development and create value from data, data mesh is likely to work for you.

Data Mesh Outcomes for Big Data Companies

At a high level, data mesh aims to achieve three major outcomes for all big data companies:

  • Fluid response to changes like business complexity, volatility, and ambiguity 
  • Agility in the face of expansion
  • Improved ratio of value from data to investment

[Figure: 3 Key Outcomes of Data Mesh]

The data industry is changing rapidly in terms of the volume, velocity, and variety of data. Although analytical data architectures have improved significantly in resource allocation and decomposition, centralized data repositories like data warehouses or data lakes still struggle to handle data at scale. As sources, consumers, and transformations diversify, data warehouses and lakes create tension and friction in both the architecture and the organizational structure.

A Twitter thread explains the general perception of these central repositories and how they have evolved over the years.

[Figure: Twitter Thread Explaining Data Repositories]

Data mesh addresses these monolithic bottlenecks by introducing a peer-to-peer approach to data collaboration when serving and consuming data. It enables data consumers to directly discover and use data from the source data products without the intervention of a central team or data repository. 

Within the distributed structure of the data mesh, data is seen as a product, with each business unit owning a distinct domain. This decentralized data ownership model enables business units and operational teams to rapidly and easily access and analyze “non-domain” data, thereby reducing the time to insight and time to value.

Let’s try to understand how data mesh streamlines a user experience in discovering, understanding, trusting, and using quality data with its four cornerstones.

4 Core Pillars of Data Mesh

[Figure: 4 Core Pillars of Data Mesh]

The four guiding pillars of data mesh (data as a product, domain ownership, self-service, and federated governance) help us understand how it functions.

These pillars don’t limit you to Java, Apache Kafka, or relational databases because they are all independent of any one technology. A well-built data mesh is adaptable in two ways: it lets you add more compute as necessary, and it can absorb changes as the business develops and expands, and as consumers’ expectations of the data change.

Domain-Level Data Ownership

The first principle of data mesh enables decentralization and distribution of data responsibility to people who are closest to the data. This is because the functional groups or domains closest to the data are best able to understand what analytical data exists and how it should best be interpreted. 

In most businesses, the common entity domains we see are customers, sales, and marketing. But in big data companies, there can also be nuanced domains (or subdomains), such as those that only collect click data produced by a web page. In essence, the definition of a domain varies from business to business, but it can be any subject that is intricate and significant enough to warrant the development of expertise.

From an architectural standpoint, data decomposition indicates that business domains or their subdomains should be used to create ownership boundaries rather than using systems, technologies, or process steps as the leading rationale.

Using this concept, data mesh aims to develop domain boundaries and domain experts, who have the power and the capabilities (i.e., skills and resources) to make sound decisions and extract the maximum value possible from a domain’s data. However, it is crucial that they also have the accountability and skills to handle the results of their choices. When responsibility and execution are concentrated within one area rather than being outsourced to centralized organizations, efficiency is greatly increased since issues are resolved where they first arise.

The main driving force behind this decentralization is to improve scalability with an increase in the variety and quantity of data sources and data-consuming applications.

[Figure: Improved Scalability with Decentralized Responsibility]

Data as a Product

Each team that publishes data in a data mesh system should view data as a product. Teams own data in the same way that they own the collection of services used to implement the part of the company they serve. The team must think of its data as a product because it alone is in charge of the data’s quality, representation, and coherence.

Treating data as a product means adopting and developing a product-thinking approach. It implies that someone has deliberate control over its creation, is accountable for it, and is primarily responsible for its quality.

In the context of data mesh, this is the duty of the data product owner, who defines the data product, and the data product developers, who build it. A data product has a specific name and a set of defined attributes, similar to a product on a shelf in a store, such as:

  • Level of quality
  • Level of availability
  • Security rules
  • Frequency of updates
  • Specific content

Data as a product also introduces a degree of standardization to help your teams to incorporate every single element into the larger data mesh ecosystem. To classify a set of data as a product, we recommend that you check for three attributes:

  1. Self-described and discoverable data in the ecosystem of data mesh
  2. Addressable data asset bearing a unique address
  3. Interoperable data (connected across domains through harmonization) that follows predefined standards and an established SLA

[Figure: Data as a Product: Pictorial Representation]
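
As a rough illustration, the attributes and the three checks listed above could be captured in a small descriptor. This is a minimal sketch, not a standard: the `DataProduct` class, its field names, the `readiness_issues` helper, and the `mesh://` address scheme are all hypothetical and only meant to show what "self-described, addressable, and interoperable" might look like in code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor for one data product (field names are illustrative)."""
    name: str
    domain: str
    address: str              # unique address, here a made-up URI-like string
    description: str          # self-description used for discovery
    schema: dict              # column name -> type name
    update_frequency: str     # e.g. "hourly" or "daily"
    availability_slo: float   # target availability as a fraction, e.g. 0.99
    allowed_roles: frozenset = field(default_factory=frozenset)


def readiness_issues(product, registry, shared_types):
    """Return the reasons a product does not yet qualify; an empty list means it passes.

    registry maps already-claimed addresses to product names;
    shared_types is the set of type names agreed on across domains.
    """
    issues = []
    # 1. Self-described and discoverable
    if not product.description.strip() or not product.schema:
        issues.append("missing description or schema")
    # 2. Addressable: the address must not belong to another product
    owner = registry.get(product.address)
    if owner is not None and owner != product.name:
        issues.append(f"address already used by {owner}")
    # 3. Interoperable: shared types only, plus an explicit SLA
    unknown = sorted(t for t in product.schema.values() if t not in shared_types)
    if unknown:
        issues.append(f"non-standard types: {unknown}")
    if not 0 < product.availability_slo <= 1:
        issues.append("availability SLO must be in (0, 1]")
    return issues
```

A domain team could run `readiness_issues(product, registry, shared_types)` before publishing and only register the product when the returned list is empty.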

Self-Serve Data Infrastructure as a Platform

Data mesh scales out sharing, accessing, and using analytical data in a decentralized manner. Across the entire business, a self-serve data platform service is developed to manage the full life cycle of individual data products and remove the friction of data sharing from source to consumption.

For instance, if you were creating a sales prediction for Japan, you could get all the data required to create that report, preferably within a short period of time. You would be able to easily transfer all the data you require, from all the locations where it is stored, into a database or reporting system that you are in charge of.

Building and sustaining data products involves a lot of resources and a highly specific set of skills (ranging from computational environment management to security). The viability of the entire data mesh concept would be threatened by multiplying the needed effort by the number of data products. 

The self-serve data platform’s goal is to centralize repetitive and generalizable operations to the extent required (again, depending on the business environment!) and to provide a collection of tools that abstract away specialist skills. For both makers of data products and users of data products, it would lower entry and access barriers.

[Figure: Self-Serve Data Infrastructure Platform]

Federated Computational Data Governance

The final pillar of data mesh is to automate and federate data governance across all involved members. It seeks to give the ecosystem of disparate data products a single structure and interoperability. The goal here is to enable autonomous data products to operate in one true data mesh rather than merely as standalone ones. The governance execution model here heavily relies on codifying and automating the policies at a fine-grained level for every data product, via the self-serve data platform services.
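
One way to picture "codifying and automating the policies" is to express each global rule as a small function that the platform runs against every data product’s metadata. The sketch below assumes a hypothetical metadata layout (an `owner` key and a `columns` list with `pii` and `masked` flags), and the two policies are invented examples rather than rules prescribed by data mesh.

```python
from typing import Callable, Dict, List, Optional

# A policy inspects one product's metadata and returns an error message,
# or None when the product complies.
Policy = Callable[[dict], Optional[str]]


def require_owner(meta: dict) -> Optional[str]:
    return None if meta.get("owner") else "no owner assigned"


def require_pii_masking(meta: dict) -> Optional[str]:
    unmasked = [
        col["name"]
        for col in meta.get("columns", [])
        if col.get("pii") and not col.get("masked")
    ]
    return f"unmasked PII columns: {unmasked}" if unmasked else None


# Agreed on centrally by the federated governance group (illustrative list).
GLOBAL_POLICIES: List[Policy] = [require_owner, require_pii_masking]


def evaluate(products: Dict[str, dict],
             policies: Optional[List[Policy]] = None) -> Dict[str, List[str]]:
    """Run every policy against every data product; return violations per product."""
    active = GLOBAL_POLICIES if policies is None else policies
    report: Dict[str, List[str]] = {}
    for name, meta in products.items():
        violations = []
        for policy in active:
            message = policy(meta)
            if message is not None:
                violations.append(message)
        if violations:
            report[name] = violations
    return report
```

In this arrangement the rules are defined once, while each domain team stays responsible for fixing the violations reported for its own products.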

Although data mesh appears to add another level of complexity to the already expansive domain of data governance, it helps you manage a transfer of duties to data products. Each of the data products must have procedures for handling owned infrastructure, code, and data (and metadata) securely and effectively.

Additionally, data governance processes must strike a balance between the autonomy-driven flexibility and creativity provided to data product teams and the company-wide cohesiveness of data solutions. This balance is typically established through standardization. It is crucial to realize that finding a balance between central governance and local autonomy will never be easy. It will always depend on the characteristics of your company.

Having discussed the four core pillars of data mesh, here is an infographic to summarize how the four components complement each other and address the challenges that may arise from others.

[Figure: Roles of The 4 Core Pillars]

What the Creator Has to Say: Zhamak Dehghani on the “Socio-Technical” Paradigm Shift

Zhamak Dehghani conceived the idea of the data mesh as a paradigm shift in big data management. She calls it a “socio-technical” approach in her book. Let’s look at what she has to say about this new approach.

The “Socio” Component of the Paradigm: How Data Mesh Unites Domain-Focused and Infrastructure Personnel

  • To address the increasing number of data sources and systems, data mesh replaces the bottleneck of the centralized monolithic lake or warehouse with smaller, more agile data domains.
  • It gives more flexibility in the face of ongoing changes in data models and data structure by delegating responses to the people who know the data the best – domain experts in their respective disciplines.
  • It forces domains to adopt a “data as a product” mentality in which they standardize the data sets to make them available to the rest of the business and treat data quality the same way they would treat product quality.
  • Likewise, it is intended to operate with, rather than against, a highly complex and dynamic organizational environment by allocating data pipeline activities to the area where business objectives are controlled.
  • Domains can still freely interact while retaining their own distinct business jargon and language in their data. For instance, regardless of whether the domain dataset specifies the nation as “U.S.,” “U.S.A.,” “USA,” or “United States,” they can exchange U.S. sales data.
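
The country example in the last point can be sketched as a thin translation layer: each domain keeps its own labels, and a shared alias table maps them to one canonical code when data is combined. The alias table, the function names, and the row layout (`country` and `amount` keys) are assumptions made for illustration; "US" is the ISO 3166-1 alpha-2 code.

```python
# Shared alias table; each domain keeps writing the label it prefers.
COUNTRY_ALIASES = {
    "u.s.": "US",
    "u.s.a.": "US",
    "usa": "US",
    "united states": "US",
}


def canonical_country(label: str) -> str:
    """Map a domain-specific country label to a canonical code when one is known."""
    cleaned = label.strip()
    return COUNTRY_ALIASES.get(cleaned.lower(), cleaned)


def total_sales_by_country(*domain_rows):
    """Combine rows from several domains, normalizing the country label on read."""
    totals = {}
    for rows in domain_rows:
        for row in rows:
            country = canonical_country(row["country"])
            totals[country] = totals.get(country, 0) + row["amount"]
    return totals
```

Because normalization happens when data is exchanged, neither domain has to rewrite its own dataset.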

The “Technical” Component of The Paradigm: How Data Engineering Practices are Impacted by Data Mesh

  • Data mesh increases data integrity and trustworthiness by combining the analytical and operational computing planes and bringing analytical data closer to its source.
  • It minimizes inadvertent data duplication and pipeline complexity. Greater freedom for LOB (line of business) owners to expand, innovate, and implement minor changes quickly is supported by a less complicated data pipeline. Less copying decreases storage expenses as well as the effects of data drift, which threatens data quality.
  • Instead of emphasizing data input, data mesh focuses on data serving.
  • To make the health of data assets understandable throughout their life cycle, data mesh takes into account the need for a consistent data observability, governance, and discoverability layer.

How to Get Started with Data Mesh?

When attempting to create a data-driven business, one of the most common fallacies is that technology can compel change. When people learn about a new concept that claims to resolve all of their data-related issues, they search for the shortcut, the simple solution, or the tool that will address all of their difficulties.

Most data professionals are now aware that even a technological upgrade is often a mid- to long-term project, particularly when it requires migrating hundreds or thousands of internal data use cases. Before adopting a data mesh, weigh all the needs and implications by asking yourself the following questions:

  • Are you facing a large proliferation of data sources and use cases in your organization?
  • Do you identify data enabling ML and analytics as a strategic differentiator?
  • Do you have the vision to use data to perform current activities differently?
  • Do you want to build and create technology that will allow for data sharing and consumption at the core of each business function?

If you answered yes to these questions, you are probably in the right mindset to adopt a data mesh. Here, we share a four-step approach to executing data mesh in an iterative, value-driven way. 

Develop a Data-Product Centered Mindset 

Data mesh is mostly a shift in thinking about how to interact with data on an organizational scale rather than a technical issue that can be resolved by simply implementing a new set of tools. It takes time to alter people’s perspectives on using big data. To begin an organizational transformation in the data arena, the best recommendation we can offer is to start small.

As with any organization, yours too will have innovative teams that are modern in their working methods. These are the ideal candidates to sow the seeds of change by creating a first data product that can serve as a model for the rest of the firm. 

Find an accountable team member to undertake the function of the data product manager. The data product manager will be in charge of finding the domain experts, talking to them, finding out what they need, and setting the exact requirements for the data output that is needed.

Create or Modify Existing Data Infrastructure Alongside Evolving Data-Product

If the initial data product team is starting from scratch, it’s better to plan the infrastructure requirements first and then begin by setting up and running the infrastructure required for their particular use case. 

We don’t think it’s a good idea to build the data infrastructure platform before the first data product. This is because there is a big chance that the platform will be over-engineered or built in a way that doesn’t meet the needs of the first data products. Instead, we recommend making the first data product and the basic data infrastructure at the same time. Then, iterate and improve both the infrastructure and the first data product to make a minimum viable product (MVP). 

When you start to see that subsequent data products begin to have repeating requirements, the infrastructure elements that were developed as a part of the initial data product can then be extracted to become the first capabilities of the self-serve data infrastructure platform.

[Figure: Data Products Development Process]

The vast majority of organizations, however, will have to work with pre-existing infrastructure, which often consists of data warehouses or data lakes and the central infrastructure teams that run them. Once you know which infrastructure capabilities you require, take another look at the capabilities that are already in place. If supply matches demand, collaborate with the infrastructure team and adopt the necessary changes.

With the approach we’ve discussed so far, you may be able to create a successful example of a production setup that creates value for your company, not just a proof of concept. If you can set a good example, it will be much easier to get other teams to use similar strategies. It will also give your top management confidence to keep supporting you.

Scale the Mesh with Self-Serve Data Infrastructure

Data-mature companies usually begin their journey with some type of platform infrastructure team already in place. These teams know a lot about the problems that come with having a central responsibility. It’s crucial to realize, however, that moving toward a self-serve, data-agnostic architecture typically calls for more substantial adjustments, which again require time and money. Expecting your present central infrastructure team to activate the data mesh on their own while still carrying out their current duties is unrealistic.

Teams in charge of central data processing systems frequently participate in the process of granting access, with requestors providing all the details: the data requested, the use case, who is part of the request, and so on. This arrangement appears fine, but it hampers scaling when the central team lacks subject matter expertise or faces a large number of access requests.

How could we handle these troubling scenarios? The obvious way is to divide control over the computing infrastructure between central and domain data teams differently.

Find a balance where the central team can provide a capability that eliminates the biggest pain point (moving from managing infrastructure to filling a template) while still allowing the domain teams to fully own the use cases they are working on. This prevents the overloading of central data teams. 
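
To make "filling a template" concrete, here is a minimal sketch in which a domain team submits a small declarative template and a central platform function validates it and expands it into provisioning steps. The template fields, the allowed refresh values, and the step descriptions are hypothetical; a real platform would call its own infrastructure tooling instead of returning strings.

```python
# Fields a domain team must fill in, with the expected Python type of each.
REQUIRED_FIELDS = {
    "product_name": str,
    "domain": str,
    "storage_gb": int,
    "refresh": str,
}
ALLOWED_REFRESH = {"hourly", "daily", "weekly"}


def validate_template(template: dict) -> list:
    """Return a list of problems with the template; an empty list means it is valid."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in template:
            errors.append(f"missing field: {name}")
        elif not isinstance(template[name], expected):
            errors.append(f"{name} must be {expected.__name__}")
    refresh = template.get("refresh")
    if isinstance(refresh, str) and refresh not in ALLOWED_REFRESH:
        errors.append(f"refresh must be one of {sorted(ALLOWED_REFRESH)}")
    return errors


def provisioning_plan(template: dict) -> list:
    """Expand a valid template into the steps the central platform would carry out."""
    errors = validate_template(template)
    if errors:
        raise ValueError("; ".join(errors))
    prefix = f"{template['domain']}/{template['product_name']}"
    return [
        f"create storage bucket {prefix} ({template['storage_gb']} GB)",
        f"schedule {template['refresh']} pipeline for {prefix}",
        f"register {prefix} in the data catalog",
    ]
```

The central team owns `provisioning_plan` and the rules behind it, while the domain team only decides what goes into the template.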

A self-serve data platform should be designed to remove friction from source to consumption. It should abstract data management complexity and reduce the cognitive load of domain teams in managing the end-to-end life cycle of their data products. This self-serve platform should:

  • Ensure that autonomous teams can benefit from data.
  • Exchange value with autonomous and interoperable data products.
  • Reduce the cognitive load to speed up the exchange of value.
  • Scale out data sharing.

Sustain the Mesh with Federated Computational Data Governance

Creating a data mesh on a corporate scale tackles a variety of data-related issues. Decentralization speeds things up, but there’s always a chance that silos will form that only see their own environment and have no perspective beyond it. This causes misalignment between data products, which eventually leads to long delays and wasted hours when creating a consolidated view of multiple data products.

We must incorporate the last layer of the data mesh idea in order to combat this natural drifting effect and work toward organizational alignment. You need to consider how federated data governance might help to attain interoperability. You should consider the semantics required to combine data from several domains. This frequently entails taking a look at concepts that span domains, like users. It is not guaranteed that a term’s semantics—such as its identifier uniqueness—are shared across all domains.

The federated governance group’s first objective is to gather delegates from various areas to identify polysemes. Polysemy is a term from linguistics that describes words or phrases that have distinct but related meanings depending on the context. Following the discovery of such polysemes, it is necessary to model the relationships between their various meanings.

Individual domains or transient cross-domain working groups should be given responsibility for the implementation of mappings and the modeling of relationships between polysemes. For example, concepts such as “lead,” “session,” and “active users” can be interpreted differently in different contexts. Different domains can be linked by a cross-domain mapping that translates between the various interpretations.

[Figure: Polysemy: Pictorial Representation]
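
A cross-domain mapping for polysemes can be as simple as a table keyed by domain and local term. The sketch below is illustrative only: the domains, the shared concept names such as `contact.sales_qualified`, and the helper functions are assumptions, not part of any data mesh specification.

```python
# (domain, local term) -> shared concept agreed on by the governance group.
TERM_MAP = {
    ("marketing", "lead"): "contact.newsletter_signup",
    ("sales", "lead"): "contact.sales_qualified",
    ("product", "active user"): "user.active_30d",
    ("finance", "active user"): "user.paying",
    ("sales", "session"): "web.session",
    ("product", "session"): "web.session",
}


def find_polysemes(term_map):
    """Return the terms whose meaning differs between at least two domains."""
    meanings = {}
    for (domain, term), concept in term_map.items():
        meanings.setdefault(term, {})[domain] = concept
    return {
        term: by_domain
        for term, by_domain in meanings.items()
        if len(set(by_domain.values())) > 1
    }


def translate(domain, term, term_map=TERM_MAP):
    """Translate a domain's local term into the shared concept it refers to."""
    try:
        return term_map[(domain, term)]
    except KeyError:
        raise KeyError(f"no mapping for {term!r} in domain {domain!r}") from None
```

Here `find_polysemes(TERM_MAP)` flags "lead" and "active user" for the working group, while "session" is left alone because both domains mean the same thing by it.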

Federated computational governance seeks to provide trustworthy data governance and efficiency through platform automation while granting autonomy to data product teams. To accomplish this, tooling that enables data product teams to define their data in a manner that can be universally accepted must be made available to them.

Case Study: The Netflix Model

Over the years, Netflix has relied heavily on real-time processing technologies to hold on to its crown as the market leader in the digital entertainment industry. To handle its initial use cases, Netflix implemented the Keystone stream processing platform. 

As its business expanded and new use cases were discovered, Netflix had to re-evaluate its options moving forward. After thorough research, the team decided to implement a data mesh to address all their current and future use cases. 

For Netflix teams, the answer to “what is data mesh” looks something like this:

Data mesh is a general-purpose data movement and processing platform for moving data between Netflix systems at scale.

The project began as an attempt to address the need for change data capture (CDC). Over the previous year, growing demand for other kinds of requirements, in fields like machine learning and logging, prompted the team to implement data mesh.

As the system developed, more and more use cases were unlocked, and Netflix expanded the data mesh implementation to enable not only CDC use cases but also general data movement and processing use cases to:

  • Allow events from apps with a broader scope (not only databases).
  • Handle a growing number of database connectors, such as CockroachDB and Cassandra.
  • Explore and use additional processing patterns, including filters, projections, unions, and joins.

[Figure: Netflix Data Mesh Structure]
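
The four processing patterns named above (filters, projections, unions, and joins) are generic data operations. The sketch below shows them over in-memory lists of dictionaries using Python generators. It is not Netflix’s API or implementation; the function names and the event fields are invented for illustration, and a real pipeline would apply the same ideas to unbounded streams.

```python
from itertools import chain


def filter_events(events, predicate):
    """Filter: keep only the events for which predicate(event) is true."""
    return (e for e in events if predicate(e))


def project(events, fields):
    """Projection: keep only the named fields of each event."""
    return ({f: e[f] for f in fields} for e in events)


def union(*streams):
    """Union: concatenate several event streams into one."""
    return chain(*streams)


def join(left, right, key):
    """Inner hash join: index the right side by key, then stream the left side."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    for l in left:
        for r in index.get(l[key], []):
            # On a field-name clash, the left event's value wins.
            yield {**r, **l}
```

The steps compose naturally, for example `join(project(filter_events(union(a, b), pred), fields), reference_rows, "movie_id")`, and because everything is lazy, no work happens until the result is iterated.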

More details on how Netflix implemented data mesh can be found on the Netflix Technology Blog.

Practical Use Cases of Data Mesh

  • DevOps and IT: Data mesh offers a cutting-edge development methodology for software and data analytics teams. By enabling immediate access and querying ability to data from nearby locations without access restrictions, it lowers data latency.
  • Marketing and Sales: The distributed data architecture helps sales and marketing departments construct a 360-degree view of consumer behaviors and profiles from multiple systems and platforms. This allows them to design more focused campaigns, improve lead scoring accuracy, and forecast customer lifetime values (CLV), churn, and other crucial performance indicators.
  • Training in AI and ML: Data mesh helps development and intelligence teams build virtual data warehouses and data catalogs from different sources and feed Machine Learning (ML) and Artificial Intelligence (AI) models without having to consolidate data in a single location.
  • Loss Mitigation: Data mesh adoption in the financial sector accelerates time-to-insight while lowering operating expenses and operational hazards. In order to identify and stop fraud in real-time, distributed data analytics optimizes models of fraudulent behavior. It enables multinational financial institutions to evaluate data locally – inside a specific country or area – to discover fraud concerns without reproducing and sending data sets to their central database.
  • International Business: A decentralized data platform simplifies compliance with global data governance laws in order to enable global analytics across various locations while maintaining end-to-end data sovereignty and data residency compliance. 

Final Thoughts

Data mesh is a solution for organizations to get value from data at scale. It arose to address the ungovernability of data lakes and the bottlenecks of monolithic data warehouses. 

Because of its novel decentralized design and federated governance, data mesh enables end users to easily access and query data where it is without transferring or modifying it first. It functions as a multiplane platform that matures as the volume and diversity of data products, data producers, and consumers increase. As your business expands, concentrate on creating tools to tag, catalog, orchestrate, and search your data and organize them effectively using the principles of data mesh.

Executing data mesh requires a multifaceted organizational change. You need to devolve the organization’s decision-making structure to domains and automate traditionally manual processes to scale a data mesh implementation. As with any other change, moving to data mesh requires changing how you think, what you can do, and how you do things, which is why we suggest you start small. Early wins prove that the change works and help get people excited about making bigger changes.