Microservices Event Sourcing vs CDC Using Debezium

on CDC, Debezium, Microservices Event Sourcing • February 3rd, 2022

This article dives into Microservices Event Sourcing and how it compares to using Change Data Capture (CDC) with Debezium in your microservices architecture. It clarifies the value of each solution and explains both design patterns in detail, along with their pros and cons.

So why should you care about these solutions? Well, the current age of containerization and cloud computing has led to an exponential growth in the adoption of microservices and distributed data stores. To stay competitive, forward-thinking organizations are racing to take advantage of the increased agility, flexibility, resiliency, and scalability that distributed applications deliver.

Now, a common theme that you’ll encounter when engineering real-time, data-driven microservices is that you should capture your raw data and store it in multiple data stores, which helps you decouple it, preserve it, and replay or reprocess it later on. A microservices application might persist data in multiple relational databases, in-memory caches, object stores, key-value stores, full-text search indexes, graph databases, or even a data warehouse.

But with so many moving parts, data in these data stores can easily fall out of sync, get lost, or even become corrupted, resulting in catastrophic failures in mission-critical applications. You should therefore give some serious thought to how you capture that data from the different microservices and deliver it to downstream consumers in a way that makes it easier to manage over time.

Both solutions detailed in this article can help you solve this problem. Nonetheless, it is up to you to evaluate each proposed solution against your specific use case to determine which is the best for your microservices platform or application.

Database Design in a Microservices Architecture

Let’s imagine you are developing a ride-sharing application using the Microservice architecture pattern. Most services need to persist data in some kind of database. For example, the Driver Profile Service stores profile information about the driver, and the Driver Location Service stores and retrieves the driver’s location as well as their availability status.

Problem

Which database architectural pattern should you use in your microservices application?

Requirements

  • Services must be loosely coupled so that they can be developed, deployed, and scaled independently
  • Services need to update their database
  • Services need to send messages to other services and do it consistently
  • Some business transactions must update data owned by multiple services.
  • Some business transactions need to query data that is owned by multiple services.
  • Databases must sometimes be replicated and sharded in order to scale.
  • Different services have different data storage requirements. For some services, a relational database is the best choice. Other services might need a key-value database such as Redis, which is good at session management, or Amazon Neptune, to power graph applications.

Solution

Keep each microservice’s persistent data private by using a database per service approach. Other services shouldn’t access schemas or tables that they don’t own. Instead, data should be exposed via the Microservice’s API.

Resulting Context

This helps to ensure that the services are loosely coupled in such a way that a change to one service’s database will not have any impact on other services.

Also, each service in the system can make use of the type of database that is best suited to its needs. For example, a service that performs analytics can use Redshift, whereas a service that does full-text searches can use Elasticsearch.

However, this approach introduces some complexity. Managing multiple databases is a challenge in itself, and so is executing business transactions that span multiple microservices or running queries that aggregate data from multiple databases.

So how do we address these challenges? Event Sourcing and CDC with Debezium are two approaches that can help, by building systems that are message-driven and event-driven.

What is Microservices Event Sourcing?

In the world of relational databases, you often see a pattern where you persist entities by updating them in place. For example, let’s take an invoice record with 3 columns and an ‘issued’ status to start with.

{id:1, amount:50, status:'issued'}

30 days later, you run a job that identifies that the invoice hasn’t been paid yet and you update the status in that row of the table to ‘overdue’.

{id:1, amount:50, status:'overdue'}

10 days later, the customer pays that invoice and you update that column to ‘paid’.

{id:1, amount:50, status:'paid'}

One potential issue with this approach is that you’ve lost some of the rich history of this entity over time. The fact that the invoice was ‘overdue’ for some time is useful information. You might want to run some analytics down the road to identify which customers had trouble paying on time.

Let’s compare that to the approach taken by event sourcing, which is to store an ordered sequence of state-changing events. We start off with an IssuedEvent where you create an invoice with the same columns as before.

From that IssuedEvent, you can maintain a running point-in-time snapshot summary of what the invoice currently looks like. 30 days later, you run the same job that identifies that the invoice is overdue and you create an OverdueEvent to update the status to overdue and update your running summary/snapshot with that information.

Finally, 10 days later you create a PaidEvent that updates the status to ‘Paid’ and you update the snapshot with this information as well.  

IssuedEvent {id:1, amount: 50, status:'issued'}

OverdueEvent {id:1, amount: 50, status:'overdue'}

PaidEvent {id:1, amount: 50, status:'paid'}

invoice {id:1, amount: 50, status:'paid'}

With this approach, you’ve got a complete history of the invoice entity over time. This allows you to run analytics on all the data that you have and interestingly it also allows you to build that summary snapshot by replaying the events in order. So if you want to present this data differently to a different set of customers, you can maintain 2 different presentation layers or if one becomes corrupted you can rebuild it from a replay of all the events.
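The replay idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production event store: the `InvoiceEvent` class, the `apply` and `replay` functions, and the in-memory `event_store` list are all assumptions made for this example, and only the event names and fields come from the invoice example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InvoiceEvent:
    """An immutable, state-changing event (hypothetical shape for this sketch)."""
    kind: str      # "IssuedEvent", "OverdueEvent", or "PaidEvent"
    id: int
    amount: int
    status: str


def apply(snapshot, event):
    """Return a new snapshot with one event applied; the input is not mutated."""
    updated = dict(snapshot)
    updated.update(id=event.id, amount=event.amount, status=event.status)
    return updated


def replay(events):
    """Rebuild the snapshot by applying the events in order."""
    snapshot = {}
    for event in events:
        snapshot = apply(snapshot, event)
    return snapshot


# Append-only list standing in for the event store.
event_store = [
    InvoiceEvent("IssuedEvent", 1, 50, "issued"),
    InvoiceEvent("OverdueEvent", 1, 50, "overdue"),
    InvoiceEvent("PaidEvent", 1, 50, "paid"),
]

invoice = replay(event_store)
# The history survives, so questions about the past can still be answered.
was_overdue = any(e.kind == "OverdueEvent" for e in event_store)
```

Replaying only a prefix of the events, for example `replay(event_store[:2])`, gives the invoice as it looked at that point in time, which is what makes rebuilding a corrupted snapshot possible.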

Therefore, from the example, you can define Microservices Event Sourcing as an architectural pattern in which the state of the application is determined by a series of events. The systems that use the Microservices Event Sourcing pattern store all changes in an application’s state as a sequence of events. You don’t have one state stored in a relational database.

Instead, the state is computed from the events. A familiar analogy is a transactional database, which records every change in state in its transaction log. In event sourcing, the state is persisted as a series of events in a place called the event store. New events are appended, but they never overwrite old ones. As a result, all the history is maintained.

There are usually other patterns used in conjunction with Microservices Event Sourcing. They are:

Domain-Driven Design

This is a software design philosophy that focuses on the domain model behind your application. Most of the terminology used in Microservices Event Sourcing comes from domain-driven design.

CQRS (Command Query Responsibility Segregation)

This is a design pattern that allows you to separate the read and write paths to your data. It does this by separating commands from queries. Reads/writes have different concerns.  When you write to a database, you are writing entities but when you do reads, you are reading aggregate data.

It therefore becomes difficult to optimize for both reads and writes unless you separate them. For example, to speed up reads you can keep adding indexes to a table, but every extra index slows down writes. Removing indexes speeds the writes up again but slows the reads down.

CQRS solves this by having a separate model for queries (presentation/aggregate model) and a separate model for commands (entity-based model) that use separate data paths for maximum performance.
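As a rough sketch of that separation, the Python below keeps an entity-based write model and a pre-aggregated read model side by side. The class names, the `customer` field, and the direct callback between the two sides are assumptions for illustration; in a real system the change would typically travel asynchronously over a message broker.

```python
from collections import defaultdict


class InvoiceWriteModel:
    """Command side: stores individual invoice entities, keyed by id."""

    def __init__(self):
        self.invoices = {}
        self.subscribers = []   # callbacks that receive every change

    def issue_invoice(self, invoice_id, customer, amount):
        self.invoices[invoice_id] = {
            "customer": customer, "amount": amount, "status": "issued"}
        self._publish({"type": "issued", "customer": customer, "amount": amount})

    def pay_invoice(self, invoice_id):
        invoice = self.invoices[invoice_id]
        invoice["status"] = "paid"
        self._publish({"type": "paid", "customer": invoice["customer"],
                       "amount": invoice["amount"]})

    def _publish(self, change):
        for subscriber in self.subscribers:
            subscriber(change)


class OutstandingBalanceReadModel:
    """Query side: keeps an aggregate (outstanding balance per customer)."""

    def __init__(self):
        self.balance = defaultdict(int)

    def on_change(self, change):
        if change["type"] == "issued":
            self.balance[change["customer"]] += change["amount"]
        elif change["type"] == "paid":
            self.balance[change["customer"]] -= change["amount"]

    def outstanding(self, customer):
        return self.balance[customer]


writes = InvoiceWriteModel()
reads = OutstandingBalanceReadModel()
writes.subscribers.append(reads.on_change)

writes.issue_invoice(1, "acme", 50)
writes.issue_invoice(2, "acme", 30)
writes.pay_invoice(1)
acme_outstanding = reads.outstanding("acme")
```

The query never scans the invoice entities; it reads a value that was already aggregated when the writes happened, so each side can be tuned independently.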

Distributed Logs

A distributed log is the most popular way to implement the storage of the events that are generated in the Microservices Event Sourcing approach.

The Microservices Event Sourcing Design Process

Terminology Used in Microservices Event Sourcing

Commands – These are the instructions that your application issues to the Microservices Event Sourcing system.

Event – As you process commands, you turn them into events that are immutable facts or a sequence of state changes.

Aggregate – Built-up view of your entity.

Projection/Materialized View – This is the view that you build from your event-sourced sequence of events to present to downstream consumers.

Design Process

  1. Model your domain as a series of commands, events, aggregates, and views.
  2. Design the command handlers or the code that will generate the events and store them in the message broker (primary data store).
  3. Design the event processor(s) on the other side of your message broker (e.g., Kafka or RabbitMQ) that will create your projection.
  4. Finally, you design the read side that will read the data from that projection, e.g. in BigQuery or Elasticsearch.
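The four steps above can be sketched end to end in Python. Everything here is a stand-in chosen for illustration: a plain list plays the message broker, a dictionary plays the projection store, and the command names `IssueInvoice` and `PayInvoice` are hypothetical.

```python
event_log = []    # append-only stand-in for the message broker
projection = {}   # read-side view: invoice id -> current status


def current_status(invoice_id):
    """Rebuild the aggregate state of one invoice from its events."""
    status = None
    for event in event_log:
        if event["id"] == invoice_id:
            status = event["status"]
    return status


def handle_command(command):
    """Validate a command against the aggregate, then append the resulting event."""
    status = current_status(command["id"])
    if command["name"] == "IssueInvoice":
        if status is not None:
            raise ValueError("invoice already exists")
        event = {"type": "IssuedEvent", "id": command["id"],
                 "amount": command["amount"], "status": "issued"}
    elif command["name"] == "PayInvoice":
        if status not in ("issued", "overdue"):
            raise ValueError("invoice cannot be paid")
        event = {"type": "PaidEvent", "id": command["id"], "status": "paid"}
    else:
        raise ValueError("unknown command")
    event_log.append(event)
    return event


def run_event_processor(from_offset):
    """Apply events from from_offset onward to the projection; return the next offset."""
    for event in event_log[from_offset:]:
        projection[event["id"]] = event["status"]
    return len(event_log)


def query_status(invoice_id):
    """Read side: answers queries from the projection only."""
    return projection.get(invoice_id)


handle_command({"name": "IssueInvoice", "id": 1, "amount": 50})
handle_command({"name": "PayInvoice", "id": 1})
next_offset = run_event_processor(0)

# A command that violates the aggregate's rules produces no event.
rejected = False
try:
    handle_command({"name": "PayInvoice", "id": 1})
except ValueError:
    rejected = True
```

Note that the projection is only as fresh as the last run of the event processor, which is where the eventual consistency of this pattern comes from.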

Change Data Capture (CDC) with Debezium — What is it About?

Most of the time, microservices applications start with a few services and a minimal data footprint in their architecture. If you consider a simple web application, initially, a single NoSQL or relational database can fulfill its needs. For example, a MySQL, PostgreSQL, or MongoDB database server is more than enough to handle CRUD operations for the application.

However, as the application evolves and starts getting more traction in the market, you need to add more services that use different data storage systems in their architecture to provide a better user experience. This forces you to deploy multiple data systems that support different data formats and data access patterns. For example, you will need:

  • Cache – to speed up your reads
  • Search Index – to perform a full-text search across your application
  • Data Warehouse – for rich analytics.

Practically speaking, there is no single database that can satisfy all these needs simultaneously. This forces us to store our data in multiple places, in a redundant and denormalized manner. When you have multiple data systems in place, you need to classify the data that is stored in them. You can classify the data into 2 categories:

  1. System of Record (SoR)
  2. Derived data

When you have more than one version of the same dataset, you need to appoint an authoritative version, or source of truth, so that when there is a discrepancy across versions, the source of truth is accepted as the correct one. This is the version that we refer to as the System of Record, or source data.

The very first time a user creates data, it’s captured in the SoR. For example, when a customer creates an order, it’s first recorded in the Orders database. Other services can then take this data, apply some transformations, and store it for different purposes. This data is called derived data. It’s redundant and in a denormalized form. If a service loses the derived data, it can always recreate it from the source.

Source data and derived data should not be kept in silos. Instead, both have to be synchronized to make the application’s state consistent. A change made in the database has to be reflected in the search index, cache, and ultimately in the data warehouse. Therefore, there should be a way of capturing the changes done in the Microservice’s data store and propagating them to derived data systems and services in a reliable and scalable manner.

This is where the practice of CDC comes into play. CDC is the process of observing all data changes written to a database and extracting them in a form that can be replicated to derived data systems. You can think of CDC as a process that continuously watches the transaction log of the source data system for any changes, extracts them out, and propagates them to downstream systems.

CDC with Debezium

Debezium is an open-source CDC tool that was initiated by Red Hat. Debezium’s architecture centers around Apache Kafka, which is used as the Event Bus. There are large production deployments of Debezium at companies such as Shopify, Vimeo, WePay, Ubisoft, and many others. Log-based CDC for multiple databases is the main reason why most companies use Debezium. Its connectors are built on Kafka Connect and perform the log-based change detection and propagation.

In terms of connectors, the challenge with the log-based approach is that there isn’t one single API that you could implement and then have log-based CDC for all databases. Instead, you need to have bespoke and dedicated connectors for each database. Currently, Debezium has connectors that allow you to pull change streams from Oracle, SQL Server, DB2, PostgreSQL, MySQL, MongoDB, Vitess, and Cassandra. They then send that data to Kafka.

In a nutshell, Debezium is like an observer for your database. Whenever a change happens in your database (an INSERT, UPDATE, or DELETE), Debezium reacts by tapping into the transaction log, capturing the change, and sending it to the consumers. Changes are propagated to consumers via Apache Kafka.

Whenever a transaction gets executed in a database, the database will append entries to its transaction log. Debezium goes to the transaction log and extracts the changes from there. This means you get all the changes in the right order.
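To make the log-tailing idea concrete, here is a toy Python model of it. It is not how Debezium itself is implemented; `SourceDatabase`, `LogTailer`, and `apply_to_cache` are hypothetical names, and the "transaction log" is just a list that the toy database appends to on every change.

```python
class SourceDatabase:
    """Toy database that appends every committed change to a transaction log."""

    def __init__(self):
        self.rows = {}
        self.transaction_log = []

    def upsert(self, key, value):
        operation = "UPDATE" if key in self.rows else "INSERT"
        self.rows[key] = value
        self.transaction_log.append((operation, key, value))

    def delete(self, key):
        del self.rows[key]
        self.transaction_log.append(("DELETE", key, None))


class LogTailer:
    """Reads the log from a remembered position, so changes arrive in commit order."""

    def __init__(self, database):
        self.database = database
        self.position = 0

    def poll(self):
        changes = self.database.transaction_log[self.position:]
        self.position += len(changes)
        return changes


def apply_to_cache(cache, changes):
    """Keep a derived store (here a cache) in step with the source."""
    for operation, key, value in changes:
        if operation == "DELETE":
            cache.pop(key, None)
        else:
            cache[key] = value


db = SourceDatabase()
tailer = LogTailer(db)
cache = {}

db.upsert("order-1", "created")
db.upsert("order-1", "shipped")
db.upsert("order-2", "created")
db.delete("order-2")

first_batch = tailer.poll()
apply_to_cache(cache, first_batch)
second_batch = tailer.poll()   # nothing new has been committed since
```

Because the tailer only reads what the database has already logged, the application code that writes to the database does not need to know that any downstream system exists.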

How to Use Change Data Capture (CDC) with Debezium for Microservices Architecture

Requirements

  1. Message ordering guarantee – The changes must be delivered in the order in which they occur otherwise there can be inconsistent states in downstream systems.
  2. Support for asynchronous Pub/Sub-style change propagation – That way, downstream systems can be added or removed at any time without impacting the overall system.
  3. Reliability and resiliency – When it comes to the delivery of messages, the CDC system must support an at-least-once delivery guarantee. It cannot tolerate message loss, because a downstream system that misses a change event can make the whole system inconsistent.
  4. Support for light-weight message transformation – This is because the data formats expected by downstream systems may be different.
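One consequence of the at-least-once requirement above is that a consumer may see the same change more than once, for example when a batch is replayed after a restart. A common way to cope is to make the consumer idempotent. The sketch below assumes each message carries a monotonically increasing offset within a single ordered stream; the `IdempotentConsumer` class and the driver data are made up for this illustration.

```python
class IdempotentConsumer:
    """Applies each change at most once by remembering the last applied offset."""

    def __init__(self):
        self.last_applied_offset = -1
        self.state = {}

    def handle(self, offset, key, value):
        """Return True if the change was applied, False if it was a duplicate."""
        if offset <= self.last_applied_offset:
            return False
        self.state[key] = value
        self.last_applied_offset = offset
        return True


# Offsets 0 and 1 are delivered, then the stream is replayed from offset 0.
deliveries = [
    (0, "driver-7", "offline"),
    (1, "driver-7", "available"),
    (0, "driver-7", "offline"),      # redelivered
    (1, "driver-7", "available"),    # redelivered
    (2, "driver-7", "busy"),
]

consumer = IdempotentConsumer()
applied = [consumer.handle(offset, key, value) for offset, key, value in deliveries]
```

Without the offset check, the redelivered `offline` message would briefly move the driver back to an older state, which is exactly the kind of inconsistency the ordering requirement is meant to prevent.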

These requirements are difficult to achieve, but they are necessary. They have already proven themselves as the basis for building asynchronous, loosely coupled, and massively scalable Microservices systems.

We can adapt a system to fit this data access pattern by introducing an Event Bus. Soon after a change is detected by the log mining component, in this case, Debezium, it is published as an event to the Event Bus. Debezium creates a Kafka event for every database change by tailing the log of all the database changes and publishing it to the Kafka message broker. The Event Bus will take care of propagating the event into the target systems. 

Each change event therefore carries complete information about the row before and after the change, along with useful metadata such as what sort of operation it was, when it happened, and what server it came from. When a connector is first created, Debezium takes a snapshot of all the configured tables before it starts capturing changes.
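As an illustration of consuming such an event, the Python sketch below applies a simplified Debezium-style change event to an in-memory index. The `before`, `after`, `source`, `op`, and `ts_ms` fields and the `c`/`u`/`d`/`r` operation codes follow Debezium's event envelope, but the event here is hand-written and trimmed: real events carry more `source` metadata and, depending on converter settings, may be wrapped together with a schema. The `apply_change` function and the invoice rows are assumptions for this example, and it assumes the `before` image of a delete includes the primary key.

```python
import json

# Hand-written, simplified example of an update event for the invoices table.
raw_update = json.dumps({
    "before": {"id": 1, "amount": 50, "status": "issued"},
    "after": {"id": 1, "amount": 50, "status": "overdue"},
    "source": {"connector": "postgresql", "table": "invoices"},
    "op": "u",
    "ts_ms": 1643846400000,
})

raw_delete = json.dumps({
    "before": {"id": 1, "amount": 50, "status": "overdue"},
    "after": None,
    "source": {"connector": "postgresql", "table": "invoices"},
    "op": "d",
    "ts_ms": 1643846460000,
})


def apply_change(index, event):
    """Apply one change event to a derived store keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):        # create, snapshot read, update
        row = event["after"]
        index[row["id"]] = row
    elif op == "d":                  # delete: only the old row image is present
        row = event["before"]
        index.pop(row["id"], None)
    else:
        raise ValueError(f"unsupported operation: {op}")


search_index = {}
apply_change(search_index, json.loads(raw_update))
status_after_update = search_index[1]["status"]
apply_change(search_index, json.loads(raw_delete))
```

Treating snapshot reads (`r`) the same way as creates means the same consumer code can handle both the initial snapshot and the ongoing change stream.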

So there are three stages to this:

  1. Change event generation
  2. Change event ingestion 
  3. Change event propagation

Comparison: Microservices Event Sourcing vs CDC with Debezium

Now that we have covered both Microservices Event Sourcing and CDC with Debezium, let’s look at some of the benefits and drawbacks of each.

Benefits of Microservices Event Sourcing

  1. The source of truth (SoR) data is easily accessible, which is great for auditing purposes
  2. An event-sourced architecture is highly performant for a high volume of write operations to the SoR database
  3. It supports sharding for huge datasets (depending on datastore)

Drawbacks of Microservices Event Sourcing

  1. Microservices Event Sourcing does not provide strong consistency. Rather, the data becomes eventually consistent
  2. From a query perspective, it’s impossible to read your own writes to the ledger
  3. Microservices Event Sourcing requires you to write a lot of extra code to compensate for error cases
  4. Microservices Event Sourcing does not promise any transactional guarantees for resolving the dual writes flaw, which could result in missing data
  5. You need to account for backward compatibility with legacy data because the event data format can change over time
  6. An event-sourced system does not guarantee snapshots of the ledger. You have to provide for them yourself and deal with the implications
  7. There are very few developers who have experience using Microservices Event Sourcing
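The dual writes flaw mentioned in the list above arises when a service updates two systems, such as its own database and a message broker, without a transaction that spans both. The Python sketch below simulates it under simple assumptions: a dictionary stands in for the database, and the hypothetical `FlakyBroker` fails on a chosen publish call.

```python
class BrokerDown(Exception):
    """Raised by the toy broker to simulate an outage."""


class FlakyBroker:
    def __init__(self, fail_on):
        self.messages = []
        self.fail_on = fail_on   # set of publish call numbers that should fail
        self.calls = 0

    def publish(self, message):
        self.calls += 1
        if self.calls in self.fail_on:
            raise BrokerDown("broker unavailable")
        self.messages.append(message)


def save_with_dual_write(database, broker, key, value):
    database[key] = value           # write 1: succeeds and stays committed
    broker.publish((key, value))    # write 2: may fail, and nothing undoes write 1


database = {}
broker = FlakyBroker(fail_on={2})

save_with_dual_write(database, broker, "invoice-1", "issued")
try:
    save_with_dual_write(database, broker, "invoice-1", "paid")
except BrokerDown:
    pass   # the caller sees an error, but the database already says "paid"

# Downstream consumers only ever saw the first change.
last_published = broker.messages[-1]
```

After the failure, the database holds `paid` while consumers still believe the invoice is `issued`. Log-based CDC avoids this particular gap because the application performs a single write and the change event is derived from the database's own transaction log.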

Benefits of CDC with Debezium

  1. The source of truth data is maintained in the application’s database tables and transaction log
  2. It offers reliable messaging and transactional guarantees, which minimizes the possibility of data corruption or loss
  3. It’s a more flexible solution
  4. It is easier to maintain due to its simple design
  5. It allows you to read and query your own writes
  6. It is strongly consistent within the application’s database and eventually consistent across the entire system

Drawbacks of CDC with Debezium

  1. This approach can add extra latency to the database when reading the transaction log or when polling the message broker

Conclusion

Your application’s microservices architecture has to use multiple data storage systems to cater to different data formats and access patterns. To consistently synchronize source data across these systems, CDC with Debezium offers a more comprehensive solution that eliminates the Dual Writes flaw. For many developer teams, it emerges as a more practical solution than an event-sourced approach.
