How we ensure Zero Data Loss at Hevo

September 11th, 2017


Hevo is used by companies to build reliable, fault-tolerant data pipelines for their business analytics. If you are a data engineer or a data analyst, you know how big a problem data reliability is in business analytics. Incomplete and inaccurate data erodes a company’s faith in analytics as a whole, and teams start making decisions on a hunch.

When we began building Hevo, the one promise we wanted to fulfill was data accuracy. We believe it is the cornerstone of a company’s analytics infrastructure, and not focusing on it would mean letting our customers down.

In principle, this means that whenever an event enters Hevo’s data pipeline, it is guaranteed to reach the destination. We chose to build our data pipelines on Kafka, which gives an at-least-once delivery guarantee.

Our pipelines have four essential stages – ingestion, transformations, mapper, and destination. In the happy path, an event enters the ingestion stage and moves through the subsequent stages until it reaches the destination. As long as the application doesn’t leak events, Kafka itself makes sure that no data is lost.
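As a rough illustration (not Hevo’s actual code), the sketch below shows how at-least-once processing typically looks with a Kafka consumer in Java: auto-commit is disabled and offsets are committed only after a batch has been fully handled, so a crash mid-processing leads to re-delivery rather than loss. The topic and group names are hypothetical.

```java
// A minimal sketch (not Hevo's actual code) of at-least-once consumption:
// offsets are committed only after the records have been processed.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "transformation-stage");     // hypothetical consumer group
        props.put("enable.auto.commit", "false");          // we commit offsets manually
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("ingested-events")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value()); // hand the event to the next stage
                }
                // Commit only after every record in the batch has been handled;
                // if the process dies before this line, the batch is re-delivered.
                consumer.commitSync();
            }
        }
    }

    private static void process(String event) {
        // transformation / mapping / loading would happen here
    }
}
```

If the process crashes between processing and the commit, the same events are delivered again – duplicates are possible under at-least-once, but nothing is lost.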

So far so good. But what if an event cannot be processed for logical reasons – say, a buggy transformation, a data type mismatch, or an unreachable destination? Such failures happen all the time, and we needed a safe mechanism to temporarily hold these events so the user could investigate, fix, and replay them.

Inspired by SQS’s dead letter queue, we evaluated whether holding these events in another Kafka topic would be enough. Parking them in a Kafka topic would preserve their order and guarantee delivery. But this approach has a few issues:

Logical partitioning of events

Since we expect the user to investigate these failed events, we wanted to partition them logically so that the user could take care of one error at a time. We came up with the following dimensions on which failed events could be partitioned:

  • Pipeline
  • Event Name
  • Stage (at which event was parked)
  • Error Type (e.g. missing columns, data type mismatch in a column, unreachable destination)

Apart from these dimensions, we wanted to give the user the ability to filter events over any time window.

A typical analytics setup has 10–20 pipelines, each with 200–500 distinct event names. Combined with 4 stages and over 50 possible error types, this can result in over 500K logical partitions for each time interval. We would need one topic per partition so that the user could pick and replay events from a particular partition.
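To make the arithmetic concrete, here is a purely illustrative sketch of the tuple that identifies one such logical partition; the class name and field values are our own examples, not Hevo’s code.

```java
// Illustrative only: the tuple that identifies one logical partition of failed events.
public record FailedEventPartition(String pipeline,   // e.g. "prod-mysql"
                                   String eventName,  // e.g. "orders"
                                   String stage,      // ingestion | transformations | mapper | destination
                                   String errorType)  // e.g. "DATA_TYPE_MISMATCH"
{
    public static void main(String[] args) {
        // Even a mid-sized setup: 10 pipelines x 250 event names x 4 stages x 50 error types
        long partitions = 10L * 250 * 4 * 50;
        System.out.println(partitions + " logical partitions per time interval"); // prints 500000
    }
}
```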

This didn’t seem to be a scalable approach.

Retention

Kafka forces you to choose a retention policy based on time or size, which means we cannot keep these events in the queue indefinitely while the user takes time to fix issues. That restriction clashes with the basic premise of ensuring zero data loss. A very long retention policy doesn’t help either, as it quickly becomes unmanageable.

Manageability

Keeping track of thousands of topics that hold parked events would effectively mean using Kafka as a persistent data store, with all the management overhead that any persistent store brings.

For these reasons, we had to ditch the “Dead Letter Queue” design.

We went back to the drawing board and settled on a rather primitive way of managing data: Write-Ahead Logs (WALs). WALs are fast to write, can be partitioned dynamically, and can be stored on S3 forever.

This design gives us complete control over partitioning and retention, and S3 takes away the management hassles. When a user replays a partition, we simply replay all the files belonging to that partition in the same order in which they were written.
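Below is a minimal sketch of how such a scheme could work, assuming the partition dimensions are encoded into the S3 object key so that one logical partition maps to one key prefix. The bucket name, key format, and helper methods are hypothetical, and pagination is ignored for brevity.

```java
// Illustrative only: replaying a logical partition by listing its S3 prefix
// and re-publishing the WAL files in write order.
import java.util.Comparator;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.S3Object;

public class FailedEventReplayer {
    private static final String BUCKET = "failed-events"; // hypothetical bucket
    private final S3Client s3 = S3Client.create();

    // One WAL file per flush: pipeline/eventName/stage/errorType/<epoch-millis>-<seq>.wal
    String keyFor(String pipeline, String event, String stage, String error,
                  long writtenAtMillis, long sequence) {
        return String.format("%s/%s/%s/%s/%013d-%06d.wal",
                pipeline, event, stage, error, writtenAtMillis, sequence);
    }

    // Replay every file parked under one logical partition, in the order written.
    void replay(String pipeline, String event, String stage, String error) {
        String prefix = String.format("%s/%s/%s/%s/", pipeline, event, stage, error);
        s3.listObjectsV2(ListObjectsV2Request.builder()
                        .bucket(BUCKET).prefix(prefix).build())
          .contents().stream()
          // Zero-padded timestamps make lexicographic order equal to write order.
          .sorted(Comparator.comparing(S3Object::key))
          .forEach(file -> republish(file.key()));
    }

    private void republish(String key) {
        // Download the WAL file and push its events back into the pipeline (omitted).
    }
}
```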

The design has some other benefits too:

Discarding events is cheap

The user may want to discard old failed events for various genuine reasons. To do that, we simply mark the files as discarded in our tracker. A clean-up job later removes both replayed and discarded files from S3.

Metadata is accurate and readily available

While writing files to S3, we also store metadata about each file (number of events, size in bytes, timestamps) in the tracker. This metadata gives the user important cues for debugging the issues.
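As a sketch of the idea (the field names and states below are our own illustration, not Hevo’s schema), one tracker entry per parked WAL file might look like this:

```java
import java.time.Instant;

// Illustrative only: one tracker entry per WAL file parked on S3.
public record ParkedFile(String s3Key,          // where the WAL file lives
                         String pipeline,
                         String eventName,
                         String stage,
                         String errorType,
                         long eventCount,       // number of events in the file
                         long sizeBytes,        // file size in bytes
                         Instant firstEventAt,  // earliest event timestamp
                         Instant lastEventAt,   // latest event timestamp
                         Status status) {

    // Replaying or discarding a file is just a status change in the tracker;
    // a clean-up job later deletes REPLAYED and DISCARDED files from S3.
    public enum Status { PARKED, REPLAYED, DISCARDED }
}
```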

Portability

Files are extremely portable. The user can download them for further debugging, whereas Kafka topics would have required us to find other ways to give the user raw access to failed events.

We at Hevo are committed to helping our customers build data pipelines that deliver accurate data in real time. This design approach goes a long way toward fulfilling that commitment.

We would love to hear your feedback on our design approach. Give us a shout at dev@hevodata.com or chat with us through our website.
