We built this to be reliable. That’s still true.
When we built Hevo’s pipeline engine, we made a deliberate set of choices. Simplicity over cleverness. Reliability over raw speed. Transparency over silent magic. A pipeline that moves data correctly, predictably, and visibly – every time.
Those choices paid off. Thousands of pipelines running across more than 2,000 customers. Connectors to 150+ sources. Data flows into Snowflake, BigQuery, and Redshift reliably, day after day.
But data doesn’t stand still. The volumes are larger. The expectations are higher. And the teams depending on these pipelines – data engineers, analytics leads, RevOps – need more from infrastructure than just correctness. They need speed. They need it to scale without thinking about it.
Reliability is the baseline. Performance is the promise we make on top of it.
So earlier this year, we did something we don’t do lightly. We benchmarked ourselves. Not against a competitor. Against our own standards – using real customer data profiles, real schemas, and real volumes. And we asked, “Is this pipeline as fast as it should be?”
The answer was: it’s good. And we can make it significantly better.
Why this matters now: AI runs on your data
There’s a reason pipeline performance has moved from “nice to have” to “critical infrastructure” in the last two years. AI.
Every LLM integration, every ML model, every AI-powered workflow in your product starts with one question: is the underlying data fresh, complete, and trustworthy? A pipeline that takes three hours to sync isn’t just slow – it’s a ceiling on what your AI can know.
> The principle: AI is only as good as the data underneath it. And the data is only as good as the pipeline moving it. Speed, reliability, and transparency aren’t pipeline features – they’re AI features.
This is what shaped our benchmarking work. Not just “How do we go faster?” but “What does a pipeline need to look like when it’s the foundation of an AI-powered data stack?”
Three things matter:
- Fast enough that freshness is measured in minutes, not hours
- Reliable enough that you never have to wonder if the data made it
- Transparent enough that when something changes, you know immediately
The benchmarking work we’re describing here is about the first one. The other two have always been core to how Hevo is built – and they remain non-negotiable.
What the benchmarks showed
We ran structured benchmarks using customer-inspired data profiles – real schemas, realistic row counts, representative data shapes. Five runs each. We measured at the component level, not just end-to-end, so we knew exactly where time was going.
The headline numbers:
- Ingestion time: 180 min → ~40 min
- Ingestion-to-load ratio: 3–4× → ~1×
- Querier: 4× improvement
- Mapping: 10× faster

*These figures are for SQL Server pipelines; see the note at the end of this post.
Ingestion was consuming 3–4× more time than loading. That ratio told us the work was in the pipeline engine itself – specifically in how data was being transformed, written, and compressed between source and destination.
The load side was already doing its job well. The real opportunity was on the ingestion side. Not because anything was broken, but because it had been designed for a time when data volumes were smaller and demands were lighter.
Things have changed since then. The scale is bigger, the expectations are higher, and the engine needs to evolve to keep up.
How we approached it: component by component
The methodology mattered as much as the fixes. Before writing a single line of optimisation code, we isolated every major component of the pipeline and measured its individual throughput ceiling. Think of it like diagnosing a road network: you don’t fix the motorway until you know which junction is the bottleneck.
A Hevo ingestion pipeline has two major stages:
- Stage 1 – Source reader: Reads from the external source, handles network interaction, and converts data into Hevo’s internal format. Source-specific. Optimisations here are per-connector.
- Stage 2 – Destination mapper: Takes internal-format records, maps to destination types, writes to CSV, compresses, and uploads to staging. This stage is source-agnostic – every connector writing to Snowflake goes through the same code path. Wins here are universal.
We ran each stage in isolation – synthetic load on Stage 2 to find its raw ceiling, and Stage 1 measured against a drop sink so that no downstream noise affected the reading. Once we had individual capacity numbers, we knew exactly where to focus. A sketch of the drop-sink idea follows.
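To make the isolation concrete, here is a minimal sketch in Java. The `Record`, `SourceReader`, and `DropSink` names are hypothetical stand-ins for internal abstractions, not Hevo’s actual code – the point is simply that the sink counts and discards, so the measured rate reflects Stage 1 alone:

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-ins for the pipeline's internal abstractions.
interface Record { int serializedSize(); }
interface SourceReader { Record next(); } // returns null when exhausted
interface Sink { void accept(Record r); }

/** A sink that discards every record, so only the reader's cost is measured. */
class DropSink implements Sink {
    final LongAdder records = new LongAdder();
    final LongAdder bytes = new LongAdder();

    @Override
    public void accept(Record r) {
        records.increment();
        bytes.add(r.serializedSize());
        // Intentionally no downstream work: no mapping, no CSV, no compression.
    }
}

class StageOneBenchmark {
    /** Drains the reader into the drop sink and reports MB/s for Stage 1 alone. */
    static double measure(SourceReader reader, DropSink sink) {
        long start = System.nanoTime();
        Record r;
        while ((r = reader.next()) != null) {
            sink.accept(r);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return sink.bytes.sum() / (1024.0 * 1024.0) / seconds;
    }
}
```

Running Stage 2 against synthetic records works the same way in reverse: feed pre-generated records in as fast as the mapper can take them, and its own ceiling falls out.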
What we changed
The mapper: from reactive to pre-compiled
The biggest win came from rethinking how we handle type conversion at write time. The original mapper was reactive – for every row, for every column, it looked up the source type, looked up the destination type, found the right converter, and applied it. Correct. Just not fast at scale.
The compiled mapper builds its conversion logic once, at initialisation, when both schemas are known. Every row after that just follows pre-built instructions. The per-row lookup overhead is eliminated. This change is universal across all destinations.
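Here is a minimal sketch of the idea in Java – the `CompiledMapper` class and the type names are illustrative, not Hevo’s actual implementation. What matters is that the converter lookup moves from the per-row hot loop into the constructor:

```java
import java.util.List;
import java.util.function.Function;

/** Sketch of a pre-compiled mapper; schema types are hypothetical strings. */
class CompiledMapper {
    // One converter per column, resolved once when both schemas are known.
    private final Function<Object, Object>[] converters;

    @SuppressWarnings("unchecked")
    CompiledMapper(List<String> sourceTypes, List<String> destTypes) {
        converters = new Function[sourceTypes.size()];
        for (int i = 0; i < converters.length; i++) {
            // The lookup that used to happen per row, per column now
            // happens exactly once per column, at initialisation.
            converters[i] = resolveConverter(sourceTypes.get(i), destTypes.get(i));
        }
    }

    /** Per-row work is now just an indexed call: no type lookups remain. */
    Object[] map(Object[] row) {
        Object[] out = new Object[row.length];
        for (int i = 0; i < row.length; i++) {
            out[i] = converters[i].apply(row[i]);
        }
        return out;
    }

    private static Function<Object, Object> resolveConverter(String src, String dst) {
        // Illustrative only; a real mapper covers the full type matrix.
        if (src.equals("DATETIME") && dst.equals("TIMESTAMP_NTZ")) {
            return v -> v.toString(); // placeholder conversion
        }
        return v -> v; // identity for types that need no conversion
    }
}
```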
Serialisation: right-sized for the job
Standard libraries are built for the general case – every format, every edge case, every legacy scenario imaginable. That breadth comes at a cost. We audited our serialisation layer and replaced general-purpose implementations with narrow, purpose-built ones scoped exactly to what our destination actually requires. Less branching. Fewer code paths. The same correctness, at lower overhead.
A library that handles everything is a liability when you only need one thing.
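As an illustration of what “narrow” means in practice, here is a hypothetical sketch of a CSV field serialiser hardcoded to a single dialect – comma delimiter, doubled-quote escaping, quoting only when needed. No configurable delimiters, quote modes, or locale handling, because the destination ingests exactly one format:

```java
/**
 * Sketch of a narrow CSV serialiser: one delimiter, one quote style,
 * quoting only when required. No configuration branches in the hot path.
 */
final class NarrowCsvSerializer {
    private final StringBuilder buf = new StringBuilder(64 * 1024);

    void writeField(String value) {
        if (needsQuoting(value)) {
            buf.append('"');
            for (int i = 0; i < value.length(); i++) {
                char c = value.charAt(i);
                if (c == '"') buf.append('"'); // escape by doubling the quote
                buf.append(c);
            }
            buf.append('"');
        } else {
            buf.append(value); // fast path: no escaping, no extra allocation
        }
        buf.append(',');
    }

    /** Assumes at least one field was written for the current row. */
    void endRow() {
        buf.setCharAt(buf.length() - 1, '\n'); // replace trailing comma
    }

    private static boolean needsQuoting(String v) {
        for (int i = 0; i < v.length(); i++) {
            char c = v.charAt(i);
            if (c == ',' || c == '"' || c == '\n' || c == '\r') return true;
        }
        return false;
    }
}
```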
Write path: eliminating conversion overhead
Data moving through a pipeline passes through several transformation layers before it’s written to disk. Over time, each layer boundary can introduce small inefficiencies – redundant conversions, unnecessary passes over the same data, and escape logic that runs on every value regardless of whether it’s needed. We audited the full write path, removed redundant conversions, and replaced multi-pass implementations with single-pass equivalents. Unglamorous yet impactful.
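One concrete example of the kind of boundary cost involved – the class below is a hypothetical sketch, not Hevo’s code. Converting a numeric value to a `String` and then to bytes allocates twice per value; a single pass can write the digits straight into the output buffer:

```java
import java.nio.charset.StandardCharsets;

/** Sketch: eliminating a per-value String round trip on the write path. */
final class DirectLongWriter {
    // Before: two conversions, two allocations per value.
    static byte[] slow(long v) {
        String s = Long.toString(v);                  // pass 1: long -> String
        return s.getBytes(StandardCharsets.US_ASCII); // pass 2: String -> bytes
    }

    // After: one pass, digits written directly into the output buffer.
    static int fast(long v, byte[] out, int pos) {
        if (v == Long.MIN_VALUE) { // negation would overflow; take the slow path
            byte[] s = slow(v);
            System.arraycopy(s, 0, out, pos, s.length);
            return pos + s.length;
        }
        if (v < 0) { out[pos++] = '-'; v = -v; }
        int start = pos;
        do {
            out[pos++] = (byte) ('0' + (v % 10));
            v /= 10;
        } while (v != 0);
        // Digits were produced least-significant first; reverse in place.
        for (int i = start, j = pos - 1; i < j; i++, j--) {
            byte t = out[i]; out[i] = out[j]; out[j] = t;
        }
        return pos; // new write position
    }
}
```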
Compression: speed over ratio, CPU freed
On a 4-core machine with one mapper thread, we expected ~25% CPU usage. We were seeing 60–65%. The culprit was the background gzip compression thread – running at maximum compression ratio, using a small buffer that triggered frequent OS kernel calls.
We switched to best-speed compression and increased the buffer size. Network bandwidth wasn’t the bottleneck, so the lower compression ratio cost us nothing. The upload thread’s CPU usage dropped from ~35% to ~5%. That freed nearly a full core of headroom we can now use for additional mapper threads.
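For readers who want the mechanics: assuming the JDK’s `GZIPOutputStream` for illustration, the compression level isn’t exposed in the constructor, so the change looks roughly like this (buffer sizes here are illustrative, not our tuned values):

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.Deflater;
import java.util.zip.GZIPOutputStream;

/**
 * GZIPOutputStream hides the compression level, so the usual trick is a
 * small subclass that resets the protected Deflater to BEST_SPEED.
 */
class FastGzipOutputStream extends GZIPOutputStream {
    FastGzipOutputStream(OutputStream out, int bufferSize) throws IOException {
        super(out, bufferSize);            // deflater buffer (default is 512 B)
        def.setLevel(Deflater.BEST_SPEED); // trade compression ratio for speed
    }

    static OutputStream wrap(OutputStream raw) throws IOException {
        // A large buffer below the gzip layer batches writes into fewer
        // system calls; 256 KiB is illustrative, not a tuned value.
        return new FastGzipOutputStream(
                new BufferedOutputStream(raw, 256 * 1024), 64 * 1024);
    }
}
```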
Parallelisation: balanced by data, not count
Historical loads run across parallel containers. The original distribution logic was object-count-based, which meant containers frequently received very uneven data loads. One container might process 120 GB while another handles 20 GB. The fast container sat idle while the slow one caught up. The worst-distributed container capped true throughput.
Old distribution – count-based, uneven load:
- Container 1: 120 GB → 55 min
- Container 2: 30 GB → 18 min (idle 37 min)
- Container 3: 50 GB → 28 min (idle 27 min)
- Container 4: 20 GB → 12 min (idle 43 min)
- Wall clock: 55 min (bottlenecked by Container 1)

New distribution – size-aware, balanced load:
- Container 1: ~55 GB → 38 min
- Container 2: ~55 GB → 39 min
- Container 3: ~55 GB → 38 min
- Container 4: ~55 GB → 40 min
- Wall clock: 40 min (all containers finish together)
The new approach distributes by source data size in GB, using metadata we already have at pipeline creation time. Containers start together and finish together. Parallelism that actually works like parallelism.
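A minimal sketch of one way to balance by size – the classic longest-processing-time-first heuristic: sort tables largest first, always assign to the lightest container. The names are hypothetical and the production logic may differ in detail:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of size-aware distribution via a greedy LPT heuristic. */
class SizeAwareDistributor {
    record Table(String name, long sizeBytes) {}

    static List<List<Table>> distribute(List<Table> tables, int containers) {
        record Bucket(int index, long load) {}
        // Min-heap on assigned bytes: always top up the lightest container.
        PriorityQueue<Bucket> heap =
            new PriorityQueue<>(Comparator.comparingLong(Bucket::load));
        List<List<Table>> assignment = new ArrayList<>();
        for (int i = 0; i < containers; i++) {
            heap.add(new Bucket(i, 0L));
            assignment.add(new ArrayList<>());
        }
        // Placing the largest tables first is what gives the greedy
        // heuristic its balance.
        List<Table> sorted = new ArrayList<>(tables);
        sorted.sort(Comparator.comparingLong(Table::sizeBytes).reversed());
        for (Table t : sorted) {
            Bucket lightest = heap.poll();
            assignment.get(lightest.index()).add(t);
            heap.add(new Bucket(lightest.index(),
                                lightest.load() + t.sizeBytes()));
        }
        return assignment;
    }
}
```

Because table sizes are already known from metadata at pipeline creation time, this costs one sort and a heap – negligible next to the load itself.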
What changed: end-to-end
| Area | Change | Impact |
| --- | --- | --- |
| Mapper | Pre-compiled type conversion | Conversion logic is built at initialisation. No per-row, per-column lookups. Universal across all destination pipelines. |
| Serialisation | Purpose-built serialisers | Replaced generic implementations with narrow ones scoped to actual destination requirements. Less branching, fewer code paths. |
| Write path | Single-pass processing | Eliminated redundant double-conversion cycles at library boundaries. One pass where multiple passes were happening before. |
| Compression | Speed-optimised compression + larger buffer | Switched compression priority from ratio to speed. Increased buffer size to reduce system call frequency. Upload thread CPU: 35% → 5%. |
| Parallelism | Size-aware container distribution | Loads distributed by data volume, not object count. Containers finish together instead of waiting on the heaviest one. |
| Profiling | Component-level benchmarking | Each pipeline stage isolated and measured independently before any fix code was written. Bottlenecks identified, not assumed. |
What this makes possible isn’t just faster pipelines. It’s a data foundation that can actually keep up with AI. When your sync time drops from hours to minutes, your models aren’t working off yesterday’s data. When your pipeline runs at full CPU utilisation – not idle, not throttled – your warehouse is always current. The teams building AI-powered workflows on top of Hevo’s pipelines shouldn’t have to think about whether the data underneath is fresh enough. That’s our job. This benchmarking work is how we make sure the answer is always yes.
The foundation was solid. We’re building on it – not away from it.
> Note: The benchmarks and optimisations described in this post are based on MS SQL Server pipelines writing to Snowflake. This was our first source through the new methodology. Postgres, MySQL, and BigQuery results will follow as we complete each pass.
Questions about pipeline performance at your data scale? We’re happy to talk specifics.