We built this to be reliable. That’s still true.
When we built Hevo’s pipeline engine, we made a deliberate set of choices. Simplicity over cleverness. Reliability over raw speed. Transparency over silent magic. A pipeline that moves data correctly, predictably, and visibly – every time.
Those choices paid off. Thousands of pipelines running across more than 2,000 customers. Connectors to 150+ sources. Data flows into Snowflake, BigQuery, and Redshift reliably, day after day.
But data doesn’t stand still. The volumes are larger. The expectations are higher. And the teams depending on these pipelines – data engineers, analytics leads, RevOps – need more from infrastructure than just correctness. They need speed. They need it to scale without thinking about it.
Reliability is the baseline. Performance is the promise we make on top of it.
So earlier this year, we did something we don’t do lightly. We benchmarked ourselves. Not against a competitor. Against our own standards – using real customer data profiles, real schemas, and real volumes. And we asked, “Is this pipeline as fast as it should be?”
The answer was: it’s good. And we can make it significantly better.
Why this matters now: AI runs on your data
There’s a reason pipeline performance has moved from “nice to have” to “critical infrastructure” in the last two years. AI.
Every LLM integration, every ML model, every AI-powered workflow in your product starts with one question: is the underlying data fresh, complete, and trustworthy? A pipeline that takes three hours to sync isn’t just slow – it’s a ceiling on what your AI can know.
> The principle: AI is only as good as the data underneath it. And the data is only as good as the pipeline moving it. Speed, reliability, and transparency aren’t pipeline features – they’re AI features.
This is what shaped our benchmarking work. Not just “How do we go faster?” but “What does a pipeline need to look like when it’s the foundation of an AI-powered data stack?”
Three things matter:
- Fast enough that freshness is measured in minutes, not hours
- Reliable enough that you never have to wonder if the data made it
- Transparent enough that when something changes, you know immediately
The benchmarking work we’re describing here is about the first one. The other two have always been core to how Hevo is built – and they remain non-negotiable.
What the benchmarks showed
We ran structured benchmarks using customer-inspired data profiles – real schemas, realistic row counts, representative data shapes. Five runs each. We measured at the component level, not just end-to-end, so we knew exactly where time was going.
The headline numbers:
- Ingestion time: 180 min → ~40 min
- Ingestion-to-load ratio: 3–4× → ~1×
- Querier: 4× improvement
- Mapping: 10× faster

*These figures are for SQL Server pipelines; see the note at the end of this post.
Ingestion was consuming 3–4× more time than loading. That ratio told us the work was in the pipeline engine itself – specifically in how data was being transformed, written, and compressed between source and destination.
The load side was already doing its job well. The real opportunity was on the ingestion side. Not because anything was broken, but because it had been designed for a time when data volumes were smaller and demands were lighter.
Things have changed since then. The scale is bigger, the expectations are higher, and the engine needs to evolve to keep up.
How we approached it: component by component
The methodology mattered as much as the fixes. Before writing a single line of optimisation code, we isolated every major component of the pipeline and measured its individual throughput ceiling. Think of it like diagnosing a road network: you don’t fix the motorway until you know which junction is the bottleneck.
A Hevo ingestion pipeline has two major stages:
- Stage 1 – Source reader: Reads from the external source, handles network interaction, and converts data into Hevo’s internal format. Source-specific. Optimisations here are per-connector.
- Stage 2 – Destination mapper: Takes internal-format records, maps to destination types, writes to CSV, compresses, and uploads to staging. This stage is source-agnostic – every connector writing to Snowflake goes through the same code path. Wins here are universal.
We ran each stage in isolation – synthetic load on Stage 2 to find its raw ceiling, and Stage 1 measured against a drop sink so that no downstream noise affected the reading. Once we had individual capacity numbers, we knew exactly where to focus. A sketch of the drop-sink idea follows.
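To make the isolation concrete, here is a minimal sketch in Java. The `Record`, `SourceReader`, and `DropSink` names are hypothetical stand-ins for internal abstractions, not Hevo’s actual code – the point is simply that the sink counts and discards, so the measured rate reflects Stage 1 alone:

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-ins for the pipeline's internal abstractions.
interface Record { int serializedSize(); }
interface SourceReader { Record next(); } // returns null when exhausted
interface Sink { void accept(Record r); }

/** A sink that discards every record, so only the reader's cost is measured. */
class DropSink implements Sink {
    final LongAdder records = new LongAdder();
    final LongAdder bytes = new LongAdder();

    @Override
    public void accept(Record r) {
        records.increment();
        bytes.add(r.serializedSize());
        // Intentionally no downstream work: no mapping, no CSV, no compression.
    }
}

class StageOneBenchmark {
    /** Drains the reader into the drop sink and reports MB/s for Stage 1 alone. */
    static double measure(SourceReader reader, DropSink sink) {
        long start = System.nanoTime();
        Record r;
        while ((r = reader.next()) != null) {
            sink.accept(r);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return sink.bytes.sum() / (1024.0 * 1024.0) / seconds;
    }
}
```

Running Stage 2 against synthetic records works the same way in reverse: feed pre-generated records in as fast as the mapper can take them, and its own ceiling falls out.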
What we changed
The mapper: from reactive to pre-compiled
The biggest win came from rethinking how we handle type conversion at write time. The original mapper was reactive – for every row, for every column, it looked up the source type, looked up the destination type, found the right converter, and applied it. Correct. Just not fast at scale.
The compiled mapper builds its conversion logic once, at initialisation, when both schemas are known. Every row after that just follows pre-built instructions. The per-row lookup overhead is eliminated. This change is universal across all destinations.
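Here is a minimal sketch of the idea in Java – the `CompiledMapper` class and the type names are illustrative, not Hevo’s actual implementation. What matters is that the converter lookup moves from the per-row hot loop into the constructor:

```java
import java.util.List;
import java.util.function.Function;

/** Sketch of a pre-compiled mapper; schema types are hypothetical strings. */
class CompiledMapper {
    // One converter per column, resolved once when both schemas are known.
    private final Function<Object, Object>[] converters;

    @SuppressWarnings("unchecked")
    CompiledMapper(List<String> sourceTypes, List<String> destTypes) {
        converters = new Function[sourceTypes.size()];
        for (int i = 0; i < converters.length; i++) {
            // The lookup that used to happen per row, per column now
            // happens exactly once per column, at initialisation.
            converters[i] = resolveConverter(sourceTypes.get(i), destTypes.get(i));
        }
    }

    /** Per-row work is now just an indexed call: no type lookups remain. */
    Object[] map(Object[] row) {
        Object[] out = new Object[row.length];
        for (int i = 0; i < row.length; i++) {
            out[i] = converters[i].apply(row[i]);
        }
        return out;
    }

    private static Function<Object, Object> resolveConverter(String src, String dst) {
        // Illustrative only; a real mapper covers the full type matrix.
        if (src.equals("DATETIME") && dst.equals("TIMESTAMP_NTZ")) {
            return v -> v.toString(); // placeholder conversion
        }
        return v -> v; // identity for types that need no conversion
    }
}
```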
Serialisation: right-sized for the job
Standard libraries are built for the general case – every format, every edge case, every legacy scenario imaginable. That breadth comes at a cost. We audited our serialisation layer and replaced general-purpose implementations with narrow, purpose-built ones scoped exactly to what our destination actually requires. Less branching. Fewer code paths. The same correctness, at lower overhead.
A library that handles everything is a liability when you only need one thing.
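As an illustration of what “narrow” means in practice, here is a hypothetical sketch of a CSV field serialiser hardcoded to a single dialect – comma delimiter, doubled-quote escaping, quoting only when needed. No configurable delimiters, quote modes, or locale handling, because the destination ingests exactly one format:

```java
/**
 * Sketch of a narrow CSV serialiser: one delimiter, one quote style,
 * quoting only when required. No configuration branches in the hot path.
 */
final class NarrowCsvSerializer {
    private final StringBuilder buf = new StringBuilder(64 * 1024);

    void writeField(String value) {
        if (needsQuoting(value)) {
            buf.append('"');
            for (int i = 0; i < value.length(); i++) {
                char c = value.charAt(i);
                if (c == '"') buf.append('"'); // escape by doubling the quote
                buf.append(c);
            }
            buf.append('"');
        } else {
            buf.append(value); // fast path: no escaping, no extra allocation
        }
        buf.append(',');
    }

    /** Assumes at least one field was written for the current row. */
    void endRow() {
        buf.setCharAt(buf.length() - 1, '\n'); // replace trailing comma
    }

    private static boolean needsQuoting(String v) {
        for (int i = 0; i < v.length(); i++) {
            char c = v.charAt(i);
            if (c == ',' || c == '"' || c == '\n' || c == '\r') return true;
        }
        return false;
    }
}
```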
Write path: eliminating conversion overhead
Data moving through a pipeline passes through several transformation layers before it’s written to disk. Over time, each layer boundary can introduce small inefficiencies – redundant conversions, unnecessary passes over the same data, and escape logic that runs on every value regardless of whether it’s needed. We audited the full write path, removed redundant conversions, and replaced multi-pass implementations with single-pass equivalents. Unglamorous yet impactful.
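One concrete example of the kind of boundary cost involved – the class below is a hypothetical sketch, not Hevo’s code. Converting a numeric value to a `String` and then to bytes allocates twice per value; a single pass can write the digits straight into the output buffer:

```java
import java.nio.charset.StandardCharsets;

/** Sketch: eliminating a per-value String round trip on the write path. */
final class DirectLongWriter {
    // Before: two conversions, two allocations per value.
    static byte[] slow(long v) {
        String s = Long.toString(v);                  // pass 1: long -> String
        return s.getBytes(StandardCharsets.US_ASCII); // pass 2: String -> bytes
    }

    // After: one pass, digits written directly into the output buffer.
    static int fast(long v, byte[] out, int pos) {
        if (v == Long.MIN_VALUE) { // negation would overflow; take the slow path
            byte[] s = slow(v);
            System.arraycopy(s, 0, out, pos, s.length);
            return pos + s.length;
        }
        if (v < 0) { out[pos++] = '-'; v = -v; }
        int start = pos;
        do {
            out[pos++] = (byte) ('0' + (v % 10));
            v /= 10;
        } while (v != 0);
        // Digits were produced least-significant first; reverse in place.
        for (int i = start, j = pos - 1; i < j; i++, j--) {
            byte t = out[i]; out[i] = out[j]; out[j] = t;
        }
        return pos; // new write position
    }
}
```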
Compression: speed over ratio, CPU freed
On a 4-core machine with one mapper thread, we expected ~25% CPU usage. We were seeing 60–65%. The culprit was the background gzip compression thread – running at maximum compression ratio, using a small buffer that triggered frequent OS kernel calls.
We switched to best-speed compression and increased the buffer size. Network bandwidth wasn’t the bottleneck, so the lower compression ratio cost us nothing. The upload thread’s CPU usage dropped from ~35% to ~5%. That freed nearly a full core of headroom we can now use for additional mapper threads.
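For readers who want the mechanics: assuming the JDK’s `GZIPOutputStream` for illustration, the compression level isn’t exposed in the constructor, so the change looks roughly like this (buffer sizes here are illustrative, not our tuned values):

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.Deflater;
import java.util.zip.GZIPOutputStream;

/**
 * GZIPOutputStream hides the compression level, so the usual trick is a
 * small subclass that resets the protected Deflater to BEST_SPEED.
 */
class FastGzipOutputStream extends GZIPOutputStream {
    FastGzipOutputStream(OutputStream out, int bufferSize) throws IOException {
        super(out, bufferSize);            // deflater buffer (default is 512 B)
        def.setLevel(Deflater.BEST_SPEED); // trade compression ratio for speed
    }

    static OutputStream wrap(OutputStream raw) throws IOException {
        // A large buffer below the gzip layer batches writes into fewer
        // system calls; 256 KiB is illustrative, not a tuned value.
        return new FastGzipOutputStream(
                new BufferedOutputStream(raw, 256 * 1024), 64 * 1024);
    }
}
```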
Parallelisation: balanced by data, not count
Historical loads run across parallel containers. The original distribution logic was object-count-based, which meant containers frequently received very uneven data loads. One container might process 120 GB while another handles 20 GB. The fast container sat idle while the slow one caught up. The worst-distributed container capped true throughput.
Old distribution – count-based, uneven load:
- Container 1: 120 GB → 55 min
- Container 2: 30 GB → 18 min (idle 37 min)
- Container 3: 50 GB → 28 min (idle 27 min)
- Container 4: 20 GB → 12 min (idle 43 min)
- Wall clock: 55 min (bottlenecked by Container 1)

New distribution – size-aware, balanced load:
- Container 1: ~55 GB → 38 min
- Container 2: ~55 GB → 39 min
- Container 3: ~55 GB → 38 min
- Container 4: ~55 GB → 40 min
- Wall clock: 40 min (all containers finish together)
The new approach distributes by source data size in GB, using metadata we already have at pipeline creation time. Containers start together and finish together. Parallelism that actually works like parallelism.
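A minimal sketch of one way to balance by size – the classic longest-processing-time-first heuristic: sort tables largest first, always assign to the lightest container. The names are hypothetical and the production logic may differ in detail:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of size-aware distribution via a greedy LPT heuristic. */
class SizeAwareDistributor {
    record Table(String name, long sizeBytes) {}

    static List<List<Table>> distribute(List<Table> tables, int containers) {
        record Bucket(int index, long load) {}
        // Min-heap on assigned bytes: always top up the lightest container.
        PriorityQueue<Bucket> heap =
            new PriorityQueue<>(Comparator.comparingLong(Bucket::load));
        List<List<Table>> assignment = new ArrayList<>();
        for (int i = 0; i < containers; i++) {
            heap.add(new Bucket(i, 0L));
            assignment.add(new ArrayList<>());
        }
        // Placing the largest tables first is what gives the greedy
        // heuristic its balance.
        List<Table> sorted = new ArrayList<>(tables);
        sorted.sort(Comparator.comparingLong(Table::sizeBytes).reversed());
        for (Table t : sorted) {
            Bucket lightest = heap.poll();
            assignment.get(lightest.index()).add(t);
            heap.add(new Bucket(lightest.index(),
                                lightest.load() + t.sizeBytes()));
        }
        return assignment;
    }
}
```

Because table sizes are already known from metadata at pipeline creation time, this costs one sort and a heap – negligible next to the load itself.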
What changed: end-to-end
| Area | Change | Impact |
| --- | --- | --- |
| Mapper | Pre-compiled type conversion | Conversion logic is built at initialisation. No per-row, per-column lookups. Universal across all destination pipelines. |
| Serialisation | Purpose-built serialisers | Replaced generic implementations with narrow ones scoped to actual destination requirements. Less branching, fewer code paths. |
| Write path | Single-pass processing | Eliminated redundant double-conversion cycles at library boundaries. One pass where multiple passes were happening before. |
| Compression | Speed-optimised compression + larger buffer | Switched compression priority from ratio to speed. Increased buffer size to reduce system call frequency. Upload thread CPU: 35% → 5%. |
| Parallelism | Size-aware container distribution | Loads distributed by data volume, not object count. Containers finish together instead of waiting on the heaviest one. |
| Profiling | Component-level benchmarking | Each pipeline stage isolated and measured independently before any fix code was written. Bottlenecks identified, not assumed. |
What this makes possible isn’t just faster pipelines. It’s a data foundation that can actually keep up with AI. When your sync time drops from hours to minutes, your models aren’t working off yesterday’s data. When your pipeline runs at full CPU utilisation – not idle, not throttled – your warehouse is always current. The teams building AI-powered workflows on top of Hevo’s pipelines shouldn’t have to think about whether the data underneath is fresh enough. That’s our job. This benchmarking work is how we make sure the answer is always yes.
The foundation was solid. We’re building on it – not away from it.
> Note: The benchmarks and optimisations described in this post are based on MS SQL Server pipelines writing to Snowflake. This was our first source through the new methodology. Postgres, MySQL, and BigQuery results will follow as we complete each pass.
Questions about pipeline performance at your data scale? We’re happy to talk specifics.