Automatic Schema Management – Cornerstone of the Modern Data Pipeline

Published: January 31, 2022


The emergence of the Modern Data Pipeline has made it significantly easier for companies to integrate data from multiple sources into a unified data warehouse, enabling them to be data-driven while saving time and money.

One of the critical factors that make a Modern Data Pipeline maintenance-free and robust is the automation of Schema Management. The Schema Management process is where you define how you want to store data in the destination warehouse.

It refers to two functions in a data pipeline:

  1. Schema creation: Before ingesting data from any data source, the schema of that source has to be created in the destination warehouse. Data ingestion can begin only after the source schema has been mapped to the destination (a minimal sketch follows this list).
  2. Schema evolution: As you change your data sources or upgrade your data stack, the source schema is bound to change. For a data pipeline to function correctly, the destination schema must be continuously updated to mirror changes made in the source schema.
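
To make the mapping step concrete, here is a minimal Python sketch of schema creation as a type translation. The MySQL-style source types and BigQuery-style destination types are illustrative assumptions, not an exhaustive mapping:

```python
# Minimal sketch: translate source column types into warehouse column types.
# The type names are illustrative (MySQL-like -> BigQuery-like), not exhaustive.
SOURCE_TO_DEST = {
    "INT": "INT64",
    "VARCHAR(255)": "STRING",
    "DATETIME": "TIMESTAMP",
    "TINYINT(1)": "BOOL",
}

source_schema = {"user_id": "INT", "email": "VARCHAR(255)", "created_at": "DATETIME"}
dest_schema = {column: SOURCE_TO_DEST[dtype] for column, dtype in source_schema.items()}
print(dest_schema)  # {'user_id': 'INT64', 'email': 'STRING', 'created_at': 'TIMESTAMP'}
```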

Why Schema Management Should be Automated

Schema Management is the bridge between your Source and Destination data, making it a critical component of a data pipeline.

If Schema Management is handled manually, data teams lose a significant amount of time building data pipelines and then keeping up with every subsequent schema change.

Manual Schema Management also results in frequent data downtime, delaying the entire data-driven decision-making process.

Because the flow of data is interrupted so often, it becomes nearly impossible to power data applications like live dashboards that need fresh, real-time data. Extracting meaningful insights from data takes far longer and requires extensive manual intervention.

The Modern Data Pipeline Approach to Schema Management

Example of what an Automapper looks like

In a Modern Data Pipeline, Schema Management is fully automated. The Automapper creates and updates your Destination schema based on the Source schema without manual intervention.

As a result of Automatic Schema Management, data engineers no longer spend time building and maintaining data pipelines by hand, freeing them up for more mission-critical tasks.

End users and data analysts no longer face constant data downtime, enabling them to power data applications with the latest data.

Consequently, data time-to-value is significantly reduced at an organizational level. Data-driven organizations can accomplish more with their data, faster – making Automatic Schema Management the cornerstone of a Modern Data Pipeline.

Hassle-free Schema Creation

The Modern Data Pipeline automatically detects the source schema and replicates all objects and fields in your Destination, regardless of the complexity. Manual intervention isn’t required, and the data replication process begins instantly.
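
As a rough illustration, the sketch below infers column types from a sample of source records and emits the corresponding CREATE TABLE statement. The helper name and type mapping are assumptions for illustration, not Hevo's implementation:

```python
from typing import Any

# Illustrative Python-type -> warehouse-type mapping; a real pipeline covers
# many more types (dates, decimals, nested records, ...).
WAREHOUSE_TYPES = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE", str: "VARCHAR"}

def create_table_ddl(table: str, records: list[dict[str, Any]]) -> str:
    """Detect fields across sample records and emit CREATE TABLE DDL."""
    columns: dict[str, str] = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, WAREHOUSE_TYPES.get(type(value), "VARCHAR"))
    body = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return f"CREATE TABLE {table} (\n  {body}\n);"

print(create_table_ddl("users", [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 2, "email": "b@example.com", "active": True},
]))
```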

Automatic Schema Evolution – Zero Data Downtime Due to Schema Changes

After the initial setup, the Modern Data Pipeline detects all incoming changes in the Source schema and alters the Destination schema to reflect them. This ensures zero data downtime due to schema changes and a continuous flow of data to the warehouse.

Here is how the Automapper reacts to various scenarios (a minimal sketch follows the list):

1. If a new field is added to an existing object, the Automapper creates the respective column in the Destination.

2. If a field is deleted from an existing object, the Automapper stops loading data into the respective column.

3. If the data type is changed for a field, the Automapper automatically promotes or changes the Destination column’s data type to accommodate the change.

4. If a new object is added, the Automapper creates and maps the object in the Destination and initiates the data replication process.

5. If the incoming JSON data format changes, the Automapper adds fields to the Destination object to accommodate new values. Missing values are skipped.
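
A minimal Python sketch of how an automapper might turn these scenarios into DDL. The promotion order, function names, and single-table scope are illustrative assumptions; a production pipeline tracks far more type detail:

```python
# Type widths from narrowest to widest; promotion always picks the wider type.
PROMOTION_ORDER = ["BOOLEAN", "BIGINT", "DOUBLE", "VARCHAR"]

def promote(current: str, incoming: str) -> str:
    """Scenario 3: widen the column type so both old and new values fit."""
    return max(current, incoming, key=PROMOTION_ORDER.index)

def evolve(table: str, dest: dict[str, str], src: dict[str, str]) -> list[str]:
    """Diff the source schema against the destination and emit the needed DDL."""
    ddl = []
    for field, dtype in src.items():
        if field not in dest:            # scenarios 1 and 5: new field/column
            ddl.append(f"ALTER TABLE {table} ADD COLUMN {field} {dtype};")
        elif dest[field] != dtype:       # scenario 3: data type changed
            wider = promote(dest[field], dtype)
            ddl.append(f"ALTER TABLE {table} ALTER COLUMN {field} SET DATA TYPE {wider};")
    # Scenario 2 (field deleted at the source) needs no DDL: the column simply
    # stops receiving data. Scenario 4 (new object) goes through schema creation.
    return ddl

print(evolve("users", {"id": "BIGINT"}, {"id": "DOUBLE", "plan": "VARCHAR"}))
```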

Ensures zero data loss

Automatic Schema Management features can resolve incompatible schema mappings in most situations, but there might be scenarios where it cannot do so.

In the case of Hevo, the impacted events are parked aside so that you can intervene to resolve the error. This feature serves as a fail-safe, so you don't lose any data.
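
Hevo's internal mechanism isn't shown here, but the parking pattern itself is simple. In the sketch below, the names (park_or_load, parked_events) are hypothetical; any event that doesn't fit the destination schema is quarantined instead of dropped:

```python
parked_events: list[dict] = []  # quarantine for events that fail to map

def park_or_load(event: dict, schema: dict[str, type]) -> bool:
    """Load the event if it fits the destination schema; otherwise park it."""
    fits = all(
        field in schema and isinstance(value, schema[field])
        for field, value in event.items()
    )
    if fits:
        print(f"loaded: {event}")  # stand-in for the actual warehouse write
        return True
    parked_events.append(event)    # nothing is lost: resolve, then replay
    return False

schema = {"user_id": int, "email": str}
park_or_load({"user_id": 1, "email": "a@example.com"}, schema)  # loaded
park_or_load({"user_id": "oops"}, schema)                       # parked
```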

Offers control

Depending on your data needs, Hevo also offers you the ability to alter the destination schema (a minimal sketch follows the list) so you can:

  • Drop certain Objects
  • Drop specific fields in Objects
  • Change field data types
  • Change Object names
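
Expressed as configuration, such overrides might look like the sketch below. The keys and the helper are purely illustrative assumptions; Hevo exposes equivalent controls in its product:

```python
# Hypothetical override config illustrating the four controls listed above.
OVERRIDES = {
    "drop_objects": ["audit_log"],                         # drop whole Objects
    "drop_fields": {"users": ["ssn"]},                     # drop fields per Object
    "cast_fields": {"users": {"signup_ts": "TIMESTAMP"}},  # change field data types
    "rename_objects": {"users": "customers"},              # change Object names
}

def apply_overrides(obj: str, record: dict):
    """Return the (possibly renamed) Object and filtered record, or None to drop."""
    if obj in OVERRIDES["drop_objects"]:
        return None  # Object is not replicated at all
    kept = {field: value for field, value in record.items()
            if field not in OVERRIDES["drop_fields"].get(obj, [])}
    # cast_fields would be applied when loading values into the warehouse.
    return OVERRIDES["rename_objects"].get(obj, obj), kept

print(apply_overrides("users", {"user_id": 1, "ssn": "123-45-6789"}))
# -> ('customers', {'user_id': 1})
```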

Here is how Automatic Schema Management has benefited our customers:

"We were spending almost 20 hours a week on writing and managing custom scripts, and still we were facing a lot of issues with missing data, inefficient queries, inability to track schema changes, and leverage this data for real-time insights."

Ankit Singh, Groww

"The automapping functionality is essential as schema changes in our data sources are so frequent that without it, our pipeline would break every other day!"

Juan Ramos, Ebury

Take a look at how Ebury and Groww unlocked their data potential with Hevo.

Automatic Schema Management with Hevo

Automatic Schema Management addresses some of the foundational problems Data Teams face with a Traditional Data Pipeline. If you’re spending too much time building and maintaining data pipelines, now is an excellent time to try out Hevo.