Real-time data requires processing and analytics to happen within seconds. Organizations often struggle to store and process this data quickly enough: high throughput, large volumes, and varied data formats all have to be handled at once. Companies therefore need a way to manage these factors effectively while processing real-time data.

In this article, we’ll see what Apache Iceberg is and how it suits streaming data. 

Challenges in Using Data Lake for Streaming Data

A data lake is a central repository that lets you store and process large amounts of data. It keeps the data in its original format, whether structured or unstructured. However, this comes with challenges when handling streaming data.

  • Bulk updates: Data lakes store tremendous amounts of data that companies want to update frequently. Because the data sits in raw files, ensuring consistency during those updates isn’t easy.

In streaming use cases especially, updates arrive so often that this becomes a real pain point without a structured storage layer.

  • Small file problem: Streaming data involves frequent writes. Imagine creating a small file each time data is ingested, which in real-time streaming can be every minute or so.

This adds up to millions of files in your data lake every day. After a few months, scanning and analyzing those millions of files takes forever.

  • Schema evolution: Streaming data doesn’t always arrive in a consistent format. It demands flexible schema evolution while maintaining data consistency.

Schema evolution means updating the current schema to accommodate data that changes over time. Unfortunately, data lakes built on unstructured file storage offer little support for it.

These challenges make unstructured data lakes inefficient for high-scale streaming data. That’s where Apache Iceberg comes to the rescue. Let’s explore what it is and its benefits for streaming data.

What is Apache Iceberg?

Apache Iceberg isn’t an execution engine like Apache Spark, database software like MySQL, or a storage repository like a data lake. Iceberg is a table format that sits on top of a data lake and gives your raw data files a structured, table-like layout.

The Iceberg format brings schema evolution, partitioning, and version rollback to the data lake. The table abstraction makes it easy for query and execution engines to interact with the data and perform the necessary operations.

Netflix created Iceberg to address the performance and correctness challenges it encountered with Apache Hive tables. In 2018, the company donated Iceberg to the Apache Software Foundation to make it open source and accessible to all organizations.

Apache Iceberg’s Approach to Streaming Data

Let’s first understand the two popular paradigms that Iceberg supports: MoR (Merge-on-Read) and CoW (Copy-on-Write). Then, we can conclude the right approach for streaming data use cases.

1. CoW (Copy-on-Write): In this approach, every time a modification is made, the entire data file is rewritten with the updated information, even if it involves updating just a single row. 

This makes sense when updating a large number of records. However, rewriting the entire file just to update a few rows makes small and frequent changes expensive. 

Before diving into the MoR (Merge-on-Read) approach, you should know what delete files are. Delete files log all the records that are logically deleted in the dataset so they can be ignored while querying the table.

2. MoR (Merge-on-Read): In MoR, the entire data table is not rewritten. Instead, a delete file is generated. This delete file keeps track of all the deleted records and ensures our queries skip these entries at the time of execution. 

This works well for deleting data from a table. But what if you want to update rows in the table?

In such a case, the rows to be updated are listed in the delete file. A new data file is created with just the updated records. During query execution, the updated records are fetched from this new data file, while unchanged records are read from the original dataset.

Overall, writes in MoR are faster because changes are recorded in delete files and new data files rather than by rewriting existing data files. This makes it well suited to streaming data, where frequent updates are required.

So, for streaming use cases, Iceberg tables are typically configured to follow the Merge-on-Read (MoR) approach, which makes them highly adaptable to frequent, small writes.
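
If you want an Iceberg table to use merge-on-read for row-level changes, you can set this through table properties. The sketch below is illustrative only: it assumes a Flink Table environment (tEnv, set up later in this article) and an existing Iceberg table named iceberg_table, and the property names come from Iceberg’s table configuration.

// Illustrative sketch: switch delete, update, and merge operations
// to merge-on-read via Iceberg table properties. Table name and
// environment are hypothetical.
tEnv.executeSql(
        "ALTER TABLE iceberg_table SET (" +
        "  'write.delete.mode' = 'merge-on-read'," +
        "  'write.update.mode' = 'merge-on-read'," +
        "  'write.merge.mode' = 'merge-on-read'" +
        ")"
);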

Benefits of Using Iceberg for Streaming Data Use Cases

The Iceberg table format has several benefits over raw, unstructured file storage, especially in the case of streaming data. Let’s have a look at what this format brings to the table.

Schema evolution 

Streaming data is dynamic with frequent schema updates. If your data format doesn’t support schema evolution, schema changes can cause downtime, interrupt data flow, and lead to data loss. 

Schema evolution means adopting the latest schema without affecting ongoing operations on the data. Iceberg supports this natively, which makes it highly suitable for streaming data.
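
As a minimal sketch of what schema evolution looks like with Iceberg’s core Java API (the catalog location and the table and column names here are hypothetical), adding a column is a metadata-only change that leaves existing data files untouched:

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.types.Types;

// Load an existing table from a (hypothetical) Hadoop catalog.
HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "hdfs://namenode:8020/warehouse");
Table table = catalog.loadTable(TableIdentifier.of("db", "table"));

// Add an optional column; old rows simply read as null for it.
table.updateSchema()
     .addColumn("device_type", Types.StringType.get())
     .commit();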

ACID compliance 

Many organizations use streaming data for real-time decision-making. Iceberg’s ACID guarantees keep your data accurate and consistent even as streams write to it continuously, allowing you to make confident, instant business decisions. This is crucial for critical workloads, such as financial data.

Time travel

Time travel allows you to investigate if something went wrong by digging into the data table at different points in time. Iceberg generates a snapshot every time you make a change to the table. 

Each snapshot captures the state of your table at that point in time. This way, Iceberg maintains a history of snapshots, so you can specify a timestamp and view the table as it was at that time.

This feature is helpful for data recovery, audit, and compliance purposes. Say you accidentally deleted crucial records from a table. You can instantly roll back to the previous snapshot and restore your original data.
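
As a rough illustration of time travel using the Flink connector’s read options ('snapshot-id' and 'as-of-timestamp'), you could query an earlier state of a table like this. The table name, snapshot ID, and timestamp below are placeholders, and such reads are typically run as batch queries:

// Read the table as of a specific snapshot ID (placeholder value).
tEnv.executeSql(
        "SELECT * FROM iceberg_table " +
        "/*+ OPTIONS('snapshot-id'='1234567890') */"
);

// Or read the table as it existed at a point in time,
// given in epoch milliseconds (placeholder value).
tEnv.executeSql(
        "SELECT * FROM iceberg_table " +
        "/*+ OPTIONS('as-of-timestamp'='1678883424000') */"
);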

Query performance

The Iceberg format maintains up-to-date metadata, including statistics about the partitions and data files in each table.

The query engine uses this information for efficient query planning. For example, if the engine knows the number of records in a partition ahead of time, it can allocate resources accordingly.
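
For example, if a table were partitioned by a date column such as event_date (hypothetical here), a filter on that column lets the planner skip entire partitions and files using the statistics in the metadata, before reading any data:

// The planner can prune partitions and data files from Iceberg's
// metadata alone; table and column names are hypothetical.
tEnv.executeSql(
        "SELECT user_id, behavior " +
        "FROM iceberg_table " +
        "WHERE event_date = DATE '2024-01-15'"
);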

Reduced number of files

Each transaction made to a table generates a new file. As more changes are made to Iceberg tables, more files are created, which slows down queries as they need to scan more files. 

Apache Iceberg addresses this with compaction, which is typically run as a scheduled maintenance job. Compaction merges small files into a few large files, reducing the total number of files.

This is particularly helpful for streaming data, as frequent updates generate a lot of small files. Compacting these into larger files makes your queries faster.
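
The sketch below shows what triggering compaction can look like with Iceberg’s Flink actions. It assumes Iceberg’s Flink runtime module is on the classpath and uses a hypothetical warehouse path; treat it as an outline of the rewrite-data-files action rather than a drop-in job.

import org.apache.iceberg.Table;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.actions.Actions;

// Load the Iceberg table from its (hypothetical) warehouse location.
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://namenode:8020/warehouse/db/table");
Table table = tableLoader.loadTable();

// Merge small data files into larger ones so queries scan fewer files.
Actions.forTable(table)
       .rewriteDataFiles()
       .execute();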

Step-by-Step Guide for Streaming Data into Iceberg Tables

Apache Kafka, Apache Flink, and Apache Iceberg are the three main tools in this pipeline, carrying data from the source into the table format:

  • Apache Kafka: The source for the streaming data, meaning you’ll read the data from here.
  • Apache Flink: This processing framework supports both batch and stream processing.
  • Apache Iceberg: The table format in which the processed data is stored.

Once you install and set up these tools on your machine, you can build streaming data pipelines. Below is sample code for streaming data into Iceberg tables using Flink.

Step 1: Set up the Flink environment to use the DataStream API.

The settings variable in the code snippet below sets the execution to streaming mode.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().build();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);

Now set up the Kafka connection properties.

import java.util.Properties;

// Basic Kafka consumer settings: broker address and consumer group.
// (The KafkaSource builder in Step 2 sets these values directly.)
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "my-group");

Step 2: Use the Flink Kafka connector (KafkaSource) to read data from Kafka topics.

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

// Read string records from the Kafka topic, starting from the earliest offset.
KafkaSource<String> source = KafkaSource.<String>builder()
    .setBootstrapServers("localhost:9092")
    .setTopics("input-kafka-topic")
    .setGroupId("my-group")
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build();
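
The KafkaSource above isn’t attached to the job by itself. If you want to consume it through the DataStream API, you would wire it into the environment roughly like this (the source name and the no-watermarks strategy are just simple defaults for illustration):

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

// Attach the Kafka source to the streaming environment. In this
// article the Table API path below is used instead, so this is optional.
DataStream<String> kafkaStream = env.fromSource(
        source,
        WatermarkStrategy.noWatermarks(),
        "kafka-source");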

Step 3: Create a Kafka table with the structure of streaming data.

tEnv.executeSql(
        "CREATE TABLE kafka_table (" +
        "  user_id BIGINT, " +
        "  behavior STRING, " +
        "  `timestamp` TIMESTAMP " +
        ") WITH (" +
        "  'connector' = 'kafka'," +
        "  'topic' = 'input-kafka-topic'," +
        "  'properties.bootstrap.servers' = 'localhost:9092'," +
        "  'properties.group.id' = 'my-group'," +
        "  'format' = 'json'" +
        ")"
);

Step 4: Create the Iceberg table where you want to write the data. Depending on your catalog setup, you may also need properties such as 'catalog-type' and a metastore URI or warehouse path in the WITH clause.

tEnv.executeSql(
        "CREATE TABLE iceberg_table (" +
        "  user_id BIGINT, " +
        "  behavior STRING, " +
        "  `timestamp` TIMESTAMP" +
        ") WITH (" +
        "  'connector' = 'iceberg'," +
        "  'catalog-name' = 'default'," +
        "  'database-name' = 'db'," +
        "  'table-name' = 'table'" +
        ")"
);

Step 5: Insert data into the Iceberg table.

TableResult output = tEnv.executeSql("INSERT INTO iceberg_table SELECT * FROM kafka_table");
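
One detail worth noting for streaming jobs: the Iceberg sink commits newly written data files when a Flink checkpoint completes, so checkpointing should be enabled on the environment (ideally back in Step 1) or the written data won’t become visible to readers. The interval below is only an example.

// Commit Iceberg writes on each checkpoint; 60 seconds is an
// arbitrary example interval.
env.enableCheckpointing(60_000);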

With the above approach, practical coding experience with multiple tech stacks is essential to build streaming data pipelines. However, with platforms like Hevo Data, both technical and business users can automate data movement between different platforms.

For those who don’t know, Hevo Data is a no-code, unified data platform that automatically syncs data from all your sources to the warehouse. This platform lets you load data from 150+ distinct sources to target destinations. 

Try Hevo for free and automate data movement from your sources to your destination.

Iceberg vs. Traditional Catalogs

As the world shifted to cloud-based systems, organizations adopted cloud data lakes for their anytime, anywhere accessibility.

Amazon S3 is a popular example of cloud storage used as a data lake. Each dataset in such storage is organized as a collection of files within a directory.

However, the files themselves carry no schema definitions for columns and data types, no partition information, and no ACID guarantees.

That’s where metadata catalogs come into the picture. A metadata catalog records the dataset’s location, columns, data types, and more. The Hive metastore is one of the most popular catalogs for storing dataset schema and structure.

Apache Iceberg vs. Catalogs

Though these catalogs make the dataset structure consistent across the applications accessing it, they don’t handle concurrent data changes or schema evolution well.

For instance, if one application adds or removes files in a dataset, another application reading it at the same time may see an outdated or inconsistent view of the data.

Moreover, the Hive catalog doesn’t retain a history of changes. It only describes the current dataset schema, while Iceberg stores the complete history of a table: a snapshot is recorded every time the table changes, which is what enables time travel.

Overall, Iceberg combines the schema evolution, partitioning, and time travel capabilities of traditional databases with the scalability and flexibility of data lake architecture.

Conclusion

Since 2018, Iceberg has become a go-to table format for many organizations, including Apple, Adobe, and Airbnb. These organizations deal with tons of real-time data. Iceberg not only resolves their performance issues in handling data at that scale, but also helps with data governance.

In this blog, you’ve learned about the benefits of using Iceberg for streaming data and walked through sample code for streaming data from a source into Iceberg tables.

Get a demo with Hevo to simplify data integration. Also, check out Hevo’s unbeatable pricing or get a custom quote for your requirements.

FAQs

  1. Why is Apache Iceberg better than Hive?

Iceberg is an open table format designed to overcome Hive’s limitations. It provides schema evolution, time travel via snapshot isolation, and a consistent view of real-time data.

  2. Does Apache Iceberg support streaming?

Yes, Iceberg supports both batch and stream processing through Apache Flink’s DataStream API and Table API.

  3. What companies use Apache Iceberg?

Many industry giants like Apple, Airbnb, Netflix, Cloudera, and Adobe use Iceberg in their data lake architecture.

Srujana Maddula
Technical Content Writer

Srujana is a seasoned technical content writer with over 3 years of experience. She specializes in data integration and analysis and has worked as a data scientist at Target. Using her skills, she develops thoroughly researched content that uncovers insights and offers actionable solutions to help organizations navigate and excel in the complex data landscape.