Real-time data requires processing and analytics tasks to be performed within seconds. Organizations often struggle to store and process this data quickly enough: high throughput, large volumes, and varied data formats all need to be addressed. Companies therefore need a way to manage these factors effectively while processing real-time data.
In this article, we’ll see what Apache Iceberg is and how it suits streaming data.
Challenges in Using a Data Lake for Streaming Data
A data lake is a central repository that lets you store and process large amounts of data. It stores the data in its original format, whether structured or unstructured. However, handling streaming data in a data lake comes with several challenges.
- Bulk updates: Data lakes store tremendous amounts of data, and companies need to update this data frequently. Because the data is stored in raw format, ensuring data consistency during updates isn’t easy.
In streaming use cases especially, updates occur so often that the lack of a structured storage system becomes a real pain point.
- Small file problem: Streaming data involves frequent writes. Imagine creating a small file each time data is ingested—every minute or so in real-time streaming.
This adds millions of files to your data lake every day. After a few months, extracting and running analytics over these millions of files takes forever.
- Schema evolution: Streaming data doesn’t always come in a consistent format. It demands flexible schema evolution while maintaining data consistency.
Schema evolution means updating the current schema to accommodate changing data over time. Unfortunately, data lakes with unstructured storage systems lack schema evolution support.
These challenges make unstructured data lakes inefficient for high-scale streaming data. That’s where Apache Iceberg comes to the rescue. Let’s explore what it is and what it offers for streaming data.
What is Apache Iceberg?
Apache Iceberg isn’t an execution engine like Apache Spark, database software like MySQL, or a storage repository like a data lake. Iceberg is a table format that integrates with a data lake and gives your raw data files a structured format.
The Iceberg format includes schema evolution, partitioning, and version rollback features within the data lake. Its table format makes it easy for query and execution engines to smoothly interact with the data and perform necessary operations.
Netflix created Iceberg to address the performance and other challenges it encountered with Apache Hive tables. In 2018, the company donated Iceberg to the Apache Software Foundation to make it open source and accessible to all organizations.
Transform your data pipeline with Hevo’s no-code platform, designed to seamlessly transfer data from Apache Kafka to destinations such as BigQuery, Redshift, Snowflake, and many others. Hevo ensures real-time data flow without any data loss or coding required.
Why Choose Hevo for Kafka Integration?
- Simple Setup: Easily set up data pipelines from Kafka to your desired destination with minimal effort.
- Real-Time Syncing: Stream data continuously to keep your information up-to-date.
- Comprehensive Transformations: Modify and enrich data on the fly before it reaches your destination.
Let Hevo handle the integrations.
Get Started with Hevo for Free
Why Apache Kafka?
Data lakes work well for storing large volumes of structured and unstructured data. However, they struggle with streaming data: frequent updates, the management of many small files, and schema evolution make them inefficient for high-scale streaming. Apache Kafka addresses these challenges effectively.
1. Bulk updates and data consistency: Thanks to its log-based structure, Kafka enables real-time processing and consistent updates, with seamless event streaming and reliable data consistency.
2. Small file problem and schema evolution: Kafka mitigates the small file problem by accumulating streams into partitions, and a schema registry can be used to support schema evolution, handling changes in the data format gracefully.
By incorporating Kafka, data lakes can handle high-scale streaming data far more efficiently, as sketched below.
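To make this concrete, here is a minimal sketch of producing events to a Kafka topic in Java. The topic name and event fields mirror the ones used in the walkthrough later in this article; the class name and sample values are illustrative.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Each event is appended to the topic's log; the record key determines its partition.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String event = "{\"user_id\": 42, \"behavior\": \"click\", \"timestamp\": \"2024-01-01 10:00:00\"}";
            producer.send(new ProducerRecord<>("input-kafka-topic", "42", event));
        }
    }
}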
Apache Iceberg’s Approach to Streaming Data
Let’s first understand the two popular paradigms that Iceberg supports: MoR (Merge-on-Read) and CoW (Copy-on-Write). Then, we can conclude the right approach for streaming data use cases.
1. CoW (Copy-on-Write): In this approach, every time a modification is made, the entire data file is rewritten with the updated information, even if it involves updating just a single row.
This makes sense when updating a large number of records. However, rewriting the entire file just to update a few rows makes small and frequent changes expensive.
Before diving into the MoR(Merge-on-Read) approach, you should know what delete files are. Delete files log all the records that are logically deleted in the dataset so they can be ignored while querying the table.
2. MoR (Merge-on-Read): In MoR, the entire data table is not rewritten. Instead, a delete file is generated. This delete file keeps track of all the deleted records and ensures our queries skip these entries at the time of execution.
This works well for deleting data from a table. But what if you want to update rows in the table?
In such a case, the rows to be updated are listed in the delete file. A new data file is created with just the updated records. During query execution, the updated records are fetched from this new data file, while unchanged records are read from the original dataset.
Overall, in MoR, writing to files is faster as it simply writes the updates to the delete file, not rewriting the entire data file. This makes it suitable for streaming data where frequent updates are required.
For streaming use cases, Iceberg tables are therefore typically configured to use the Merge-on-Read (MoR) approach, which makes them highly adaptable to frequent, small writes.
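If you want to be explicit about this choice, the write modes can be set as table properties. Below is a minimal sketch using Iceberg’s Java API; it assumes you already have an org.apache.iceberg.Table handle named table loaded from your catalog, and uses the standard Iceberg format-v2 write-mode properties.
// Configure the table so deletes, updates, and merges use Merge-on-Read
// instead of the default Copy-on-Write (assumes `table` is an Iceberg Table handle).
table.updateProperties()
    .set("write.delete.mode", "merge-on-read")
    .set("write.update.mode", "merge-on-read")
    .set("write.merge.mode", "merge-on-read")
    .commit();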
Integrate Kafka to BigQuery
Integrate Kafka to Databricks
Integrate Kafka to Redshift
Benefits of Using Iceberg for Streaming Data Use Cases
The Iceberg table format has several benefits over storing plain raw files, especially for streaming data. Let’s look at what this format brings to the table.
Schema evolution
Streaming data is dynamic with frequent schema updates. If your data format doesn’t support schema evolution, schema changes can cause downtime, interrupt data flow, and lead to data loss.
Schema evolution means adopting the latest schema without affecting ongoing operations on the data. Iceberg supports schema evolution natively, making it highly suitable for streaming data.
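For example, adding a new column in Iceberg is a metadata-only change. Here is a minimal sketch using the Iceberg Java API; it assumes table is an org.apache.iceberg.Table handle, and the column name is purely illustrative.
import org.apache.iceberg.types.Types;

// Add a new optional column without rewriting any existing data files.
table.updateSchema()
    .addColumn("device_type", Types.StringType.get())
    .commit();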
ACID compliance
Many organizations use streaming data for real-time decision-making. When your streaming data complies with ACID properties, it stays accurate and consistent, allowing you to make confident, instant business decisions. This is crucial for critical transactions, such as financial data.
Time travel
Time travel allows you to investigate if something went wrong by digging into the data table at different points in time. Iceberg generates a snapshot every time you make a change to the table.
Each snapshot stores a previous state of your table. This way, Iceberg maintains a history of table snapshots, so you can specify a timestamp and view the table as it existed at that time.
This feature is helpful for data recovery, audit, and compliance purposes. Say you accidentally deleted crucial records from a table. You can instantly roll back to the previous snapshot and restore your original data.
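As a rough sketch with the Iceberg Java API (assuming table is already loaded, and that the snapshot ID is one you picked from the listing), the snapshot history can be inspected and rolled back like this:
import org.apache.iceberg.Snapshot;

// Inspect the snapshot history Iceberg keeps for the table.
for (Snapshot snapshot : table.snapshots()) {
    System.out.println(snapshot.snapshotId() + " created at " + snapshot.timestampMillis());
}

// Roll back to an earlier snapshot, e.g. the one taken before an accidental delete.
long previousSnapshotId = 1234567890L; // illustrative ID taken from the listing above
table.manageSnapshots().rollbackTo(previousSnapshotId).commit();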
Query performance
Iceberg format maintains up-to-date metadata and partition statistics files that contain statistical details about partitions in the data.
The query engine uses this information for efficient query planning. For example, if the engine knows the number of records in a partition ahead of time, it can allocate resources accordingly.
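In practice, this means a filtered query can skip whole files whose statistics show they contain no matching rows. A small illustrative Flink SQL example, reusing the tEnv and iceberg_table defined in the walkthrough later in this article:
// Iceberg's file-level min/max statistics let the planner skip files whose
// `timestamp` range cannot match the filter.
tEnv.executeSql(
    "SELECT behavior, COUNT(*) AS events FROM iceberg_table " +
    "WHERE `timestamp` >= TIMESTAMP '2024-01-01 00:00:00' " +
    "GROUP BY behavior"
);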
Reduced number of files
Each transaction made to a table generates a new file. As more changes are made to Iceberg tables, more files are created, which slows down queries as they need to scan more files.
Apache Iceberg supports compaction to address this issue. Compaction is a maintenance technique that merges small files into a few large files, reducing the total number of files.
This is particularly helpful for streaming data, as frequent updates generate a lot of small files. Compacting these into larger files makes your queries faster.
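Compaction is usually run as a periodic maintenance job. One way to do it from Flink, assuming the iceberg-flink module is on the classpath and table is a loaded org.apache.iceberg.Table (the target file size here is illustrative):
import org.apache.iceberg.flink.actions.Actions;

// Rewrite many small data files into fewer files of roughly 128 MB each.
Actions.forTable(table)
    .rewriteDataFiles()
    .targetSizeInBytes(128L * 1024 * 1024)
    .execute();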
Step-by-Step Guide for Streaming Data into Iceberg Tables
Apache Kafka, Apache Flink, and Apache Iceberg are the three main tools for moving data from the source into table format.
- Apache Kafka: The source for the streaming data, meaning you’ll read the data from here.
- Apache Flink: This processing framework supports both batch and stream processing.
- Apache Iceberg: The table format in which the processed data is stored.
Once you install and set up the above tools on your machine, you can build streaming data pipelines. We’ll show a sample code format for streaming data to Iceberg tables using Flink.
Step 1: Set up the Flink environment to use the DataStream API.
The settings variable in the code snippet below sets the execution to streaming mode.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
EnvironmentSettings settings = EnvironmentSettings.newInstance().inStreamingMode().build();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);
Now set up the Kafka connection properties.
import java.util.Properties;

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "my-group");
Step 2: Use the Flink Kafka connector (KafkaSource) to read data from Kafka topics.
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

// Read string records from the Kafka topic, starting from the earliest offset.
KafkaSource<String> source = KafkaSource.<String>builder()
    .setBootstrapServers("localhost:9092")
    .setTopics("input-kafka-topic")
    .setGroupId("my-group")
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build();
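If you want to process the records with the DataStream API before writing them out (the SQL-based steps below don’t require this), the source can be turned into a stream like so:
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

// Create a DataStream from the KafkaSource defined above.
DataStream<String> kafkaStream =
    env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");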
Step 3: Create a Kafka table with the structure of streaming data.
tEnv.executeSql(
"create table kafka_table (" +
" user_id BIGINT, " +
" behavior STRING, " +
" `timestamp` TIMESTAMP " +
") WITH (" +
" 'connector' = 'kafka'," +
" 'topic' = 'input-kafka-topic'," +
" 'properties.bootstrap.servers' = 'localhost:9092'," +
" 'properties.group.id' = 'my-group'," +
" 'format' = 'json'" +
")"
);
Step 4: Create the Iceberg table where you want to write the data.
tEnv.executeSql(
"CREATE TABLE iceberg_table (" +
" user_id BIGINT, " +
" behavior STRING, " +
" `timestamp` TIMESTAMP" +
") WITH (" +
" 'connector' = 'iceberg'," +
" 'catalog-name' = 'default'," +
" 'database-name' = 'db'," +
" 'table-name' = 'table'" +
")"
);
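Note that the Iceberg connector also needs to know which catalog and warehouse to write to. One common alternative is to register an Iceberg catalog first and create the table inside it; the sketch below assumes a Hadoop catalog with an illustrative local warehouse path.
// Register an Iceberg catalog (warehouse path is illustrative) and a database for the table.
tEnv.executeSql(
    "CREATE CATALOG iceberg_catalog WITH (" +
    " 'type' = 'iceberg'," +
    " 'catalog-type' = 'hadoop'," +
    " 'warehouse' = 'file:///tmp/iceberg/warehouse'" +
    ")"
);
tEnv.executeSql("USE CATALOG iceberg_catalog");
tEnv.executeSql("CREATE DATABASE IF NOT EXISTS db");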
Step 5: Insert data into the Iceberg table.
TableResult output = tEnv.executeSql("INSERT INTO iceberg_table SELECT * FROM kafka_table");
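One practical point: the Flink Iceberg sink commits new data files when a checkpoint completes, so for a continuously running INSERT you should enable checkpointing on the environment from Step 1 before submitting the job (the 60-second interval below is just an example).
// Commit Iceberg snapshots on each successful checkpoint (interval is illustrative).
env.enableCheckpointing(60_000);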
With the above approach, practical coding experience with multiple tech stacks is essential to build streaming data pipelines. However, with platforms like Hevo Data, both technical and business users can automate data movement between different platforms.
For those who don’t know, Hevo Data is a no-code, unified data platform that automatically syncs data from all your sources to the warehouse. This platform lets you load data from 150+ distinct sources to target destinations.
Try Hevo for free and automate the data movements from your source to destination.
Seamlessly load data from Kafka to BigQuery
No credit card required
Iceberg vs. Traditional Catalogs
| Feature | Iceberg | Traditional Catalogs |
| --- | --- | --- |
| Schema Management | Supports schema evolution without rewriting data | Limited support for schema evolution |
| ACID Transactions | Full ACID compliance (atomicity, consistency, isolation, durability) | Limited or no ACID support, leading to potential inconsistencies |
| Performance | Optimized for large-scale queries with minimal overhead | Performance can degrade with large datasets |
| Metadata Management | Efficient metadata storage and management | Metadata stored inefficiently, often leading to slower performance |
| Use Case | Ideal for modern cloud data lakes with large datasets | Suitable for smaller, simpler datasets |
Conclusion
Since it was open-sourced in 2018, Iceberg has become a go-to table format for many organizations, including Apple, Adobe, and Airbnb. These organizations deal with tons of real-time data. Iceberg not only resolves the performance issues of handling data at that scale but also helps with data governance.
In this blog, you’ve learned about the benefits of using Iceberg for streaming data and walked through sample code for streaming data from a source into Iceberg tables.
Get a demo with Hevo to simplify data integration. Also, check out Hevo’s unbeatable pricing or get a custom quote for your requirements.
FAQs
- Why is Apache Iceberg better than Hive?
Iceberg is an open table format designed to overcome Hive’s limitations. It provides schema evolution, time travel via snapshot isolation, and a consistent view of real-time data.
- Does Apache Iceberg support streaming?
Yes, Iceberg supports both batch and stream processing through Apache Flink’s DataStream API and Table API.
- What companies use Apache Iceberg?
Many industry giants like Apple, Airbnb, Netflix, Cloudera, and Adobe use Iceberg in their data lake architecture.
Srujana is a seasoned technical content writer with over 3 years of experience. She specializes in data integration and analysis and has worked as a data scientist at Target. Using her skills, she develops thoroughly researched content that uncovers insights and offers actionable solutions to help organizations navigate and excel in the complex data landscape.