According to the World Economic Forum, by 2025, the world is expected to generate 463 exabytes of data each day. Here are some key daily statistics:

  • 500 million tweets are sent
  • 294 billion emails are sent
  • 4 petabytes of data is created on Facebook
  • 4 terabytes of data is created from each connected car
  • 65 billion messages are sent on WhatsApp
  • 5 billion searches are made

For over a decade, the Hive table format has been a cornerstone of the big data ecosystem, efficiently managing vast amounts of data. However, as data volumes and diversity grow, users encounter challenges with Hive’s outdated directory-based format, such as limited schema evolution, static partitioning, and prolonged planning times due to S3 directory listings.

This article aims to provide a comprehensive understanding of the Iceberg table format, explaining its core concepts, benefits, and practical applications.

What is Apache Iceberg?

Apache Iceberg is an open table format designed for managing petabyte-scale analytic tables. Initially developed at Netflix to address challenges with massive tables, it was open-sourced in 2018 as an Apache Incubator project.

It serves as an abstraction layer between the physical data files (such as those written in Parquet or ORC) and their organization into a table structure. This format addresses key challenges in data lakes, making data management more efficient and reliable, and has become a promising replacement for the traditional Hive table format. With over 25 million terabytes of data estimated to be stored in Hive tables, migrating to Iceberg can deliver significant performance and cost improvements.

Getting started with Iceberg table format

Prerequisites

Before you begin, ensure you have the following installed on your machine:

  • Java Development Kit (JDK) 8 or later
  • Apache Spark 3.0 or later
  • Python 3.6 or later
  • Gradle (for building Iceberg; the repository ships a Gradle wrapper, so a separate installation is optional)

Installation guide

Step 1: Download Apache Spark
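For example, you can fetch the same Spark 3.1.2 build that the next step extracts from the Apache archive (the URL is an assumption; any mirror or a newer 3.x release works as well):

wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz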

Step 2: Extract the Spark archive

tar -xvf spark-3.1.2-bin-hadoop3.2.tgz

Step 3: Set environment variables

export SPARK_HOME=~/spark-3.1.2-bin-hadoop3.2
export PATH=$SPARK_HOME/bin:$PATH

Step 4: Install PySpark

pip install pyspark

Step 5: Clone Apache Iceberg repository

git clone https://github.com/apache/iceberg.git
cd iceberg

Step 6: Build Iceberg using Gradle

./gradlew build

Step 7: Create a configuration file for Spark (e.g., ‘spark-defaults.conf’)

spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.spark_catalog.type=hadoop
spark.sql.catalog.spark_catalog.warehouse=/path/to/w

Step 8: Start PySpark with Iceberg

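With the configuration above in place, the Iceberg runtime still needs to be on the classpath when the shell starts. A minimal sketch, assuming the iceberg-spark-runtime artifact that matches a Spark 3.1 / Scala 2.12 build (adjust the version to a current release):

pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.2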

Creating and querying your first Iceberg table

1. Initialize a SparkSession with Iceberg:

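A minimal sketch that mirrors the catalog settings from Step 7 (the warehouse path is a placeholder; if these options are already set in spark-defaults.conf, a plain SparkSession.builder.getOrCreate() is enough):

from pyspark.sql import SparkSession

# Configure Iceberg's SQL extensions and a Hadoop-type catalog, mirroring spark-defaults.conf
spark = (
    SparkSession.builder
    .appName("iceberg-quickstart")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hadoop")
    .config("spark.sql.catalog.spark_catalog.warehouse", "/path/to/w")
    .getOrCreate()
)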

2. Create a table:

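For example, a small table in a hypothetical db namespace (the names are placeholders; the same table is reused in the examples below):

# Create a namespace and an Iceberg table with a couple of columns
spark.sql("CREATE NAMESPACE IF NOT EXISTS spark_catalog.db")
spark.sql("""
CREATE TABLE IF NOT EXISTS spark_catalog.db.table (
    id INT,
    data STRING
)
USING iceberg
""")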

3. Insert data into the table:

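A few sample rows, matching the columns created above:

# Append three rows; the INSERT commits as a new table snapshot
spark.sql("""
INSERT INTO spark_catalog.db.table VALUES
    (1, 'a'),
    (2, 'b'),
    (3, 'c')
""")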

4. Query the table:

spark.sql("SELECT * FROM spark_catalog.db.table").show()

Features of Iceberg table format

1. Schema evolution

Iceberg provides robust support for schema evolution, ensuring that changes to the table schema are applied without breaking compatibility. Schema changes are isolated and do not cause unintended side effects. Each field in the schema is uniquely identified by an ID; because Iceberg reads data using these IDs rather than field names, columns can be renamed without affecting existing data files.

Examples of modifying table schemas:

# Adding a new column
spark.sql("""
ALTER TABLE spark_catalog.db.table ADD COLUMN new_col STRING
""")
# Deleting an existing column
spark.sql("""
ALTER TABLE spark_catalog.db.table DROP COLUMN new_col
""")
# Renaming an existing column
spark.sql("""
ALTER TABLE spark_catalog.db.table RENAME COLUMN data TO new_data
""")
# Changing an 'int' column to 'long'
spark.sql("""
ALTER TABLE spark_catalog.db.table ALTER COLUMN id TYPE BIGINT
""")

2. Hidden partitioning and partition evolution

Iceberg’s hidden partitioning feature allows the partition specification to evolve without disrupting table integrity. This means you can adjust how data is partitioned (for example, changing the granularity or the partition columns) without breaking the table. Because this is a metadata-only operation rather than a rewrite of existing files, old and new data can coexist. Iceberg achieves this through split planning: it plans the query separately for the old and new specifications and then merges the results into a cohesive table view.

Hidden partitioning and partition evolution

In this figure, the ‘example_table’ is initially partitioned by month(date) until 2020-01-01, after which it switches to day(date). The old data remains in the previous partition format, while the new data adopts the new format. When the query is executed, Iceberg performs split planning for each partition specification, filtering partitions under both specifications by applying the appropriate transform (month or day) to the date column.
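The switch illustrated in the figure can be expressed with Iceberg's Spark SQL extensions; a sketch using the figure's hypothetical example_table and its date column:

# Evolve the partition spec: stop partitioning by month and start partitioning by day.
# Existing files keep the old spec; only newly written data uses the new one.
spark.sql("ALTER TABLE spark_catalog.db.example_table DROP PARTITION FIELD months(date)")
spark.sql("ALTER TABLE spark_catalog.db.example_table ADD PARTITION FIELD days(date)")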

3. Time travel and rollback

The time travel feature of Iceberg allows querying data snapshots at different timestamps and supports easy rollback to previous states, enhancing data auditing, debugging, and reliability. Snapshots capture the state of a table at a specific point in time. Time travel allows users to query historical data by referencing these snapshots.

Snapshot log data can be accessed using Spark:

# Specify the Iceberg table
iceberg_table = "spark_catalog.default.booking_table"
# Load the snapshot log (the 'snapshots' metadata table)
snapshot_log_df = spark.read.format("iceberg").load(f"{iceberg_table}.snapshots")
snapshot_log_df.show(truncate=False)

The result is a table of snapshot entries with columns such as committed_at, snapshot_id, parent_id, operation, and summary.

If you want to roll back your table to an earlier version:

# snapshot_id to roll back to (example value)
rollback_snapshot_id = 1234567891
# Roll back the Iceberg table to that snapshot using the rollback_to_snapshot procedure
spark.sql(f"CALL spark_catalog.system.rollback_to_snapshot('default.booking_table', {rollback_snapshot_id})")
# Querying historical data as of a specific snapshot
snapshot_id = 1234567892
spark.read.option("snapshot-id", snapshot_id).format("iceberg").load(iceberg_table).show()

4. Manifest files

Manifest files list the data files that make up a table snapshot. Each entry includes metadata about its data file, such as partition values and column-level statistics, which Iceberg uses to prune files during query planning and execution.
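Manifests can be inspected directly through Iceberg's metadata tables in Spark; a quick look, reusing the hypothetical table from the earlier examples:

# Each row describes one manifest file tracked by the current snapshot
spark.sql("SELECT * FROM spark_catalog.db.table.manifests").show(truncate=False)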

5. ACID transactions

Iceberg ensures data consistency and supports ACID (Atomicity, Consistency, Isolation, and Durability) transactions, making it reliable for concurrent data operations.
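For instance, row-level DELETE, UPDATE, and MERGE INTO statements are available through the Spark SQL extensions, and each commits atomically as a new snapshot (a sketch reusing the hypothetical table from the earlier examples):

# DELETE runs as an atomic commit that produces a new snapshot;
# concurrent writers rely on optimistic concurrency and retry on conflict
spark.sql("DELETE FROM spark_catalog.db.table WHERE id = 3")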

Challenges and Considerations for Iceberg Table Format

Common pitfalls

  • Misconfigured environments: Iceberg’s evolving APIs and compatibility issues with existing systems may pose challenges during adoption.
  • Improper schema management: Managing schema changes effectively is crucial; improper handling can lead to data integrity issues or performance degradation.

Best practices

  • Regularly update Iceberg versions
  • Monitor and optimize query performance

For reliable best practices and guidelines, check the official Apache Iceberg documentation and community resources.

Real-world Applications and Brief Case Studies of Companies using Iceberg

Netflix and Adobe are notable companies that use Iceberg to manage their vast data lakes efficiently.

1. Iceberg at Netflix

Netflix created Iceberg in 2018 to address the performance issues and usability challenges of using Apache Hive tables.

Key Benefits utilized by Netflix:

  1. Schema evolution: The support for schema evolution of Iceberg allows Netflix to modify table schemas without downtime, which is crucial for continuous data availability and evolving business requirements.
  2. Performance optimization: Efficient file pruning and metadata management helped improve query performance for Netflix’s large datasets.
  3. Transactional consistency: Netflix relies on Iceberg’s transactional guarantees to ensure data consistency across various data operations, and to maintain data integrity.
  4. Integration with existing workflows: Iceberg integrates seamlessly with Netflix’s existing data processing frameworks, effectively supporting both batch and streaming data workflows.

Partition Entropy (PE) / File Size Entropy (FSE): Netflix introduced a concept called Partition Entropy (PE) to optimize data processing further. This includes File Size Entropy (FSE), which uses Mean Squared Error (MSE) calculations to manage partition file sizes efficiently.

  • Definition of FSE/MSE: FSE/MSE for a partition is calculated using:

MSE = (1/N) * Σ_{i=1}^{N} (Target - min(Actual_i, Target))^2

Where N is the number of files in the partition, Target is the target file size, and Actual_i is the size of each file.

  • Implementation in workflow: During snapshot scans, Netflix updates MSE' for changed partitions using:

MSE' = (1/(N + M)) * (N * MSE + Σ_{i=1}^{M} (Target - min(Actual_i, Target))^2)

Where M is the number of files added in the snapshot. A tolerance threshold T is applied, and further processing is skipped if MSE' < T^2, reducing full partition scans and merge operations.

Netflix integrates PE/FSE calculations seamlessly with the metadata management and optimization capabilities of Iceberg.
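A rough Python sketch of the idea (not Netflix's actual implementation; the target size and helper names are illustrative):

# Illustrative target file size (e.g., 512 MB); the real value is a tuning choice
TARGET = 512 * 1024 * 1024

def file_size_mse(file_sizes, target=TARGET):
    # Files at or above the target contribute zero error; undersized files are penalized
    return sum((target - min(size, target)) ** 2 for size in file_sizes) / len(file_sizes)

def incremental_mse(prev_mse, n_existing, new_file_sizes, target=TARGET):
    # Update the running MSE with the M files added by a snapshot, without rescanning the partition
    added_error = sum((target - min(size, target)) ** 2 for size in new_file_sizes)
    return (n_existing * prev_mse + added_error) / (n_existing + len(new_file_sizes))

def needs_compaction(mse, tolerance=0.1 * TARGET):
    # Skip further processing when the partition's MSE is within tolerance (MSE < T^2)
    return mse >= tolerance ** 2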

2. Iceberg at Adobe: A Case Study

Adobe, a global leader in digital media and marketing solutions, has embraced Apache Iceberg to streamline its data management processes. This case study highlights how Adobe integrated Iceberg into its data infrastructure and the benefits realized from this adoption.

The Challenge

Adobe faced significant challenges with its legacy data management systems, particularly with data consistency, schema evolution, and efficient querying across large datasets. Traditional file formats and table management systems struggled to meet Adobe’s growing data needs, leading to inefficiencies and increased operational overhead.

The Solution

To address these challenges, Adobe implemented Apache Iceberg to manage large analytic datasets. Iceberg provided a robust solution to the data management issues faced by Adobe through its schema evolution, partitioning, and time travel capabilities.

Key Benefits

  1. Improved data consistency and reliability: The support for ACID transactions ensured that data operations were consistent and reliable, significantly reducing data corruption issues.
  2. Efficient schema evolution: With Iceberg, Adobe could easily evolve its data schemas without downtime or complex migrations, allowing for more agile and responsive data management.
  3. Optimized query performance: Advanced partitioning and metadata management optimized query performance, enabling faster and more efficient data retrieval.
  4. Enhanced data governance: The comprehensive metadata layer and time travel capabilities of Iceberg improved data governance, allowing Adobe to audit and revert data changes as needed.

The case study illustrates how innovative features provided by Apache Iceberg significantly enhance data consistency, performance, and governance. The case study can help other organizations facing similar challenges to explore the benefits of Iceberg and consider its adoption in their data infrastructure.

Conclusion

This article introduced Apache Iceberg, guiding readers through its installation and table format to build a solid understanding of efficient data management. Iceberg has significantly helped reduce complexity for data scientists and engineers by streamlining and accelerating the end-to-end data pipeline. The ongoing innovations and improvements in Iceberg, driven by an active and collaborative community, are continually refining its capabilities.

Organizations like Adobe and many others are contributing to the growth and development of Iceberg. This collaborative effort within the larger Apache Open Source community is vital for tackling emerging challenges and introducing new features. As Iceberg continues to evolve, the community’s dedication and contributions ensure that it remains at the forefront of modern data management solutions, offering robust and scalable options for the future.

Follow the Iceberg Official Site for the latest news and more resources.

Schedule a demo with Hevo to automate data pipelines for efficient data movement.

Radhika Gholap
Data Engineering Expert

Radhika has over three years of experience in data engineering, machine learning, and data visualization. She is an expert at creating and implementing data processing pipelines and predictive analysis. Her knowledge of Big Data technologies, Python, SQL, and PySpark helps her address difficult data challenges and achieve excellent results. With a Master's degree in Data Science from Lancaster University, she uses her analytical skills to develop insightful and engaging technical content for the data business.