The explosion of data from devices, applications, and systems has driven the need for scalable, efficient storage and analytics solutions. Amazon S3, known for its durability and flexibility, evolves further with S3 Tables, enabling businesses to query and analyze massive datasets directly from storage. This innovation eliminates the complexity of traditional infrastructure while powering advanced insights.
In this blog, we’ll uncover the potential of Amazon S3 Tables, exploring their features, setup, integrations, and use cases. Discover how this tool can transform your data management strategy and fuel smarter business decisions, from optimizing costs to enabling seamless analytics.
What Are Amazon S3 Tables?
Amazon Simple Storage Service (S3) is a highly scalable, secure, and cost-effective object storage service provided by AWS. Amazon S3 Tables leverage the underlying capabilities of S3 to organize, query, and manage vast datasets.
Amazon S3 Tables are datasets stored in S3 in a tabular format, enabling query-based access through tools like Amazon Athena and AWS Glue. These tables are ideal for big data analytics on structured and semi-structured data at scale.
Why Should You Care About Amazon S3 Tables?
So you’re wondering: why bother? Isn’t plain old S3 storage enough? The answer is simple: AWS S3 Tables turn your raw storage into a structured, queryable treasure trove. Here’s why:
1. Structure Meets Simplicity
Plain S3 buckets are like a filing cabinet with no folders. Everything’s just thrown in there. S3 Tables bring structure, categorizing your data into rows and columns, making it easy to retrieve.
2. SQL-Like Queries with S3 Tables
Ever wish you could use simple SQL commands to talk to your stored data? With S3 Tables, you can. They work with AWS Athena, a service that lets you run SQL queries on your data.
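For example, a quick Athena query against S3-backed data might look like the following; the table and column names here are hypothetical:
SELECT customer_id, SUM(order_total) AS total_spent
FROM sales_orders
WHERE order_date >= DATE '2024-01-01'
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 10;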
3. Cost-Effective Like Never Before
S3 Tables let you query only the data you need, saving you a ton of processing costs. Plus, by organizing your data better, you avoid scanning large datasets unnecessarily.
4. Better Integration with Other AWS Services
If you’re already using AWS services like SageMaker, EMR, or Redshift, S3 Tables fit right in, creating a smooth workflow for your data processing needs.
5. Unleash Big Data
Whether you’re managing a data lake, crunching analytics, or feeding data into machine learning models, S3 Tables provide the structure for efficient workflows.
Unlock the power of your Amazon S3 data with Hevo’s effortless integration. With Hevo’s no-code platform, you can set up connections in just a few clicks—no technical expertise required. Hevo has helped customers across 45+ countries connect their cloud storage to migrate data seamlessly. Hevo streamlines the process of migrating data by offering:
- Seamless data transfer between Amazon S3 and 150+ other sources.
- A risk management and security framework for cloud-based systems, with SOC 2 compliance.
- Always up-to-date data with real-time data sync.
Don’t just take our word for it—try Hevo and experience why industry leaders like Whatfix say, “We’re extremely happy to have Hevo on our side.”
Get Started with Hevo for Free
How Do Amazon S3 Tables Work?
S3 Tables bring together several components and processes:
- Storage: S3 Tables provide dedicated storage for structured data in Parquet format in S3.
- Table Creation: You create tables in a table bucket; they are first-class resources in S3.
- Data: Data is stored as Parquet files in S3.
- Metadata: S3 manages the metadata to make the Parquet data queryable by your applications.
- Permissions: To secure access, table-level permissions can be set using identity or resource-based policies.
- Compatibility: Tables are queryable by applications or tools that support the Apache Iceberg standard.
- Client Library: A client library is provided to help query engines navigate and update Iceberg metadata in the table.
- Data Write and Read: With new S3 APIs, multiple clients can safely read and write to the tables.
- Data Optimization: S3 will compact the Parquet data over time to improve query performance and reduce costs.
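To make this hierarchy concrete, here is a minimal AWS CLI sketch that creates a table bucket, a namespace, and a table; the names, region, and account ID are placeholders:
# Create a table bucket (a first-class S3 resource)
aws s3tables create-table-bucket --name my-table-bucket --region us-east-1
# Create a namespace inside the table bucket
aws s3tables create-namespace \
--table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket \
--namespace my_namespace
# Create an Iceberg table inside the namespace
aws s3tables create-table \
--table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket \
--namespace my_namespace \
--name my_table \
--format ICEBERG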
How Can You Create an Amazon S3 Table?
You can create an S3 table using the following steps:
Step 1: Create a table bucket and integrate it with AWS analytics services
- Sign in to your AWS Management Console and go to the Amazon S3 console.
- At the top of the page, click on the current AWS Region and select the one where you want to create the bucket.
- In the left navigation pane, click Table buckets.
- Click Create table bucket.
- In the Properties section, enter a unique name for your table bucket. The name should:
- Be unique within your AWS account in the selected region.
- Be between 3 and 63 characters.
- Only include lowercase letters, numbers, and hyphens.
- Start and end with a letter or number.
- Finally, click Create table bucket.
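If you prefer the terminal, you can confirm that the new table bucket exists with the AWS CLI; the region here is a placeholder:
aws s3tables list-table-buckets --region us-east-1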
Step 2: Create an Amazon EMR cluster and launch a Spark session
Note: For this step, you use the AWS CLI to launch an Amazon EMR cluster with Iceberg installed.
- Create a cluster with the following configuration.
aws emr create-cluster --release-label emr-7.5.0 \
--applications Name=Spark \
--configurations file://configurations.json \
--region us-east-1 \
--name My_Spark_Iceberg_Cluster \
--log-uri s3://amzn-s3-demo-bucket/ \
--instance-type m5.xlarge \
--instance-count 2 \
--service-role EMR_DefaultRole_V2 \
--ec2-attributes \
InstanceProfile=EMR_EC2_DefaultRole,SubnetId=subnet-1234567890abcdef0
Here, replace the user input placeholder values with your own.
- Create a configurations.json file with the following contents:
[{
"Classification":"iceberg-defaults",
"Properties":{"iceberg.enabled":"true"}
}]
- Connect to the cluster’s primary node using SSH.
- Enter the following command to launch the Spark shell and initialize a Spark session for Iceberg that connects to your table bucket.
spark-shell \
--packages software.amazon.s3tables:s3-tables-catalog-for-iceberg-runtime:0.1.3 \
--conf spark.sql.catalog.s3tablesbucket=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.s3tablesbucket.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog \
--conf spark.sql.catalog.s3tablesbucket.warehouse=arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-bucket1 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
Replace the user input placeholder value with your table bucket ARN.
Step 3: Create a table and load data
- Use the following Spark SQL command to create a new namespace.
spark.sql("CREATE NAMESPACE IF NOT EXISTS s3tablesbucket.example_namespace")
- Create a new Iceberg table.
spark.sql(
""" CREATE TABLE IF NOT EXISTS s3tablesbucket.example_namespace.`example_table` (
id INT,
name STRING,
value INT
)
USING iceberg """
)
- Load data into the table using the INSERT command.
spark.sql(
"""
INSERT INTO s3tablesbucket.example_namespace.example_table
VALUES
(1, 'ABC', 100),
(2, 'XYZ', 200)
""")
Step 4: Query data with SQL
You can query the table within your Spark session or by using supported AWS analytics engines. The following is a sample Spark query:
spark.sql(""" SELECT *
FROM s3tablesbucket.my_namespace.`my_table` """).show()
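If your table bucket is integrated with AWS analytics services, the same table can also be queried from Athena. As a rough sketch (the s3tablescatalog naming convention is worth verifying against the Athena documentation):
SELECT *
FROM "s3tablescatalog/amzn-s3-demo-bucket1"."example_namespace"."example_table"
LIMIT 10;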
Note: For more information on querying S3 Tables from other engines, see the Amazon Athena and Amazon Redshift documentation.
S3 Table Best Practices: The Secret Sauce to Best Results
Follow these best practices to get the most out of your AWS S3 Tables:
1. Partition Your Data
Partitioning breaks your data into smaller pieces, making queries faster and cheaper. For example, if you’re storing sales data, partitioning by year and month will speed up date-specific queries, as the sketch below shows.
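For instance, in the Spark session from the steps above, an Iceberg table partitioned by month could be created like this; the table and column names are illustrative:
spark.sql(
""" CREATE TABLE IF NOT EXISTS s3tablesbucket.example_namespace.sales (
id INT,
amount DOUBLE,
sale_date DATE
)
USING iceberg
PARTITIONED BY (months(sale_date)) """
)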
2. Use the Right File Format
File formats like Parquet and ORC are designed for analytics workloads; they compress data well and speed up queries. S3 Tables store table data as Parquet by default.
3. Automate with Glue
Use AWS Glue Jobs to automate schema updates, data transformations, and cleanup tasks. This keeps your tables up to date without manual intervention.
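For example, a transformation or cleanup job defined in Glue can be triggered on a schedule or on demand from the CLI; the job name here is hypothetical:
aws glue start-job-run --job-name my-s3-table-cleanup-job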
4. Monitor and Optimize
Use Cost Explorer to see how much you’re spending. Check query logs to find expensive queries and optimize by filtering or limiting data scans.
5. Secure Your Data
Leverage S3’s robust security features, such as encryption, bucket policies, and IAM roles, to protect your data.
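As an illustration, an identity-based policy granting read-only access to the tables in a bucket might look like the following; the actions shown are a representative subset of the s3tables permissions, and the ARN is a placeholder:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3tables:GetTable", "s3tables:GetTableData"],
    "Resource": "arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-bucket1/table/*"
  }]
}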
How to Maintain AWS S3 Tables?
S3 offers various maintenance options for enhancing the performance of your table.
1. Compaction
Compaction in Iceberg combines smaller files into larger ones to improve query performance, and it also applies row-level deletes. On Amazon S3, compaction creates files based on an optimal or specified target size (default: 512 MB). The compacted files become the latest table snapshot, and compaction is enabled by default.
To configure the compaction target file size by using the AWS CLI, run the following command:
aws s3tables put-table-maintenance-configuration \
--table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-bucket1 \
--type icebergCompaction \
--namespace mynamespace \
--name testtable \
--value='{"status":"enabled","settings":{"icebergCompaction":{"targetFileSizeMB":512}}}'
2. Snapshot Management
Snapshot management determines the number of active snapshots for your table. By default, at least one snapshot is kept (minSnapshotsToKeep: 1) and snapshots older than 120 hours are expired (maxSnapshotAgeHours: 120). To configure snapshot management by using the AWS CLI, run the following command:
aws s3tables put-table-maintenance-configuration \
--table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-table-bucket \
--namespace my_namespace \
--name my_table \
--type icebergSnapshotManagement \
--value '{"status":"enabled","settings":{"icebergSnapshotManagement":{"minSnapshotsToKeep":10,"maxSnapshotAgeHours":2500}}}'
Real-world Applications of Amazon S3 Tables
1. Big Data Analytics
Companies drowning in terabytes of customer, sales, or operational data find S3 Tables to be a game-changer. They let data analysts query massive datasets quickly, surfacing patterns, trends, and insights that drive smarter decision-making.
Example: A retail giant might use S3 Tables to analyze purchasing trends during the holiday season, helping it optimize inventory and marketing strategies.
2. Data Lakes
A data lake is essentially a large repository of raw data in multiple formats. Data lakes can easily become “data swamps” if improperly structured. S3 Tables bring order to the chaos, making data access fast and efficient.
Example: A health company might use S3 Tables to structure and query patient data for research purposes but preserve the raw data for compliance.
3. Machine Learning Pipelines
Machine learning models thrive on clean, structured data. S3 Tables help streamline the process of preparing data for ML pipelines, thus saving time and effort in preprocessing.
Example: A fintech startup might use S3 Tables to organize transactional data before feeding it into a fraud detection model.
4. Log Analysis
Web servers, applications, and systems generate logs that are typically kept in S3. S3 Tables enable simple analysis of these logs for performance monitoring, troubleshooting, or security auditing.
Example: A SaaS provider can use S3 Tables to identify and rectify bottlenecks within its application using log data.
Summing It Up
Amazon S3 Tables are more than just a feature—they’re a game-changer for anyone looking to simplify and supercharge their data storage and analysis workflows. They unlock incredible potential for querying, analyzing, and managing data at scale. Whether you’re a data engineer managing a data lake, a business analyst running analytics, or a developer building machine learning pipelines, S3 Tables are your go-to solution for cost-efficient, scalable, and high-performance data operations.
With tools like Hevo, you can take things further. Hevo’s no-code platform seamlessly integrates Amazon S3 with your favorite data warehouses, automating workflows and ensuring smooth data pipelines. Together, S3 and Hevo help you save time, reduce costs, and maximize the value of your data.
Ready to transform your data workflows? Dive into the world of S3 Tables and experience effortless data management today! Sign up for Hevo’s 14-day free trial and experience seamless S3 data integration.
FAQ on AWS S3 Tables
Q1: Are S3 Tables a replacement for databases?
No, S3 Tables complement databases. They’re ideal for analytical workloads and large-scale data storage but aren’t designed for transactional use cases like traditional databases.
Q2: Do I need to know SQL to use S3 Tables?
While SQL knowledge helps, tools like AWS Glue and Athena offer user-friendly interfaces, making S3 Tables accessible even to non-technical users.
Q3: How do S3 Tables save costs?
By letting you query only what you need, S3 Tables minimize data scanning and processing costs. File compression and partitioning further enhance cost efficiency.