Amazon S3 is a prominent data storage platform with multiple storage and security features. Integrating data stored in Amazon S3 into a lakehouse platform like Databricks can enable better data-driven decisions. Because Databricks offers a collaborative environment, you can quickly and cost-effectively build machine-learning applications with your team.

Integrating data from Amazon S3 to Databricks makes it easier to build these ML applications, as Databricks provides an interactive notebook option. With data present on Databricks, you can deploy your application on multiple platforms, including Azure, AWS, and GCP.

This article will explore six popular methods to connect Amazon S3 to Databricks.

Why Migrate Amazon S3 to Databricks?

Migrating data from Amazon S3 to Databricks offers several advantages:

  • This integration gives you an additional copy of your data, improving protection and security, and lets you share data across platforms using Delta Sharing.
  • Moving data from Amazon S3 to Databricks lets you filter and clean your data and perform advanced analytics on it.
  • Databricks provides a data warehousing platform for quickly performing complex queries. Features like Mosaic AI enable machine-learning capabilities for model training, feature management, and model serving.
  • Integrating Amazon S3 with Databricks can help you process your data faster. Databricks provides capabilities such as massively parallel processing that can improve performance.
  • Moving data from Amazon S3 to Databricks enables users to manage big data with Apache Spark’s help, as Databricks has native integration with Apache Spark.

An Overview of Amazon S3

Amazon S3 is a storage service on the AWS cloud ecosystem that offers availability, security, scalability, and high performance. It enables you to store and protect data that you can use for any use case, including building data lakes, mobile applications, and cloud-native applications.

Using Amazon S3, you can manage large volumes of data, organize it, and set up fine-tuned access controls to meet business needs. It also supports building high-performance computing applications cost-effectively. To stream your data to Amazon S3, you can refer to real-time data streaming to S3.

An Overview of Databricks

Databricks is a data lakehouse platform that enables you to perform SQL and performance enhancement tasks, including caching, indexing, and massively parallel processing (MPP). It provides a cost-effective solution for building AI models while maintaining control, quality, and privacy.

Databricks provides a cloud-based data warehouse service for serverless data management, so you won’t need to worry about data infrastructure. You can search the data using natural language and monitor and observe it with an AI-driven solution.

Methods to Load Data from Amazon S3 to Databricks

Are you wondering how to load Amazon S3 data in Databricks? This section discusses the most prominent methods to connect Amazon S3 to Databricks.

Method 1: Using Hevo to Sync Amazon S3 to Databricks

Hevo is a no-code, real-time ELT platform that provides cost-effective solutions to automate data pipelines according to your requirements. It enables you to integrate data instantly from 150+ sources (including 40+ free sources) and load it into the destination of your choice.

Here are some of the critical advantages of using Hevo for data integration:

  • Data Transformation: Hevo provides Python-based and drag-and-drop transformation techniques to clean and prepare your data for analysis.
  • Automated Schema Mapping: Hevo automates schema management by detecting the incoming data format and replicating it in a form compatible with the destination schema. It lets you choose between Full and Incremental Mappings according to your data replication requirements.
  • Incremental Data Load: With Hevo, you can transfer modified data in real-time, ensuring efficient bandwidth utilization at both the source and destination.

Configure Amazon S3 as a Source

This section will discuss the steps to set up Amazon S3 as a source in Hevo Data. But before proceeding, you must satisfy the prerequisites.


Prerequisites:

After satisfying the prerequisites, you can follow these steps:

  • Select PIPELINES in the Navigation Bar and click on + CREATE from the Pipeline List View.
  • Select S3 in the Select Source Type page.
  • Specify the mandatory fields on the Configure your S3 Source page.

Amazon S3 to Databricks: Configure your S3 source

  • Select TEST & CONTINUE.
  • Specify the necessary fields in the Data Root section.

Amazon S3 to Databricks: Data Root Section

  • Select CONFIGURE SOURCE and proceed with configuring the destination of the data pipeline.

To learn more about the steps involved in configuring Amazon S3 as a source, refer to Hevo Data Amazon S3 Documentation.

Configure Databricks as a Destination

Following the steps in this section, you can quickly configure Databricks as your data pipeline destination. Hevo supports Databricks integration on the AWS, Azure, and GCP platforms. You can integrate Databricks as a destination on Hevo in two ways: using the recommended Databricks Partner Connect or using Databricks credentials.

Before getting started, you must ensure you satisfy the prerequisites.


Prerequisites:

After satisfying all the prerequisite conditions, follow the steps below:

  • Select DESTINATIONS from the Navigation Bar and click + CREATE in the Destinations List View.
  • Select Databricks on the Add Destination page.
  • Specify the mandatory details on the Configure your Databricks Destination page.

Amazon S3 to Databricks: Configure your Databricks Destination

  • Finally, click on TEST CONNECTION and select SAVE & CONTINUE. You can follow the instructions to identify the external location for Delta Tables.

To learn more about the steps involved in configuring Databricks as a destination, refer to Hevo Data Databricks Documentation.

Method 2: Accessing S3 Data in Databricks Using Apache Spark

This method uses Apache Spark to access Amazon S3 data and move it to Databricks. You can read Amazon S3 data into Databricks tables by following these steps:

You can use this code snippet in the cluster’s Spark configuration to expose the AWS keys stored in secret scopes as environment variables:

AWS_SECRET_ACCESS_KEY={{secrets/scope/aws_secret_access_key}}
AWS_ACCESS_KEY_ID={{secrets/scope/aws_access_key_id}}
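Alternatively, as a sketch using the same secret-scope reference syntax (the scope and key names are the placeholders from the snippet above), you can pass the keys to the S3A connector directly as Spark configuration properties rather than as environment variables:

```
spark.hadoop.fs.s3a.access.key {{secrets/scope/aws_access_key_id}}
spark.hadoop.fs.s3a.secret.key {{secrets/scope/aws_secret_access_key}}
```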

Finally, you can use the following command to read data from Amazon S3:

# Read Delta-format data from the bucket into a DataFrame
aws_bucket_name = "my-s3-bucket"
df = spark.read.load(f"s3a://{aws_bucket_name}/flowers/delta/")
display(df)

# List the bucket's contents to verify access
dbutils.fs.ls(f"s3a://{aws_bucket_name}/")

To learn more about the steps mentioned, refer to Connect to Amazon S3.

Limitations of Using Apache Spark to Connect Amazon S3 to Databricks

Although integrating Amazon S3 with Databricks using Apache Spark is efficient, the method has some limitations:

  • Complexity: This method requires prior technical knowledge to integrate data from Amazon S3 to Databricks. You may need to learn concepts such as secret scopes and Spark configuration along the way, which takes additional time.
  • Lack of Real-Time Integration: Accessing Amazon S3 data through Apache Spark happens on demand rather than continuously, which introduces latency into the transfer process; this method therefore does not provide real-time synchronization.

Method 3: Access Amazon S3 Bucket Using Instance Profiles

With this method, you attach an IAM role to your Databricks cluster as an instance profile. The cluster then authenticates to Amazon S3 with that role’s permissions, so no access keys need to appear in your notebooks. To set this up, create an IAM role with access to the target bucket, register it as an instance profile in the Databricks admin settings, and attach it to your cluster.
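At its core, the instance profile’s IAM role needs S3 permissions on the target bucket. The sketch below builds a hypothetical minimal policy document; the bucket name and the exact set of actions are illustrative assumptions, and your role may require more (for example, `s3:PutObjectAcl`):

```python
import json

# Hypothetical minimal IAM policy for the role behind the instance profile.
# "my-s3-bucket" is a placeholder bucket name.
bucket = "my-s3-bucket"
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Listing requires permission on the bucket itself
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}"],
        },
        {
            # Object-level reads and writes require permission on bucket/*
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": [f"arn:aws:s3:::{bucket}/*"],
        },
    ],
}
print(json.dumps(policy, indent=2))
```

Once the role is attached to the cluster, notebooks can read `s3a://` paths directly, with no keys in code.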

Limitations of Using Instance Profiles to Connect Amazon S3 to Databricks

  • Security Issues: Since one IAM role is associated with the instance profile, all users on a cluster using that profile can share the same access permissions for that role.

Method 4: Integrating Amazon S3 with Databricks Using Hadoop

This method highlights the integration of data from Amazon S3 with Databricks using Hadoop. Follow the steps below to do so:

  • You must configure the S3A filesystem using open-source Hadoop options in the Databricks Runtime.
    • To configure the global properties, follow this code snippet:
# Global S3 configuration
spark.hadoop.fs.s3a.aws.credentials.provider <aws-credentials-provider-class>
spark.hadoop.fs.s3a.endpoint <aws-endpoint>
spark.hadoop.fs.s3a.server-side-encryption-algorithm SSE-KMS
  • To configure per-bucket properties, replace the <placeholders> in the syntax below with the specific values you want.
# Set up authentication and endpoint for a specific bucket
spark.hadoop.fs.s3a.bucket.<bucket-name>.aws.credentials.provider <aws-credentials-provider-class>
spark.hadoop.fs.s3a.bucket.<bucket-name>.endpoint <aws-endpoint>

# Configure a different KMS encryption key for a specific bucket
spark.hadoop.fs.s3a.bucket.<bucket-name>.server-side-encryption.key <aws-kms-encryption-key>
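As a sketch of how the per-bucket property names compose: each per-bucket key is the corresponding global `fs.s3a.*` key with `bucket.<bucket-name>.` spliced in after the `fs.s3a.` prefix. The helper and the bucket name below are hypothetical, for illustration only:

```python
# Build a per-bucket S3A property name from its global counterpart.
# Pattern: spark.hadoop.fs.s3a.<suffix> -> spark.hadoop.fs.s3a.bucket.<bucket-name>.<suffix>
def per_bucket_key(bucket: str, suffix: str) -> str:
    return f"spark.hadoop.fs.s3a.bucket.{bucket}.{suffix}"

# Hypothetical bucket name for illustration
key = per_bucket_key("sales-data", "endpoint")
print(key)  # spark.hadoop.fs.s3a.bucket.sales-data.endpoint
```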

Limitations of Using Hadoop to Connect Amazon S3 to Databricks

  • File Size Limitation: Hadoop works best with large files, but a single PUT upload to Amazon S3 is limited to 5 GB; larger objects must be written with multipart uploads, which complicates transfers.
  • Time Consumption: The S3A connector, which makes an Amazon S3 bucket appear as a Hadoop-compatible filesystem, can slow down data transfer compared with a native filesystem, consuming more time.

Method 5: Onboard Data from Amazon S3 to Databricks Using Unity Catalog

This method discusses onboarding data from Amazon S3 to Databricks using Unity Catalog. Before getting started, ensure you satisfy the given prerequisites.

Prerequisites:

After satisfying the prerequisites, you can follow the steps given below to move data from Amazon S3 to Databricks.

  • On the sidebar, click New and then select Notebook.
  • Enter the name of the notebook.
  • Click on the language option and select Python. Paste the following code in the notebook cell by replacing the placeholder values:
import dlt

@dlt.table(table_properties={'quality': 'bronze'})
def <table-name>():
    return (
        spark.readStream.format('cloudFiles')
        .option('cloudFiles.format', '<file-format>')
        .load('<path-to-source-data>')
    )

Limitations of Onboarding Data from Amazon S3 to Databricks using Unity Catalog

  • Lack of ML Support: Unity Catalog’s shared access mode does not support Spark-submit jobs or machine learning libraries such as Databricks Runtime ML and the Spark Machine Learning Library (MLlib).
  • Restrictions on Views: Unity Catalog’s single-user access mode restricts the use of dynamic views and the querying of tables created by a Delta Live Tables pipeline.

Method 6: Mount an S3 Bucket to Databricks

This method highlights how to mount an S3 bucket to Databricks. Follow the steps given below:

  • You must download the access keys CSV file from AWS and upload it to a DBFS location.
  • You must read the file as a Spark DF using the given syntax.
aws_keys_df = spark.read.load("/FileStore/tables/<Your Folder>/<File Name.csv>", format="csv", inferSchema="true", header="true")
  • You must extract and store the values of access and secret keys in two different variables.
ACCESS_KEY = aws_keys_df.select('Access key ID').take(1)[0]['Access key ID']
SECRET_KEY = aws_keys_df.select('Secret access key').take(1)[0]['Secret access key']
  • You must URL-encode the secret key by importing the urllib.parse module.
import urllib.parse
ENCODED_SECRET_KEY = urllib.parse.quote(string=SECRET_KEY, safe="")
  • You must create two variables, one for the S3 bucket name and another for the mount name.
AWS_S3_BUCKET = '<your s3 bucket name>'
MOUNT_NAME = '/mnt/mount_s3'
  • You can now build the source URL from these values and mount the bucket to connect S3 to Databricks.
SOURCE_URL = 's3a://%s:%s@%s' %(ACCESS_KEY,ENCODED_SECRET_KEY,AWS_S3_BUCKET)
dbutils.fs.mount(SOURCE_URL, MOUNT_NAME)
  • You can check for the connection by following the given syntax:
%fs ls '/mnt/mount_s3/<your file name>'
  • Finally, you can convert your file into a Spark DataFrame. Follow the given command:
aws_s3_df = spark.read.load("/mnt/mount_s3/<File Name.csv>", format="csv", inferSchema="true", header="true")
display(aws_s3_df)
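The URL-encoding step in the instructions above matters because AWS secret keys can contain characters such as `/` and `+`, which would corrupt the `s3a://<key>:<secret>@<bucket>` source URL. A standalone illustration (the key and secret values below are fake):

```python
from urllib import parse

# A fake secret containing characters that are unsafe inside a URL
fake_secret = "abc/def+ghi"
encoded = parse.quote(fake_secret, safe="")
print(encoded)  # abc%2Fdef%2Bghi

# The mount source URL embeds the encoded secret between ':' and '@'
source_url = "s3a://%s:%s@%s" % ("AKIAEXAMPLE", encoded, "my-bucket")
print(source_url)
```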

Limitations of Mounting an S3 Bucket on Databricks

  • Security Issues: All the users can have read and write access to all the objects in the bucket, which can lead to security issues.
  • Limitations in Options: Databricks does not support mounting an S3 bucket using an AWS key on GCP. Therefore, this option is only available on Azure and AWS.

To reverse the data integration process and connect Databricks to Amazon S3, refer to Databricks to S3.

Conclusion

This article highlighted six methods for moving data from Amazon S3 to Databricks tables. All of these methods can sync Amazon S3 to Databricks, but almost every one comes with limitations.

You can easily overcome these challenges by leveraging Hevo’s features to integrate data from Amazon S3 to Databricks. Hevo allows you to extract your data from 150+ data source connectors. With its highly interactive user interface, you can integrate data without prior coding knowledge.

Frequently Asked Questions (FAQs)

Q. What are the benefits of moving data to Databricks on AWS?

There are multiple benefits of using Databricks on AWS. Here are some of them:

  • AWS Databricks offers flexible pricing options for GPU instances, including reserved, on-demand, and spot instances.
  • You can use AWS Databricks for better performance on computational loads, with access to services like Amazon Elastic Inference, Amazon FSx for Lustre, and many more.
  • AWS Databricks offers better scalability options to scale your resources up and down quickly.
Suraj Kumar Joshi
Freelance Technical Content Writer, Hevo Data

Suraj is a technical content writer specializing in AI and ML technologies, who enjoys creating machine learning models and writing about them.
