Databricks is a popular Cloud-based solution that offers Data Warehousing and Analytics services. Businesses worldwide leverage this scalable tool to store petabytes of data. It is the go-to choice of Big Data professionals because of its flexibility and vast collaboration options.
However, analyzing data directly in a Data Warehouse can be costly, and therefore companies seek other platforms that can store and process portions of their Databricks data. One such tool is Amazon S3.
Amazon S3 is an online service that offers flexible storage to companies. Its granular access control, metadata tagging, and other such features make it a popular choice among Data Analysts. Today, companies transfer information from Databricks to S3 to utilize scalable storage space at a lower price.
This article will introduce you to Databricks and Amazon S3 along with their unique features. It will also provide you with 3 easy steps that you can use to set up the Databricks S3 Integration in real time. Read along to understand the importance of connecting Databricks to S3!
What is Databricks?
Databricks is a robust Data Engineering tool that leverages Cloud-based technology to process and transform vast datasets. It also allows you to explore data at a granular level using its Machine Learning features. Microsoft has also partnered with Databricks to offer Azure Databricks, one of the newest Big Data tools in the Microsoft Cloud environment.
Databricks is critical for businesses as it allows organizations to perform Advanced Data Analysis by unifying Machine Learning with Extract Load Transform (ELT) processes.
This tool operates in a distributed manner, meaning that it segments the workload into subtasks and assigns them to different processors. This allows you to scale your work according to necessity.
Furthermore, Databricks also simplifies high-end Data Processing tasks and automates Machine Learning models to reduce your work complexity.
Key Features of Databricks
Databricks offers the following unique features for managing your valuable data:
- High Data Compression: Databricks relies on the unified Spark engine for large-scale Data Compression tasks. It also provides fast Data Streaming and Query Processing to simplify your data management tasks and create a developer-friendly environment.
- Vast Collaborations: Databricks facilitates collaboration among stakeholders, especially when they work with different programming languages. For instance, developers using Python, SQL, R, and Scala can easily combine their efforts on Databricks’s interactive platform.
- Robust Security: Databricks contains multiple layers of data security that provide you with adequate controls to regulate data access. Identity Management and Data Encryption are some examples of the security measures adopted by Databricks.
What is Amazon S3?
Amazon Simple Storage Service (S3) is an online storage service that provides you with collaborative and easy-to-use data storage facilities. Amazon S3 was built to simplify online data computing, storage, and retrieval. This Amazon platform allows you to access any chunk of data, from anywhere, over the Internet.
Similar to other fully managed Amazon services, S3 provides a layer of abstraction between software and the user for operations such as scaling, resource pre-provisioning, etc.
This implies that you do not have to worry about maintaining any of the above activities; you just need to pay for the storage capacity that you wish to utilize.
Key Features of Amazon S3
Amazon S3’s following features make it popular in today’s market:
- Metadata Transfer: Amazon S3 lets you append metadata tags to the objects that are stored in or transferred via its storage. This is beneficial in situations where the data is complex and requires extra information for user comprehension (a short example follows this list).
- Access Control: Amazon S3 provides strict access control over your data and thwarts attacks involving unauthorized access. Moreover, its access control protocols help you meet standard compliance requirements.
- Granular Control: Amazon S3 facilitates the complex process of running Big Data Analytics, Data Monitoring, etc., by providing you with granular access control. This allows you to access and utilize data at various levels of depth based on your requirements.
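To illustrate the metadata feature mentioned above, here is a minimal, hypothetical boto3 snippet (run outside Databricks) that uploads an object with user-defined metadata tags; the bucket name, object key, and tag values are placeholders and not part of the integration steps below:
import boto3
s3_client = boto3.client("s3")
# Attach user-defined metadata tags to the object at upload time
s3_client.put_object(Bucket="<example-bucket>", Key="<example-key>", Body=b"sample data", Metadata={"source": "databricks", "owner": "analytics-team"})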
How to Set Up Databricks S3 Integration?
You can use the following steps to set up the Databricks S3 integration and analyze your data without any hassle:
Step 1: Mount an S3 Bucket to Establish Databricks S3 Connection
This step requires you to mount an S3 bucket by using the Databricks File System (DBFS). Since the mount is actually a pointer to a location in S3, the data is never synced locally.
Now, to connect Databricks to S3, you can use an AWS instance profile for mounting an S3 bucket. The permissions contained in the instance profile determine the extent of access a user has to the S3 bucket objects.
For instance, if the profile contains write access, your users can leverage the mount point to write objects directly to the S3 bucket. To implement the Databricks S3 mount, run the following code on your cluster:
# Name of the S3 bucket to mount and the DBFS mount point to create
aws_bucket_name = "<aws-bucket-name>"
mount_name = "<mount-name>"
# Mount the bucket via the s3a connector and list its contents to verify
dbutils.fs.mount("s3a://%s" % aws_bucket_name, "/mnt/%s" % mount_name)
display(dbutils.fs.ls("/mnt/%s" % mount_name))
Alternatively, to set up Databricks S3 integration, you can leverage AWS keys to mount a bucket as follows:
# Read the AWS keys from a Databricks secret scope named "aws"
access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
# URL-encode the secret key so any "/" characters do not break the mount URI
encoded_secret_key = secret_key.replace("/", "%2F")
aws_bucket_name = "<aws-bucket-name>"
mount_name = "<mount-name>"
# Mount the bucket using the AWS keys and list its contents to verify
dbutils.fs.mount("s3a://%s:%s@%s" % (access_key, encoded_secret_key, aws_bucket_name), "/mnt/%s" % mount_name)
display(dbutils.fs.ls("/mnt/%s" % mount_name))
With this approach, every user of the cluster gets both read and write access to every object in your S3 bucket.
Now, once you have created a mount point for your Cluster using any of the above methods, users of that cluster can access and use the mount point immediately.
If you wish to use that mount point in a different cluster after connecting Databricks to S3, you will need to run the following code on that particular cluster:
dbutils.fs.refreshMounts()
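To double-check which mount points a cluster can see, you can also list them; this is an optional verification step and not required for the integration:
# List all DBFS mount points visible to the current cluster
display(dbutils.fs.mounts())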
Step 2: Read/Write S3 Data Buckets for Databricks Data
Once the Databricks S3 connection is in place, you can utilize local file paths to access S3 objects using the following code:
df = spark.read.format("text").load("/mnt/%s/..." % mount_name)
or
df = spark.read.format("text").load("dbfs:/mnt/%s/..." % mount_name)
Step 3: Unmount the S3 Bucket
Unmounting an S3 bucket is simple, as you need to run just a single command, as shown below:
dbutils.fs.unmount("/mnt/%s" % mount_name)
That’s it! Your Databricks S3 integration is in place, and you can now easily transfer your Databricks data to Amazon S3 in real time.
Step 4: Access S3 Buckets Directly (Optional Alternative)
This alternative method allows you to leverage Spark to access S3 bucket objects directly via AWS keys. This method relies on Databricks secrets to store the access keys.
You can use the following Python code to implement this alternative method of setting up the Databricks S3 integration:
# Read the AWS keys from the "aws" secret scope and register them with the
# Hadoop configuration used by the s3a connector
access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)
# If you are using Auto Loader file notification mode to load files, provide the AWS Region ID.
aws_region = "aws-region-id"
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3." + aws_region + ".amazonaws.com")
myRDD = sc.textFile("s3a://%s/.../..." % aws_bucket_name)
myRDD.count()
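The same credentials can also be used to read S3 data into a DataFrame rather than an RDD. The snippet below is a minimal sketch in which the object path is a placeholder:
# Read the S3 objects into a DataFrame using the s3a connector
df = spark.read.format("text").load("s3a://%s/<path-to-data>" % aws_bucket_name)
df.count()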
Advantages of Databricks S3 Integration
Modern businesses seek easy ways to establish the Databricks S3 connection because of the following advantages that this integration provides:
- Amazon S3’s opt-in versioning feature automatically backs up modified (or deleted) files. This allows you to perform easy data recovery in case of a server crash or accidental deletion. Moreover, you can opt for Cross-Region Replication to further enhance your data availability.
- One of the key advantages of setting up the Databricks S3 integration is the flexibility and pay-as-you-go pricing model offered by the Cloud Data Lake. These features allow you to scale up your work anytime and ensure that you are never charged extra money for hidden services.
- Transferring your data using the Databricks S3 integration also provides you with one of the best performance-per-dollar ratios, especially when you are dealing with petabytes of data. Furthermore, S3 requires minimal human resources on your end, compared to other storage options that usually require a team of Data Engineers to maintain the storage facility.
- Databricks offers you an integrated data architecture on S3 that is capable of managing Machine Learning algorithms, SQL Analytics, and Data Science workloads. This way, the Databricks S3 integration allows you to address all of your analytical and AI-based use cases on a single platform. Furthermore, this integration helps you achieve Data Warehouse-level performance at Data Lake economics.
Conclusion
The article introduced you to Databricks and Amazon S3 along with their key features. It also provided you with a step-by-step guide to set up the Databricks S3 integration.
Furthermore, the blog discussed the multiple advantages of connecting these two tools. Using the 3 simple steps explained in this blog, you can seamlessly implement and utilize the Databricks S3 connection for your business.
Abhinav Chola, a data science enthusiast, is dedicated to empowering data practitioners. After completing his Master’s degree in Computer Science from NITJ, he joined Hevo as a Research Analyst and works towards solving real-world challenges in data integration and infrastructure. His research skills and ability to explain complex technical concepts allow him to analyze complex data sets, identify trends, and translate his insights into clear and engaging articles.