Databricks is a popular Cloud-based solution that offers Data Warehousing and Analytics services. Businesses worldwide leverage this scalable tool to store petabytes of data, and it is the go-to choice of Big Data professionals because of its flexibility and vast collaboration options. However, storing and analyzing all of your data directly in a Data Warehouse can be costly, so companies often look for other platforms that can store and process portions of their Databricks data at a lower cost. One such tool is Amazon S3.

Amazon S3 is an online service that offers flexible storage to companies. Features such as granular access control and metadata tagging make it a popular choice among Data Analysts. Today, companies transfer information from Databricks to S3 to take advantage of scalable storage space at a lower price.

This article will introduce you to Databricks and Amazon S3 along with their unique features. It will also provide you with 3 easy steps that you can use to set up the Databricks S3 Integration in real time. Read along to understand the importance of connecting Databricks to S3!

What is Databricks?

Databricks is a robust Data Engineering tool that leverages Cloud-based technology to process and transform vast datasets. It also allows you to explore data at a granular level using its Machine Learning features. Microsoft has also partnered with Databricks to offer Azure Databricks, making it one of the newest Big Data tools in the Microsoft Cloud environment.

Databricks is critical for businesses as it allows organizations to perform Advanced Data Analysis by unifying Machine Learning with Extract Load Transform (ELT) processes. This tool operates in a distributed manner, meaning that it segments the workload into subtasks and assigns them to different processors. This allows you to scale your work according to necessity. Furthermore, Databricks also simplifies high-end Data Processing tasks and automates Machine Learning models to reduce your work complexity.  

Key Features of Databricks

Databricks offers the following unique features for managing your valuable data:  

  • High Data Compression: Databricks relies on the unified Spark engine for large-scale Data Compression tasks. It also provides fast Data Streaming and Query Processing to simplify your data management tasks and create a developer-friendly environment.
  • Vast Collaborations: Databricks facilitates collaboration between stakeholders, especially when they work with different programming languages. For instance, developers using Python, SQL, R, and Scala can easily combine their efforts on Databricks’s interactive platform.
  • Robust Security: Databricks contains multiple layers of data security that provide you with adequate controls to regulate data access. Identity Management and Data Encryption are some examples of the security measures adopted by Databricks.
Simplify Data Streaming Using Hevo’s No Code Data Pipeline

Hevo Data, a Fully-managed Data Pipeline platform, can help you automate, simplify & enrich your data replication process in a few clicks. With Hevo’s wide variety of connectors and blazing-fast Data Pipelines, you can extract & load data from 150+ Data Sources such as S3 straight into Databricks or any other Data Warehouse of your choice. To further streamline and prepare your data for analysis, you can process and enrich raw granular data using Hevo’s robust & built-in Transformation Layer without writing a single line of code!

Hevo is the fastest, easiest, and most reliable data replication platform that will save your engineering bandwidth and time multifold. Try our 14-day full-access free trial today to experience entirely automated, hassle-free Data Replication!

What is Amazon S3?

Amazon Simple Storage Service (S3) is an online storage service that provides you with collaborative and easy-to-use data storage facilities. Amazon S3 was built to simplify online data computing, storage, and retrieval. This Amazon platform allows you to access any chunk of data, from anywhere, over the Internet. 

Similar to other fully managed Amazon services, S3 provides a layer of abstraction between the user and operations such as scaling and resource pre-provisioning. This implies that you do not have to worry about maintaining any of these activities; you only pay for the storage capacity that you use.

Key Features of Amazon S3

The following features of Amazon S3 make it popular in today’s market:

  • Metadata Transfer: Amazon S3 lets you attach metadata tags to the objects that are stored in or transferred through its storage. This is beneficial when the data is complex and requires extra information for user comprehension (see the sketch after this list).
  • Access Control: Amazon S3 provides strict access control over your data and thwarts attacks involving unauthorized access. Moreover, its access control protocols are adequate to meet standard compliance requirements.
  • Granular Control: Amazon S3 facilitates the complex process of running Big Data Analytics, Data Monitoring, etc., by providing you with granular access control. This allows you to access and utilize data at various levels of depth based on your requirements.
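
For illustration, here is a minimal Python sketch of attaching metadata tags to an S3 object and reading them back, using the boto3 AWS SDK (which is not covered elsewhere in this guide); the bucket name, key, and tag values are placeholders:

import boto3

# Assumes AWS credentials are already configured in your environment.
s3 = boto3.client("s3")

# Upload an object with custom metadata tags attached to it.
s3.put_object(
    Bucket="my-example-bucket",
    Key="reports/summary.csv",
    Body=open("summary.csv", "rb"),
    Metadata={"department": "finance", "owner": "data-team"},
)

# Retrieve the metadata later without downloading the object body.
response = s3.head_object(Bucket="my-example-bucket", Key="reports/summary.csv")
print(response["Metadata"])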

How to Set Up Databricks S3 Integration?

You can use the following steps to set up the Databricks S3 integration and analyze your data without any hassle:

Step 1: Mount an S3 Bucket to Establish Databricks S3 Connection

This step requires you to mount an S3 bucket by using the Databricks File System (DBFS). Since the mount is actually a pointer to a location in S3, the data sync is never performed locally.

Now, to connect Databricks to S3, you can use an AWS instance profile for mounting an S3 bucket. The permissions contained in the instance profile determine the extent of access a user has to the objects in the S3 bucket. For instance, if the profile grants write access, your users can leverage the mount point to write objects directly to the S3 bucket. To mount the S3 bucket, run the following code on a cluster configured with that instance profile:

# Replace the placeholders with your bucket name and the desired mount point.
aws_bucket_name = "<aws-bucket-name>"
mount_name = "<mount-name>"

# Mount the bucket under /mnt/<mount-name> and list its contents to verify the mount.
dbutils.fs.mount("s3a://%s" % aws_bucket_name, "/mnt/%s" % mount_name)
display(dbutils.fs.ls("/mnt/%s" % mount_name))

Alternatively, to set up Databricks S3 integration, you can leverage AWS keys to mount a bucket as follows:

# Fetch the AWS access and secret keys from a Databricks secret scope named "aws".
access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
# URL-encode any "/" characters so the secret key can be embedded in the mount URI.
encoded_secret_key = secret_key.replace("/", "%2F")
aws_bucket_name = "<aws-bucket-name>"
mount_name = "<mount-name>"

# Mount the bucket using the keys and list its contents to verify the mount.
dbutils.fs.mount("s3a://%s:%s@%s" % (access_key, encoded_secret_key, aws_bucket_name), "/mnt/%s" % mount_name)
display(dbutils.fs.ls("/mnt/%s" % mount_name))

This approach allows every user to get both read and write access for each object present in your S3 bucket.

Now, once you have created a mount point for your Cluster using any of the above methods, users of that cluster can access and use the mount point immediately. If you wish to use that mount point in a different cluster after connecting Databricks to S3, you will need to run the following code on that particular cluster:

dbutils.fs.refreshMounts()
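
If you want to double-check which buckets are currently mounted before moving on, a quick way (shown here as a minimal sketch) is to list the cluster's mount points:

# List all mount points visible to the cluster and confirm that the new S3 mount appears.
for mount in dbutils.fs.mounts():
    print(mount.mountPoint, "->", mount.source)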

Step 2: Read/Write S3 Data Buckets for Databricks Data

Once the Databricks S3 connection is in place, you can utilize local file paths to access S3 objects using the following code:

df = spark.read.format("text").load("/mnt/%s/..." % mount_name)

or

df = spark.read.format("text").load("dbfs:/mnt/%s/..." % mount_name)
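
Writing back to the mounted bucket works in the same way through local file paths. The snippet below is a minimal sketch, assuming the DataFrame df read above and a hypothetical output folder name:

# Write the DataFrame back to the mounted S3 bucket.
# "output" is a placeholder folder name used only for illustration.
df.write.format("text").save("/mnt/%s/output" % mount_name)
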
What Makes Hevo’s Data Streaming and Loading Unique?

Providing a high-quality ETL solution can be a difficult task if you have a large volume of data. Hevo’s automated, No-code platform empowers you with everything you need for a smooth data replication experience.

Check out what makes Hevo amazing:

  • Fully Managed: Hevo requires no management and maintenance as it is a fully automated platform.
  • Data Transformation: Hevo provides a simple interface to perfect, modify, and enrich the data you want to transfer.
  • Faster Insight Generation: Hevo offers near real-time data replication so you have access to real-time insight generation and faster decision making. 
  • Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
  • Scalable Infrastructure: Hevo has in-built integrations for 150+ sources (with 40+ free sources) that can help you scale your data infrastructure as required.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Sign up here for a 14-day free trial!

Step 3: Unmount the S3 Bucket

Unmounting an S3 bucket is simple, as you only need to run a single command, as shown below:

# Unmount the bucket created earlier (mount_name from Step 1).
dbutils.fs.unmount("/mnt/%s" % mount_name)

That’s it! Your Databricks S3 integration is in place and you can now easily transfer your Databricks data to Amazon S3 in real-time.

Step 4: Access S3 Buckets Directly (Optional Alternative)

This alternative method allows you to leverage Spark to access S3 bucket objects directly via AWS keys, without mounting the bucket. It relies on Databricks secrets to store the access keys. You can use the following Python code to implement this alternative way of setting up the Databricks S3 integration:

access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)

# If you are using Auto Loader file notification mode to load files, provide the AWS Region ID.
aws_region = "aws-region-id"
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3." + aws_region + ".amazonaws.com")

myRDD = sc.textFile("s3a://%s/.../..." % aws_bucket_name)
myRDD.count()
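
Once the Hadoop configuration above is set, you are not restricted to the RDD API. As a rough sketch, the same bucket can also be read through the DataFrame reader (the path below is a placeholder):

# Read a folder from the S3 bucket with the DataFrame API instead of RDDs.
df = spark.read.format("text").load("s3a://%s/path/to/data" % aws_bucket_name)
df.count()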

Advantages of Databricks S3 Integration

Modern businesses seek easy ways to establish the Databricks S3 connection because of the following advantages that this integration provides:

  • Amazon S3’s opt-in versioning feature automatically backs up modified (or deleted) files. This allows you to perform easy data recovery in case of a server crash or accidental deletion. Moreover, you can opt for Cross-Region data replication to further enhance your data availability (a minimal sketch of enabling versioning follows this list).
  • One of the key advantages of setting up the Databricks S3 integration is the flexibility and pay-as-you-go pricing model offered by the Cloud Data Lake. These features allow you to scale up your work anytime and ensure that you are never charged extra money for hidden services.
  • Transferring your data using the Databricks S3 integration also provides one of the best performance-per-dollar ratios, especially when you are dealing with petabytes of data. Furthermore, S3 requires only minimal human resources from your end, whereas other storage options usually require a team of Data Engineers to maintain the storage facility.
  • Databricks offers you an integrated data architecture on S3 that is capable of managing Machine Learning algorithms, SQL Analytics, and Data Science. This way, the Databricks S3 integration allows you to address all of your analytical and AI-based use cases on a single platform. Furthermore, this integration helps you achieve Data Warehouse-level performance at Data Lake economics.
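
As a rough illustration of the opt-in versioning mentioned in the first point above, the following boto3 sketch enables versioning on a bucket (the bucket name is a placeholder, and AWS credentials are assumed to be configured):

import boto3

s3 = boto3.client("s3")

# Turn on versioning so that modified or deleted objects keep recoverable previous versions.
s3.put_bucket_versioning(
    Bucket="my-example-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)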

Conclusion

The article introduced you to Databricks and Amazon S3 along with their key features. It also provided you with a step-by-step guide to set up the Databricks S3 integration. Furthermore, the blog discussed the multiple advantages of connecting these 2 tools. Using the 3 simple steps explained in this blog, you can seamlessly implement and utilize the Databricks S3 connection for your business.

Visit our Website to Explore Hevo

Now, to run queries or perform Data Analytics on your raw data, you first need to export this data to a Data Warehouse. This will require you to custom-code complex scripts to develop the ETL processes. Hevo Data can automate your data transfer process, hence allowing you to focus on other aspects of your business like Analytics, Customer Management, etc. This platform allows you to transfer data from 150+ sources such as Amazon S3 to Data lakes such as Databricks, and Cloud-based Data Warehouses like Amazon Redshift, Snowflake, Google BigQuery, etc. It will provide you with a hassle-free experience and make your work life much easier.

Want to take Hevo for a spin? Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand.

Share your understanding of Databricks S3 Integration in the comments below!

Former Research Analyst, Hevo Data

Abhinav is a data science enthusiast who loves data analysis and writing technical content. He has authored numerous articles covering a wide array of subjects in data integration and infrastructure.
