Working with Spark Redshift Connector: Made Easy 101

on Amazon Redshift, Apache Spark, Data Integration, Data Warehouses, ETL • January 28th, 2022 • Write for Hevo

In this day and age, data has become an invaluable tool in the business and corporate world. It has become so prominent that numerous fields related to data emerged. Some of these fields include Data Science and Data Analysis. They rely on data and help draw conclusions that would help companies. With this in mind, these individuals need storage systems to house this information. This brings us to Data Warehousing tools, and more specifically, AWS Redshift. This is one of the most prominent Data Warehousing tools on the market today. Accordingly, it can handle data on the exabytes scale. While it has prominent features, it can be connected to other platforms for additional functionality such as Apache Spark SQL

This article gives you a brief but sufficient introduction to working with the Spark Redshift Connector. By the end, you should have a rough idea of what this two software and the benefits you stand to gain by using them together. Have a read below to learn how to set up Spark Redshift Connector.

Table of Contents

Introduction to Amazon Redshift

Spark Redshift Connector - Redshift Logo
Image Source

Amazon Redshift, AWS Redshift for short, is a popular data warehousing solution capable of handling data on an exabytes scale. It has become one of the most popular leaders in the Data Warehouse category due to the numerous benefits it offers. 

Amazon Redshift leverages Massively Parallel Processing (MPP) technology, which allows it to perform complex operations on large data volumes at fast speed. It can work with data on the exabytes scale usually denoted by 1018. That’s pretty impressive!

Data stored in Amazon Redshift is encrypted which provides an extra layer of security for users. It offers various features to enable users to import data easily with just few clicks.

Key Features of Amazon Redshift

Below are some of the key features that have enabled Amazon Redshift to stand apart from the pack: 

  • Automation Capabilities: With AWS Redshift, you do not have to perform repetitive tasks such as generating daily, weekly, or monthly reports as the platform has automation capabilities. 
  • Intelligent Optimization: When querying large data sets, there are several ways you can query information using the same parameters. AWS Redshift helps in such situations by providing tools and data to improve queries. The software will also offer tips to enhance the database automatically. 
  • SQL Friendly: AWS Redshift is based on PostgreSQL, meaning all SQL queries can work on the platform. 
  • Data Encryption: This is an extra security feature part of the Redshift operation. The user can decide which data needs encryption and which does not. 

Introduction to Apache Spark 

Spark Redshift Connector - Apache Spark Logo
Image Source

Apache Spark is an open-source, distributed processing system used for large data workloads. Numerous companies have embraced this software due to its numerous benefits such as speed. The platform utilizes RAM for data processing, making it much faster than disk drives. 

Now that you have a rough idea of Apache Spark, what does it entail? One of the most widely used components is Apache Spark SQL. Simply put, this is Apache Spark’s SQL wing, meaning it brings native support for SQL to the platform. Moreover, it streamlines the query process of data stored in RDDs and external sources. Other components include Spark Core, Spark Streaming, GraphX, and MLib

So what are the top features of the platform? Read on below to find out,

Key Features of Apache Spark

Some of the features of Apache Spark are listed below:

  • Lighting Fast Speed: As a Big Data Tool, Spark has to satisfy corporations’ needs of processing big data at high speed. Accordingly, the tool depends on Resilient Distributed Dataset (RDD),  where data is transparently stored on the memory, and read/write operations are carried out when needed. The benefit? is Disc read and write time is reduced, increasing speed. 
  • Supports Sophisticated Analytics: Apart from map and reduce operations, Spark supports SQL queries, data streaming, and advanced analytics. Its high-level components such as MLib, Spark Streaming, and Spark SQL make this possible. 
  • Real-Time Stream Processing: This tool is designed to handle real-time data streaming. Spark can recover lost work and deliver high-level functionality without requiring extra code. 
  • Usability: The software allows you to write scalable applications in several languages, including Java, Python, R, and Scala. Developers can use the language to query data from these languages’ shells. 

Why should you work with Spark Redshift Connector?

Image Source

By now, you should have a rough idea of what these two software are. Now, why should you use them together? What benefits do you get by using Redshift and Spark?

By using these two tools together, you can load data back and forth from Redshift to Spark. The benefit? You can utilize Sparks lightning speeds and its components for various functions on your data and use Redahifts querying capabilities simultaneously. So, how exactly do you go about this process?

You are going to utilize the Spark Redshift Connector. This is a library that enables you to load data into Spark SQL DataFrames from Redshift and write them back to Redshift tables. For installation, you can head on to GitHub and see how to implement this tool.

Simplify Redshift ETL and Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps load data from any data source such as Databases, SaaS applications, Cloud Storage, SDK,s, and Streaming Services and simplifies the ETL process. It supports 100+ Data Sources including 40+ Free Sources. It loads the data onto the desired Data Warehouse such as Amazon Redshift and transforms it into an analysis-ready form without having to write a single line of code. Hevo’s fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data.

GET STARTED WITH HEVO FOR FREE

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled securely and consistently with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.

Simplify your ETL & Data Analysis with Hevo today! 

SIGN UP HERE FOR A 14-DAY FREE TRIAL!

Steps to Set Up Spark Redshift Connector

Now, let’s get to the actual process of loading data from Redshift to Spark and vice versa. Before using the mentioned library, we need to perform a few simple tasks. Follow the steps below:

Step 1: Add JAR File for Spark Redshift Connector

You need to add the Jar file to the spark-submit command. Enter the following command:

--jars https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.36.1060/RedshiftJDBC42-no-awssdk-1.2.36.1060.jar

Keep in mind that the path should be of the same location as the Redshift Jar file. 

Step 2: Add Packages for Spark Redshift Connector

Next up, you need to add the package names in the spark-submit command as illustrated below: 

--packages org.apache.spark:spark-avro_2.11:2.4.2,io.github.spark-redshift-community:spark-redshift_2.11:4.0.1

Step 3: Read & Write Data using Spark Redshift Connector

After finishing up with the outlined steps, you can now use the Spark Redshift Connector. The pyspark code below is used to read and write data:

# Create or get a Spark Session with the name spark
spark = SparkSession 
    .builder() 
    .appName("sparkRedshiftExample") 
    .getOrCreate()
# Read data from a table
df_read_1 = spark.read 
    .format("io.github.spark_redshift_community.spark.redshift") 
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") 
    .option("dbtable", "table_name") 
    .option("tempdir", "s3n://path/for/temp/data") 
    .load()
# Read data from a query
df_read_2 = spark.read 
    .format("io.github.spark_redshift_community.spark.redshift") 
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") 
    .option("query", "select x, count(*) table_name group by x") 
    .option("tempdir", "s3n://path/for/temp/data") 
    .load()
# Write back to a table
df_read_1.write 
  .format("io.github.spark_redshift_community.spark.redshift") 
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") 
  .option("dbtable", "table_name") 
  .option("tempdir", "s3n://path/for/temp/data") 
  .mode("error") 
  .save()
# Using IAM Role based authentication
df_read_2.write 
  .format("io.github.spark_redshift_community.spark.redshift") 
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") 
  .option("dbtable", "table_name") 
  .option("tempdir", "s3n://path/for/temp/data") 
  .option("aws_iam_role", "your-role-here") 
  .mode("error") 
  .save()

Upon analyzing the code, you will realize that it follows a stepwise procedure. The first step is to create the Spark Session which is the entry point for the spark application. The driver uses the spark session to coordinate resources provided by the resource manager. 

After you create the Spark session inside the code, the read function is what you use to read data from Redshift Tables and assign it to the dataframe called df_read_1. Upon a closer look at the table, you will realize that instead of reading the entire table, you are reading the result of a query. Next up, the write function in the code is used to write data from the dataframe to the Redshift database. Note that the code specified the mode option, which is the mode used to write data on Redshift. 

In summary, the Pyspark code has two major components: the read section used to read data from the dataframe and the write option used to write data to Redshift.

Conclusion

This post taught you what AWS Redshift and Spark are and how and why you should use them together. Using the Spark Redshift Connector, you can load data from Redshift to Spark and write them back to Redshift. 

However, streaming data from various sources to Amazon Redshift can be quite challenging and cumbersome. If you are facing these challenges and are looking for some solutions, then check out a simpler alternative like Hevo.

Hevo Data is a No-Code Data Pipeline that offers a faster way to move data from 100+ Data Sources to Amazon Redshift, and other 40+ Free Sources, into your Data Warehouse to be visualized in a BI tool.

VISIT OUR WEBSITE TO EXPLORE HEVO

Want to take Hevo for a spin?

SIGN UP and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Feel free to share your experience with Spark Redshift Connector with us in the comments section below!

No-Code Data Pipeline For Your Amazon Redshift