Apache Spark is one of the most popular all-purpose, distributed data-processing engines, used regularly by many large enterprises for a wide variety of use cases. On top of its core engine, Spark includes several libraries for SQL, Machine Learning, Graph Computation, and Stream Processing that can be used together in a single application.

PostgreSQL is a free and Open-Source Relational Database Management System (RDBMS) that stresses extensibility and SQL conformance. It stores data in structured objects rather than documents and uses standard SQL syntax and style. It helps you manage data regardless of how large, small, or diverse the dataset is, so you can use it to store or analyze your big data, and there are various ways to do so, including Apache Spark.

In this article, you will learn how to set up Spark PostgreSQL Integration, along with the steps you need to follow to make the process run smoothly.

What is PostgreSQL?

PostgreSQL, sometimes known as Postgres, is a free and Open-Source Relational Database Management System (RDBMS) that emphasizes extensibility and SQL conformance. It was created at the University of California, Berkeley, as the successor to the Ingres database, and was initially released on July 8, 1996. PostgreSQL stores data in structured objects rather than documents and employs conventional SQL syntax and formatting.

It’s written in C and has a monolithic design, meaning all of its components are tightly linked and work together in an orderly fashion. It offers free community support, as well as commercial support options for paying customers. It is widely used in the Healthcare, Finance, and Industrial sectors thanks to its dependable backup and recovery capabilities.

Sync your PostgreSQL Server Data with Hevo

Unlock the power of your PostgreSQL data by seamlessly connecting it to various destinations, enabling comprehensive analysis in tools like Google Data Studio.

Check out Why Hevo is the right choice for you!

  • No-Code Platform: Easily set up and manage your data pipelines without any coding.
  • Auto-Schema Mapping: Automatically map schemas for smooth data transfer.
  • Pre and Post-Load Transformations: Transform your data at any stage of the migration process.
  • Real-Time Data Ingestion: Keep your data up-to-date with real-time synchronization.

Join over 2000 happy customers who trust Hevo for their data integration needs and experience why we are rated 4.7 on Capterra.

Get Started with Hevo for Free

Key Features of PostgreSQL

PostgreSQL offers a few unique features that make it a viable alternative to other mainstream RDBMSs. Some of these qualities are as follows:

  • It complies with the ACID (Atomicity, Consistency, Isolation, and Durability) standards and runs on a variety of operating systems.
  • Storage is dependable thanks to its many fail-safes and redundancy mechanisms.
  • Anyone can use all of its features for free because it is Open-Source.
  • It has Limited Scalability, as its processing capacity is determined by the single machine it runs on.
  • PostgreSQL is capable of handling a diverse set of data, documents, and customizations.
  • It has a Monolithic Architecture, which implies that all of the components automatically work together.
  • It offers a robust Access Control System that includes features like row and column level security as well as multi-factor authentication with certificates.
  • It’s ideal for transactional workflows, such as those found in banking systems, risk assessments, BI (Business Intelligence), and powering a wide range of business applications.

What is Apache Spark?

Hadoop is a popular tool in the industry for analyzing enormous amounts of data. This is because the Hadoop framework is built on the MapReduce programming style, which enables a Scalable, Flexible, Fault-Tolerant, and Cost-Effective Computing Solution. The key goal is to keep processing huge Datasets as fast as possible in terms of Query Response Time and Program Execution Time.

Spark was introduced by the Apache Software Foundation to speed up Hadoop’s computational processes. Apache Spark is an Open-Source, Scalable, and Distributed General-Purpose Computing Engine for processing and analyzing huge data files from a variety of sources, including HDFS, S3, Azure, and others.

Key Features of Apache Spark

  • Lightning-Fast Speed: Spark, as a Big Data tool, must meet organizations’ demands for high-speed processing of large amounts of data. To do so, it relies on the Resilient Distributed Dataset (RDD), which stores data transparently in memory and touches the disk only when required. The advantage is that disk read and write times are reduced, resulting in increased speed.
  • Supports Advanced Analytical Techniques: In addition to map and reduce operations, Spark supports SQL queries, data streaming, and advanced analytics. This is made possible by high-level components such as MLlib, Spark Streaming, and Spark SQL.
  • Real-Time Stream Processing: This tool is designed to handle real-time data streaming. Spark is capable of recovering lost work and delivering high-level functionality without the need for additional code.
  • Usability: The software allows you to develop scalable apps in Java, Python, R, and Scala, among other languages. Developers can query data from these languages’ shells.

What are the Key Libraries Supported by Apache Spark?

Apache Spark comes with a number of libraries, including:

  • Spark SQL: A module for working with structured data using SQL or the DataFrame API. It gives you a unified interface to Hive, Avro, Parquet, ORC, JSON, and JDBC data sources, among others, and you can even join data across different sources (a short sketch follows this list).
  • Spark Streaming: It makes it simple to create scalable, fault-tolerant streaming applications by providing a language-integrated API for stream processing, allowing you to construct streaming jobs in the same manner that batch jobs are written. Java, Scala, and Python are all supported. Spark Streaming recovers both lost work and operator state right out of the box, so you don’t have to write any additional code. It allows you to reuse the same code for batch processing, joining streams to historical data, and doing ad-hoc stream state queries.
  • MLlib (Machine Learning): Spark’s scalable Machine Learning library. MLlib includes high-quality algorithms that use iteration and can produce better results than the one-pass approximations sometimes used with MapReduce.
  • GraphX: A Graph API for graph-parallel computation. GraphX unifies ETL, exploratory analysis, and iterative graph computation in a single system. You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms using the Pregel API.
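
To give a concrete feel for the Spark SQL API described above, here is a minimal sketch that loads a JSON file into a DataFrame and queries it both with the DataFrame API and with plain SQL. The file name people.json and its name/age columns are hypothetical placeholders, not part of this article's setup.

// Minimal Spark SQL sketch (people.json and its name/age columns are hypothetical).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkSqlSketch").getOrCreate()

// Load semi-structured JSON into a DataFrame.
val peopleDf = spark.read.json("people.json")

// Query with the DataFrame API...
peopleDf.filter("age > 21").show()

// ...or register a temporary view and use plain SQL.
peopleDf.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()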

How to do Spark PostgreSQL Integration?

Step 1: Install the PostgreSQL JDBC Driver

The first step in Spark PostgreSQL Integration is to install and run the Postgres server, for example on localhost on port 7433. Create a test_db database with two tables, person and class:

test_db=# create table person(name text, age int);
test_db=# insert into person values('shu', 21);
test_db=# insert into person values('zhe', 20);
test_db=# select * from person;
name | age
-----+-----
shu  | 21
zhe  | 20
(2 rows)

test_db=# create table class(name text, class text);
test_db=# insert into class values ('shu', 'nursing'), 
('shu', 'coffee'), 
('zhe', 'maths'), 
('zhe', 'climbing');

test_db=# select * from class;
 name | class
------+----------
 shu  | nursing
 shu  | coffee
 zhe  | maths
 zhe  | climbing
(4 rows)
  • A person table is created with columns for name (text) and age (integer), and two records are inserted.
  • The first query selects all entries from the person table, displaying the names and ages of two individuals.
  • A class table is created with columns for name (text) and class (text).
  • Multiple records are inserted into the class table, associating each person with different classes.
  • A select query retrieves all entries from the class table, showing the names and their respective classes.

You will also need the PostgreSQL JDBC driver so that Apache Spark can connect to your PostgreSQL database over JDBC. Download it as follows:

$ wget https://jdbc.postgresql.org/download/postgresql-42.2.6.jar

Step 2: Install Apache Spark Packages

The next step in Spark PostgreSQL Integration is to set up Apache Spark. Download and extract the Spark packages for this:

$ wget http://us.mirrors.quenda.co/apache/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
$ tar zxvf spark-2.4.3-bin-hadoop2.7.tgz
$ cd spark-2.4.3-bin-hadoop2.7/
  • The wget command downloads the Apache Spark version 2.4.3 archive file from the specified URL.
  • The tar command extracts the contents of the downloaded spark-2.4.3-bin-hadoop2.7.tgz file.
  • The cd command changes the current directory to the extracted Spark folder spark-2.4.3-bin-hadoop2.7.
  • After executing these commands, the user is ready to work with Apache Spark from the specified directory.

Step 3: Execute Apache Spark Shell on your System

You will need Java installed on your Server/System to launch the Spark Shell for Spark PostgreSQL Integration:

$  yum install java

You can now execute your Spark Shell:

$ ./bin/spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://ApacheSpark1:4040
Spark context available as 'sc' (master = local[*], app id = local-1563907528854).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/
 
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.
 
scala>
  • The command ./bin/spark-shell launches the Apache Spark interactive shell for running Spark applications using Scala.
  • It shows that Spark is using its default logging configuration and sets the default log level to “WARN.”
  • The Spark web UI is accessible at http://ApacheSpark1:4040, where users can monitor the Spark application.
  • The Spark context (named ‘sc’) and session (named ‘spark’) are initialized, indicating the application is ready to execute Spark jobs.
  • Users can type Scala expressions in the shell for immediate evaluation, and help information can be accessed by typing :help.

The PostgreSQL JDBC Driver must be added to the Spark shell:

scala> :require /path/to/postgresql-42.2.6.jar
Added '/path/to/postgresql-42.2.6.jar' to classpath.
scala> import java.util.Properties
import java.util.Properties
  • The command :require /path/to/postgresql-42.2.6.jar adds the PostgreSQL JDBC driver JAR file to the classpath, enabling Spark to connect to a PostgreSQL database.
  • The message confirms that the specified JAR file has been successfully added to the classpath.
  • The command import java.util.Properties imports the Properties class from the Java standard library, allowing the user to create and manipulate a set of key-value pairs.
  • This setup is typically used to configure database connection properties, such as the database URL, username, and password.
  • With the JDBC driver and Properties class imported, the user can now establish a connection to a PostgreSQL database within the Spark shell.
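
As an optional sanity check (not required by Spark itself), you can confirm that the driver is now visible on the shell's classpath by loading the class by name:

scala> Class.forName("org.postgresql.Driver")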

Step 4: Add JDBC Driver Information in Spark

The final step in Spark PostgreSQL Integration is to add the following JDBC information for Spark to use:

scala> val url = "jdbc:postgresql://localhost:5432/testing"
url: String = jdbc:postgresql://localhost:5432/testing
scala> val connectionProperties = new Properties()
connectionProperties: java.util.Properties = {}
scala> connectionProperties.setProperty("Driver", "org.postgresql.Driver")
res6: Object = null
  • The line val url = "jdbc:postgresql://localhost:5432/testing" creates a string variable url that specifies the JDBC connection string to a PostgreSQL database named “testing” running on localhost at port 5432.
  • The command val connectionProperties = new Properties() initializes a new Properties object named connectionProperties, which will hold configuration settings for the database connection.
  • The line connectionProperties.setProperty("Driver", "org.postgresql.Driver") sets the JDBC driver property for the PostgreSQL database, allowing Spark to use the specified driver to connect to the database.
  • The result res6: Object = null indicates that setProperty returned null because no value was previously stored under the Driver key; the method returns the key’s previous value rather than nothing.
  • These steps are part of preparing to connect to a PostgreSQL database using Spark by defining the database URL and necessary connection properties.
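
Note that the snippet above only sets the driver class. If your PostgreSQL server requires authentication (as most do), you will also want to add user and password entries to the same Properties object before reading any data. A minimal sketch, where the username postgres and the password are placeholders for your own credentials:

scala> connectionProperties.setProperty("user", "postgres")     // database user (placeholder)
scala> connectionProperties.setProperty("password", "secret")   // password (placeholder)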

You have successfully set up Spark PostgreSQL Integration!

How to use Spark PostgreSQL Together? 

The following are the most common operations used when running Spark and PostgreSQL together. Let’s assume your PostgreSQL Cluster is already up and running. The operations below use a PostgreSQL 11 server running on CentOS 7.

Set up your PostgreSQL Database

Let’s start by creating a test database on your PostgreSQL server:

postgres=# CREATE DATABASE testing;
CREATE DATABASE
postgres=# \c testing
You are now connected to database "testing" as user "postgres".

Create Tables in your PostgreSQL Database

You’re going to make a new table called t1:

testing=# CREATE TABLE t1 (id int, name text);
CREATE TABLE

Insert Data into your PostgreSQL Tables

You can insert some data there:

testing=# INSERT INTO t1 VALUES (1,'name1');
INSERT 0 1
testing=# INSERT INTO t1 VALUES (2,'name2');
INSERT 0 1

You can check the data gathered:

testing=# SELECT * FROM t1;
 id | name
----+-------
  1 | name1
  2 | name2
(2 rows)

Create a Spark Dataframe for PostgreSQL Data

First, define the query that Spark will send to PostgreSQL. It must be wrapped as a subquery with an alias; here, query1 simply selects everything from your t1 table:

scala> val query1 = "(SELECT * FROM t1) as q1"
query1: String = (SELECT * FROM t1) as q1

You can then create the DataFrame as follows:

scala> val query1df = spark.read.jdbc(url, query1, connectionProperties)
query1df: org.apache.spark.sql.DataFrame = [id: int, name: string]

So, now you can do anything with this DataFrame:

scala> query1df.show()
+---+-----+
| id| name|
+---+-----+
|  1|name1|
|  2|name2|
+---+-----+
scala> query1df.explain
== Physical Plan ==
*(1) Scan JDBCRelation((SELECT * FROM t1) as q1) [numPartitions=1] [id#19,name#20] PushedFilters: [], ReadSchema: struct<id:int,name:string>

Run SQL Queries on your Spark Dataframe

Because the query string is sent to PostgreSQL as a subquery, you can pass any valid PostgreSQL query to spark.read.jdbc in the same way and let the database do part of the work. You can also run SQL inside Spark against the DataFrame you just created, as shown in the sketch below.
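
A minimal sketch of that second option, registering query1df as a temporary view (the view name t1_view is an arbitrary placeholder) and querying it with Spark SQL:

scala> query1df.createOrReplaceTempView("t1_view")
scala> spark.sql("SELECT name FROM t1_view WHERE id > 1").show()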

Verify Query Outputs in PostgreSQL

You can insert more rows on the PostgreSQL side and run the same DataFrame action again to confirm that the newly added values are returned as well.

PostgreSQL

testing=# INSERT INTO t1 VALUES (10,'name10'), (11,'name11'), (12,'name12'), (13,'name13'), (14,'name14'), (15,'name15');
INSERT 0 6
testing=# SELECT * FROM t1;
 id |  name
----+--------
  1 | name1
  2 | name2
 10 | name10
 11 | name11
 12 | name12
 13 | name13
 14 | name14
 15 | name15
(8 rows)

Spark

scala> query1df.show()
+---+------+
| id|  name|
+---+------+
|  1| name1|
|  2| name2|
| 10|name10|
| 11|name11|
| 12|name12|
| 13|name13|
| 14|name14|
| 15|name15|
+---+------+

How to use Pushdown Operators with Datasets and DataFrames in Spark PostgreSQL?

When the computation is expressed with the Dataset or DataFrame API in Spark PostgreSQL, it turns out that simple filters are pushed down to the database, but more sophisticated ones are not. Joins are not pushed down either.
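
The examples below use a personDf DataFrame built from the person table created in Step 1. As a reminder, such a DataFrame would be loaded like this, assuming url and connectionProperties point at the database that holds person:

scala> val personDf = spark.read.jdbc(url, "(select * from person) as person", connectionProperties)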

Simple Filters:

scala> personDf.where("age != 23").explain
== Physical Plan ==
*(1) Scan JDBCRelation((select * from person) as person)
  [numPartitions=1] [name#43,age#44] PushedFilters: [*IsNotNull(age),
  *Not(EqualTo(age,23))], ReadSchema: struct<name:string,age:int>

Other filters, such as those involving arithmetic or user-defined functions, are not pushed down; Spark applies them itself after fetching the rows:

scala> personDf.where("age + 1 != 23").explain
== Physical Plan ==
*(1) Filter NOT ((age#44 + 1) = 23)
+- *(1) Scan JDBCRelation((select * from person) as person)
      [numPartitions=1] [name#43,age#44] PushedFilters: [*IsNotNull(age)],
      ReadSchema: struct<name:string,age:int>

On the other hand, in Spark PostgreSQL, Joins are not pushed down at all:

scala> val classDf = spark.read.jdbc(url, "(select name, class from class) as class", connectionProperties)
scala> val joinDf = personDf.join(classDf, "name")
joinDf: org.apache.spark.sql.DataFrame = [name: string, age: int ... 1
  more field]
scala> joinDf.explain
== Physical Plan ==
*(5) Project [name#0, age#1, class#14]
+- *(5) SortMergeJoin [name#0], [name#13], Inner
   :- *(2) Sort [name#0 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(name#0, 200)
   :     +- *(1) Scan JDBCRelation((select * from person) as p) 
                [numPartitions=1] [name#0,age#1] PushedFilters: 
                [*IsNotNull(name)], ReadSchema: struct<name:string,age:int>
   +- *(4) Sort [name#13 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(name#13, 200)
         +- *(3) Scan JDBCRelation((select * from class) as c) 
                [numPartitions=1] [name#13,class#14] PushedFilters: 
                [*IsNotNull(name)], ReadSchema: 
                struct<name:string,class:string>

The two tables are loaded in their entirety from Postgres, and the join is performed within Spark using a sort-merge join.
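
If you need PostgreSQL to perform the join instead, a common workaround is to express the join inside the SQL string handed to spark.read.jdbc, so the database returns the already-joined rows. A minimal sketch, assuming the same url and connectionProperties and the person and class tables from Step 1:

scala> val joinQuery = "(select p.name, p.age, c.class from person p join class c on p.name = c.name) as pc"
scala> val joinedDf = spark.read.jdbc(url, joinQuery, connectionProperties)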

Conclusion

This article teaches you about Spark PostgreSQL Integration. It provides in-depth knowledge about the concepts behind every step to help you understand and implement them efficiently. In case you want to transfer data into your desired Database/destination, then Hevo Data is the right choice for you! 

While this approach to Data Integration works, it requires a lot of Engineering Bandwidth & Expertise. This can be challenging, resource-intensive & costly in the long run. Hevo offers a much simpler, scalable, and economical solution that allows people to create a Data Pipeline without any code in minutes & without depending on Engineering teams.

Sign up for a 14-day free trial and simplify your data integration process. Check out the pricing details to understand which plan fulfills all your business needs.

FAQ on Spark PostgreSQL Integration

1. Does Spark work with Postgres?

Yes. Apache Spark can work with PostgreSQL. To read and write data, Spark can connect to PostgreSQL using JDBC (Java Database Connectivity).

2. How to read data from Postgres in Spark?

You can read data from PostgreSQL in Spark using the JDBC connector.
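
A minimal sketch of such a read, using the generic jdbc data source options and assuming a SparkSession is available as spark (as it is in the Spark shell); the host, port, database, table, and credentials below are placeholders you would replace with your own:

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/testing")   // placeholder connection string
  .option("dbtable", "t1")                                     // table (or subquery) to read
  .option("user", "postgres")                                  // placeholder credentials
  .option("password", "secret")
  .option("driver", "org.postgresql.Driver")
  .load()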

3. What is Spark in the database?

Apache Spark is an open-source, distributed computing system designed for fast and general-purpose big data processing. Spark can handle both batch and real-time data processing.

Harsh Varshney
Research Analyst, Hevo Data

Harsh is a data enthusiast with over 2.5 years of experience in research analysis and software development. He is passionate about translating complex technical concepts into clear and engaging content. His expertise in data integration and infrastructure shines through his 100+ published articles, helping data practitioners solve challenges related to data engineering.