One of the most popular all-purpose, distributed data-processing engines is Apache Spark. It is used regularly by many large enterprises for a wide variety of use cases. On top of the Spark core, Spark includes libraries for SQL, Machine Learning, Graph Computation, and Stream Processing that can be used together in a single application.

PostgreSQL is a free and Open-Source Relational Database Management System (RDBMS) that stresses extensibility and SQL conformance. PostgreSQL stores data in structured relational objects rather than documents, and it uses standard SQL syntax and style. It helps you manage data regardless of how large, small, or diverse the dataset is, so you can use it to manage or analyze your big data in a variety of ways, including with Apache Spark.

In this article, you will learn how to set up Spark PostgreSQL Integration and walk through the operations you are likely to carry out while using it, so that the process runs smoothly.

What is PostgreSQL?


PostgreSQL, sometimes known as Postgres, is a free and Open-Source Relational Database Management System (RDBMS) that emphasizes extensibility and SQL conformance. It was created at the University of California, Berkeley, as the successor to the Ingres database, and was initially released on July 8, 1996. PostgreSQL stores data in structured objects rather than documents and employs conventional SQL syntax and formatting.

It’s written in C and has a monolithic design, which means that all of its components are tightly linked and work together in a logical order. Free community support is available, and commercial support is offered by third-party vendors. It is extensively used in the Healthcare, Finance, and Industrial sectors thanks to its robust backup and recovery capabilities.

Key Features of PostgreSQL

PostgreSQL offers a few unique features that make it a viable alternative to other mainstream RDBMSs. Some of these qualities are as follows:

  • It complies with the ACID (Atomicity, Consistency, Isolation, and Durability) standards and runs on a variety of operating systems.
  • Due to many fail-safes and redundancy, Storage is dependable.
  • Anyone can use all of its features for free because it is Open-Source.
  • It scales vertically rather than horizontally: its processing capacity is determined by the machine it runs on.
  • PostgreSQL is capable of handling a diverse set of data, documents, and customizations.
  • It has a Monolithic Architecture, which implies that all of the components automatically work together.
  • It offers a robust Access Control System that includes features like row and column level security as well as multi-factor authentication with certificates.
  • It’s ideal for transactional workloads, such as those found in banking systems, risk assessments, BI (Business Intelligence), and powering a wide range of business applications.

To learn more about PostgreSQL, you can check the official website here.

What is Apache Spark?


Hadoop is a popular tool in the industry for analyzing enormous amounts of data. This is because the Hadoop framework is built on the MapReduce programming style, which enables a Scalable, Flexible, Fault-Tolerant, and Cost-Effective Computing Solution. The key goal is to keep processing huge Datasets as fast as possible in terms of Query Response Time and Program Execution Time.

Spark was introduced by the Apache Software Foundation to speed up the Hadoop computational process. Apache Spark is an Open-Source, Scalable, and Distributed General-Purpose Computing Engine for processing and analyzing huge data files from a variety of sources, including HDFS, S3, Azure, and others.

Key Features of Apache Spark

  • Lightning-Fast Speed: Spark, as a Big Data tool, must meet organizations’ demands for high-speed processing of large amounts of data. To do so, it relies on the Resilient Distributed Dataset (RDD), which stores data transparently in memory and only touches disk when needed. The advantage is that disk read and write times are reduced, resulting in much higher speed (see the short caching sketch after this list).
  • Supports Advanced Analytical Techniques: In addition to map and reduce operations, Spark also enables SQL queries, data streaming, and advanced analytics. This is made feasible by high-level components like MLlib, Spark Streaming, and Spark SQL.
  • Real-Time Stream Processing: This tool is designed to handle real-time data streaming. Spark is capable of recovering lost work and delivering high-level functionality without the need for additional code.
  • Usability: The software allows you to develop scalable apps in Java, Python, R, and Scala, among other languages. Developers can query data from these languages’ shells.
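
The in-memory behavior described above can be illustrated with a few lines of Scala. This is only a minimal sketch run against the SparkContext sc that spark-shell provides; the dataset and names are made up for illustration:

// Build an RDD from a generated range of numbers
val numbers = sc.parallelize(1 to 1000000)

// cache() keeps the filtered RDD in memory after the first action
val evens = numbers.filter(_ % 2 == 0).cache()

println(evens.count())  // first action computes and caches the RDD
println(evens.sum())    // second action is served from the in-memory cache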

To learn more about Apache Spark, you can check the official website here.

What are the Key Libraries Supported by Apache Spark?

Apache Spark comes with a number of libraries, including:

  • Spark SQL: It is a module for working with structured data using SQL or the DataFrame API. It gives you a unified interface to Hive, Avro, Parquet, ORC, JSON, and JDBC, among other data sources, and you can even combine data from different sources (see the short sketch after this list).
  • Spark Streaming: It makes it simple to create scalable, fault-tolerant streaming applications by providing a language-integrated API for stream processing, allowing you to construct streaming jobs in the same manner that batch jobs are written. Java, Scala, and Python are all supported. Spark Streaming recovers both lost work and operator state right out of the box, so you don’t have to write any additional code. It allows you to reuse the same code for batch processing, joining streams to historical data, and doing ad-hoc stream state queries.
  • MLlib (Machine Learning): A scalable Machine Learning library. MLlib includes high-quality algorithms that use iteration to produce better results than the one-pass approximations that MapReduce occasionally uses.
  • GraphX: It is a Graph API that allows for graph-parallel processing. GraphX is a single system that combines ETL, exploratory analysis, and iterative graph computation. The Pregel API allows you to see the same data as graphs and collections, rapidly transform and combine graphs with RDDs, and implement custom iterative graph algorithms.
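
As a small illustration of the Spark SQL library, the following sketch reads a JSON file into a DataFrame and queries it with both SQL and the DataFrame API. The file name people.json and its columns are hypothetical:

import org.apache.spark.sql.SparkSession

// Create (or reuse) a SparkSession -- spark-shell already provides one as `spark`
val spark = SparkSession.builder()
  .appName("SparkSqlExample")
  .master("local[*]")
  .getOrCreate()

// Load semi-structured JSON data into a DataFrame (hypothetical file)
val people = spark.read.json("people.json")

// Query it with plain SQL via a temporary view...
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()

// ...or with the equivalent DataFrame API call
people.filter("age > 21").select("name", "age").show()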

How to do Spark PostgreSQL Integration?

Step 1: Install the PostgreSQL JDBC Driver

The first step in Spark PostgreSQL Integration is to install and run the PostgreSQL server, for example on localhost on the default port 5432. Create a test_db database with two tables, person and class, which will also be used later in the pushdown examples:

test_db=# create table person(name text, age int);
test_db=# insert into person values('shu', 21);
test_db=# insert into person values('zhe', 20);
test_db=# select * from person;
 name | age
------+-----
 shu  |  21
 zhe  |  20
(2 rows)

test_db=# create table class(name text, class text);
test_db=# insert into class values ('shu', 'nursing'),
('shu', 'coffee'),
('zhe', 'maths'),
('zhe', 'climbing');

test_db=# select * from class;
 name | class
------+----------
 shu  | nursing
 shu  | coffee
 zhe  | maths
 zhe  | climbing
(4 rows)

You will need the PostgreSQL JDBC driver to connect Apache Spark to your PostgreSQL database. It is available for download from the official PostgreSQL JDBC site:

$ wget https://jdbc.postgresql.org/download/postgresql-42.2.6.jar

Step 2: Install Apache Spark Packages

The next step in Spark PostgreSQL Integration is to set up Apache Spark. Download the Spark package and extract it:

$ wget http://us.mirrors.quenda.co/apache/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
$ tar zxvf spark-2.4.3-bin-hadoop2.7.tgz
$ cd spark-2.4.3-bin-hadoop2.7/

Step 3: Execute Apache Spark Shell on your System

You will need Java installed on your server/system to launch the Spark Shell for Spark PostgreSQL Integration:

$ yum install java

You can now execute your Spark Shell:

$ ./bin/spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://ApacheSpark1:4040
Spark context available as 'sc' (master = local[*], app id = local-1563907528854).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/
 
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.
 
scala>

The PostgreSQL JDBC Driver must be added to the Spark shell:

scala> :require /path/to/postgresql-42.2.6.jar
Added '/path/to/postgresql-42.2.6.jar' to classpath.
scala> import java.util.Properties
import java.util.Properties

Step 4: Add JDBC Driver Information in Spark

The final step in Spark PostgreSQL Integration is to add the following JDBC information for Spark to use:

scala> val url = "jdbc:postgresql://localhost:5432/testing"
url: String = jdbc:postgresql://localhost:5432/testing
scala> val connectionProperties = new Properties()
connectionProperties: java.util.Properties = {}
scala> connectionProperties.setProperty("Driver", "org.postgresql.Driver")
res6: Object = null
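
In most real deployments the database will also require credentials. These are supplied through the same Properties object using Spark's standard JDBC user and password options; the values below are hypothetical placeholders:

// Hypothetical credentials -- replace with the ones for your own server
connectionProperties.setProperty("user", "postgres")
connectionProperties.setProperty("password", "secret")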

You have successfully set up Spark PostgreSQL Integration!

How to use Spark PostgreSQL Together? 

The following are the most common operations used when running Spark and PostgreSQL together. The examples below assume that your PostgreSQL cluster is already up and running, in this case a PostgreSQL 11 server on CentOS 7.

Set up your PostgreSQL Database

Let’s start by creating a test database on your PostgreSQL server:

postgres=# CREATE DATABASE testing;
CREATE DATABASE
postgres=# \c testing
You are now connected to database "testing" as user "postgres".

Create Tables in your PostgreSQL Database

You’re going to make a new table called t1:

testing=# CREATE TABLE t1 (id int, name text);
CREATE TABLE

Insert Data into your PostgreSQL Tables

You can insert some data there:

testing=# INSERT INTO t1 VALUES (1,'name1');
INSERT 0 1
testing=# INSERT INTO t1 VALUES (2,'name2');
INSERT 0 1

You can verify the inserted data:

testing=# SELECT * FROM t1;
 id | name
----+-------
  1 | name1
  2 | name2
(2 rows)

Create a Spark Dataframe for PostgreSQL Data

First, define the query you want Spark to run against PostgreSQL. Let’s start with query1, which will be SELECT * FROM t1, your testing table. Spark’s JDBC reader expects the query to be wrapped in parentheses and given an alias so that it can be used as a subquery:

scala> val query1 = "(SELECT * FROM t1) as q1"
query1: String = (SELECT * FROM t1) as q1

Using this query string, you can create the DataFrame as follows:

scala> val query1df = spark.read.jdbc(url, query1, connectionProperties)
query1df: org.apache.spark.sql.DataFrame = [id: int, name: string]

Run SQL Queries on your Spark Dataframe

Now you can do anything with this DataFrame:

scala> query1df.show()
+---+-----+
| id| name|
+---+-----+
|  1|name1|
|  2|name2|
+---+-----+
scala> query1df.explain
== Physical Plan ==
*(1) Scan JDBCRelation((SELECT * FROM t1) as q1) [numPartitions=1] [id#19,name#20] PushedFilters: [], ReadSchema: struct<id:int,name:string>
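
Since query1df is an ordinary Spark DataFrame, you can also register it as a temporary view and query it with Spark SQL. This is a minimal sketch; the view name t1_spark is hypothetical:

// Register the JDBC-backed DataFrame as a temporary view
query1df.createOrReplaceTempView("t1_spark")

// Run Spark SQL over it; the data is still read through the PostgreSQL JDBC relation
spark.sql("SELECT count(*) AS total FROM t1_spark").show()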

Verify Query Outputs in PostgreSQL

You can insert more rows in PostgreSQL and run the Spark query again to confirm that the newly added values are returned as well.

PostgreSQL

testing=# INSERT INTO t1 VALUES (10,'name10'), (11,'name11'), (12,'name12'), (13,'name13'), (14,'name14'), (15,'name15');
INSERT 0 6
testing=# SELECT * FROM t1;
 id |  name
----+--------
  1 | name1
  2 | name2
 10 | name10
 11 | name11
 12 | name12
 13 | name13
 14 | name14
 15 | name15
(8 rows)

Spark

scala> query1df.show()
+---+------+
| id|  name|
+---+------+
|  1| name1|
|  2| name2|
| 10|name10|
| 11|name11|
| 12|name12|
| 13|name13|
| 14|name14|
| 15|name15|
+---+------+

How to use Pushdown Operators with Dataset and Dataframes in Spark PostgreSQL? 

When the computation is expressed with Datasets or DataFrames in Spark PostgreSQL, it turns out that simple filters are pushed down to the database, but more sophisticated ones are not. Joins are not pushed down either. The examples below use a personDf DataFrame read over JDBC from the person table created in Step 1 (see the sketch right after this paragraph).
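
The walkthrough does not show how personDf was created; here is a minimal sketch, assuming the url and connectionProperties from Step 4 point at the database that contains the person table:

// Read the person table through JDBC as a subquery aliased `person`
val personDf = spark.read.jdbc(url, "(select * from person) as person", connectionProperties)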

Simple Filters:

scala> personDf.where("age != 23").explain
== Physical Plan ==
*(1) Scan JDBCRelation((select * from person) as person)
  [numPartitions=1] [name#43,age#44] PushedFilters: [*IsNotNull(age),
  *Not(EqualTo(age,23))], ReadSchema: struct<name:string,age:int>

Other filters, such as those that involve arithmetic or user-defined operations or functions, are not pushed down:

scala> personDf.where("age + 1 != 23").explain
== Physical Plan ==
*(1) Filter NOT ((age#44 + 1) = 23)
+- *(1) Scan JDBCRelation((select * from person) as person)
      [numPartitions=1] [name#43,age#44] PushedFilters: [*IsNotNull(age)],
      ReadSchema: struct<name:string,age:int>

On the other hand, in Spark PostgreSQL, Joins are not pushed down at all:

scala> val classDf = spark.read.jdbc(url, "(select name, class from class) as class", connectionProperties)
scala> val joinDf = personDf.join(classDf, "name")
joinDf: org.apache.spark.sql.DataFrame = [name: string, age: int ... 1 more field]
scala> joinDf.explain
== Physical Plan ==
*(5) Project [name#0, age#1, class#14]
+- *(5) SortMergeJoin [name#0], [name#13], Inner
   :- *(2) Sort [name#0 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(name#0, 200)
   :     +- *(1) Scan JDBCRelation((select * from person) as p) 
                [numPartitions=1] [name#0,age#1] PushedFilters: 
                [*IsNotNull(name)], ReadSchema: struct<name:string,age:int>
   +- *(4) Sort [name#13 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(name#13, 200)
         +- *(3) Scan JDBCRelation((select * from class) as c) 
                [numPartitions=1] [name#13,class#14] PushedFilters: 
                [*IsNotNull(name)], ReadSchema: 
                struct<name:string,class:string>

Both tables are loaded in their entirety from Postgres, and the join is then performed within Spark using a Sort-Merge Join.
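
If you want the join to run inside PostgreSQL instead, one workaround (not part of the walkthrough above, so treat it as a sketch) is to embed the join in the subquery that you pass to spark.read.jdbc:

// Let PostgreSQL perform the join; Spark only scans the joined result
val joinQuery = "(select p.name, p.age, c.class from person p join class c on p.name = c.name) as pc"

val joinedDf = spark.read.jdbc(url, joinQuery, connectionProperties)
joinedDf.show()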

Simplify PostgreSQL ETL & Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 100+ Data Sources, including PostgreSQL and 40+ Free Sources. Setting it up is a 3-step process: select the data source, provide valid credentials, and choose the destination.

Hevo loads the data onto the desired Data Warehouse/destination in real-time and enriches and transforms it into an analysis-ready form without your having to write a single line of code. Its completely automated, fault-tolerant, and scalable pipeline architecture ensures that data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

GET STARTED WITH HEVO FOR FREE

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled securely and consistently with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.

Simplify your Data Analysis with Hevo today! 

SIGN UP HERE FOR A 14-DAY FREE TRIAL!

Conclusion

This article teaches you about Spark PostgreSQL Integration. It provides in-depth knowledge about the concepts behind every step to help you understand and implement them efficiently. In case you want to transfer data into your desired Database/destination, then Hevo Data is the right choice for you! 

Visit our Website to Explore Hevo

Hevo Data, a No-code Data Pipeline provides you with a consistent and reliable solution to manage Data transfer between a variety of sources and destinations with a few clicks. Hevo with its strong integration with 100+ sources & BI tools allows you to not only export Data from your desired Data sources & load it to the destination of your choice, but also transform & enrich your Data to make it analysis-ready so that you can focus on your key business needs and perform insightful analysis using BI tools. 

While PostgreSQL is a good solution for Data Integration, it requires a lot of Engineering Bandwidth & Expertise. This can be challenging, resource-intensive & costly in the long run. Hevo offers a much simpler, scalable, and economical solution that allows people to create Data Pipelines without any code in minutes & without depending on Engineering teams.

Want to take Hevo for a spin? Sign Up for a 14-day free trial. You may also have a look at the pricing, which will assist you in selecting the best plan for your requirements.

Share your experience of learning about Spark PostgreSQL Integration! Let us know in the comments section below!

Former Research Analyst, Hevo Data

Harsh is a former Research Analyst with experience in research analysis and a passion for data, software architecture, and technical writing. He has written more than 100 articles on data integration and infrastructure.
