Spark is one of the most sought-after general-purpose, distributed data-processing engines, used daily by many large organizations across a wide range of use cases. On top of Spark Core, Spark provides libraries for SQL, machine learning, graph computation, and stream processing, which can be combined in a single application. Spark MySQL integration can be performed using these libraries.

MySQL is a leading SQL-based RDBMS developed by Oracle. It provides advanced functionality along with an easy-to-use database management interface.

This article provides a guide to Spark MySQL integration, along with examples.

What is Apache Spark SQL?

Apache Spark SQL Logo

Apache Spark SQL is Spark’s module for structured data processing. The interfaces provided by Spark SQL give Spark detailed information about the structure of the data and the computations being performed, and Spark uses this extra information to perform additional optimizations. You can work with Spark SQL in a variety of ways.

One way is through the DataFrame and Dataset APIs. When computing a result, the same execution engine is used regardless of the API or language used to express the computation. Users can therefore switch easily between APIs, choosing the most natural way to express a given transformation.

There are three ways to use structured and semi-structured data in Apache Spark SQL.

  •  Spark SQL provides DataFrame abstractions in Python, Java, and Scala to simplify working with structured data. A DataFrame is a distributed collection of data organized into named columns, and it lends itself to effective optimization.
  •  Spark SQL can read and write data in various structured formats, such as JSON, Hive tables, and Parquet. SQL lets you query data both from within a Spark program and from external tools, which connect to Spark SQL through standard database connectors (JDBC/ODBC).
  •  Spark SQL is best used inside Spark applications, where it lets you load data and execute SQL queries while combining them with “normal” program code in Python, Java, or Scala.
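As a brief illustration, the last approach can be sketched in Scala. This is a minimal sketch, assuming an active SparkSession named `spark`; the file name `people.json` and the field names are illustrative only:

```scala
// Minimal Spark SQL sketch: load structured data, then query it with SQL.
// Assumes an active SparkSession `spark`; file and field names are illustrative.
val df = spark.read.json("people.json")   // read a JSON file into a DataFrame
df.createOrReplaceTempView("people")      // register it as a SQL-queryable view
val adults = spark.sql("SELECT name FROM people WHERE age >= 18")
adults.show()                             // print the matching rows to the console
```

The same DataFrame can then be manipulated further with the DataFrame API or with additional SQL queries interchangeably.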

Benefits of Spark SQL

  • Integration: Apache Spark SQL combines SQL queries with Spark programs. Structured data can be queried as distributed datasets, and the tight integration makes it possible to run SQL queries alongside complex analytic algorithms.
  • Centralized Data Access: Spark SQL can load and query data from a variety of sources, such as Apache Hive tables, Parquet files, and JSON files, through a single interface for structured data.
  • Compatibility: Apache Spark SQL runs unmodified Hive queries on existing warehouses. The Hive frontend and metastore provide full compatibility with existing Hive data, queries, and UDFs.
  • Standard Connectivity: Apache Spark SQL can be reached over ODBC and JDBC; it also includes a server mode with industry-standard JDBC and ODBC connectivity.
  • Scalability: The RDD model provides fault tolerance for large jobs and mid-query failures, and the same engine serves both interactive and long-running queries.
  • Performance Optimization: The query optimization engine in Spark SQL converts each SQL query to a logical plan and then to multiple physical execution plans, from which it selects the most optimal plan for execution.
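The last point can be observed directly: Spark SQL exposes an `explain()` method that prints the plans the optimizer produces. A minimal sketch, assuming an active SparkSession `spark` with a registered view named `employee` (as created later in this article):

```scala
// Inspect the plans produced by Spark SQL's query optimizer.
// Assumes an active SparkSession `spark` and a registered view `employee`.
val q = spark.sql("SELECT deptno, AVG(sal) AS avg_sal FROM employee GROUP BY deptno")
// Prints the parsed, analyzed, and optimized logical plans
// along with the physical plan selected for execution.
q.explain(true)
```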

Disadvantages of Spark SQL

  • No union type: with Spark SQL, we cannot create or read a table containing union fields.
  • No error for oversized varchar values: even if an inserted value exceeds the declared size limit, no error is raised. The same data is truncated when read from Hive but not when read from Spark, because Spark SQL treats varchar as a string, meaning there is no length limit.
  • No support for transactional tables: Hive transactions are not supported by Spark SQL.
  • No char type: the char type (fixed-length strings) is not supported. As with unions, we cannot create or read a table with such fields.
Simplify Data Analysis with Hevo’s No-code Data Pipeline

Hevo Data, a No-code Data Pipeline, helps load data from any data source such as databases, SaaS applications, cloud storage, SDKs, and streaming services, and simplifies the ETL process. It supports 150+ data sources (including 40+ free data sources) like Asana, and setup is a 3-step process: select the data source, provide valid credentials, and choose the destination. Hevo not only loads the data onto the desired Data Warehouse/destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.


Its completely automated pipeline delivers data in real-time without any loss from source to destination. Its fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss and supports different forms of data. The solutions provided are consistent and work with different BI tools as well.

Check out why Hevo is the Best:

  • Secure: Hevo has a fault-tolerant architecture that ensures that the data is handled in a secure, consistent manner with zero data loss.
  • Schema Management: Hevo takes away the tedious task of schema management & automatically detects the schema of incoming data and maps it to the destination schema.
  • Minimal Learning: Hevo, with its simple and interactive UI, is extremely simple for new customers to work on and perform operations.
  • Hevo Is Built To Scale: As the number of sources and the volume of your data grows, Hevo scales horizontally, handling millions of records per minute with very little latency.
  • Incremental Data Load: Hevo allows the transfer of data that has been modified in real-time. This ensures efficient utilization of bandwidth on both ends.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Live Monitoring: Hevo allows you to monitor the data flow and check where your data is at a particular point in time.

What is MySQL?

MySQL Logo

Introduced in 1995, MySQL is an effective database management system. It uses SQL (Structured Query Language) to manipulate data and perform a variety of data-related operations. MySQL is fundamentally open source and can be used by anyone free of charge. However, there are also commercial editions of MySQL designed for specific business needs, whose features require a monthly fee. Thanks to its many features and benefits, enterprises today rely on the MySQL platform for scalable and reliable data solutions.

The simplicity of MySQL gives it a clear advantage over Oracle Database and Microsoft SQL Server. In addition, you can use this DBMS platform with virtually any programming language at no cost. Another advantage of MySQL is that it runs on Linux, Unix, Windows, and other operating systems. MySQL also lets you choose your deployment mode, so you can run it in the cloud or install it on your local system.

Key features of MySQL

MySQL has become a popular DBMS on the market with the following features:

  • High Performance: The MySQL engine offers a unique combination of high processing speed and an easy-to-use interface. It can also serve many clients simultaneously, allowing for fast MySQL access from anywhere.
  • Compatibility: In addition to secure, low-latency data transactions, MySQL provides an environment compatible with many web development tools.
  • Scalability: The MySQL platform lets you scale your data load up or down at any time. It also adapts seamlessly to well-known operating systems such as Linux, OS X, and Windows.

Learn more about MySQL.

Spark MySQL Integration

The Spark MySQL integration process consists of 4 main steps:

  1. Spark MySQL: Start the spark-shell.
  2. Spark MySQL: Establish a connection to the MySQL database.
  3. Spark MySQL: Confirm the DataFrame by showing the schema of the table.
  4. Spark MySQL: Register the data as a temporary table for future SQL queries.


We will load the data present in a MySQL table into Spark using JDBC. The components are Spark 2.x and MySQL.

  1. Spark MySQL: Create sample data.

2. Spark MySQL: Create a table in MySQL using:

    CREATE TABLE employee (
          empno INT,
          ename VARCHAR(100),
          designation VARCHAR(100),
          manager INT,
          hire_date VARCHAR(50),
          sal DECIMAL(10,2),
          deptno INT
    );
    INSERT INTO employee (empno, ename, designation, manager, hire_date, sal, deptno)
    VALUES (7369,'SMITH','CLERK',7902,'1980-12-17',800.00,20);


Spark Mysql: create table

3. Spark MySQL: Spark code to establish the connection for Spark MySQL integration and load data from MySQL into a Spark DataFrame.

    import org.apache.spark.sql.SparkSession

    object ReadDataFromJdbc {
      def main(args: Array[String]): Unit = {
        val sourceTable = args(0)
        // Spark session set up
        val spark = SparkSession.builder()
          .appName("Read JDBC Data: " + sourceTable)
          .getOrCreate()
        try {
          // JDBC connection details
          val driver = "com.mysql.jdbc.Driver"
          val url = "jdbc:mysql://localhost:3306/bdp"
          val user = "root"
          val pass = "Password"
          // JDBC connection: load the MySQL table into a DataFrame
          val sourceDf = spark.read.format("jdbc")
            .option("driver", driver)
            .option("url", url)
            .option("dbtable", sourceTable)
            .option("user", user)
            .option("password", pass)
            .load()
          // Read data from the DataFrame
          sourceDf.show()
        } catch {
          case e: Throwable => println("Connectivity Failed for Table ", e)
        }
      }
    }

Executing this code establishes the connection for Spark MySQL integration.


Spark Mysql: create connection

4. Spark MySQL: Execute the code in the spark-shell. The MySQL connector JAR is also required to connect to the MySQL table.

    scala>:require /home/bdp/jars/mysql-connector-java-5.1.30-bin.jar
    scala>:load /home/bdp/codebase/ReadDataFromJdbc.scala
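Steps 3 and 4 of the process, confirming the DataFrame's schema and registering the loaded data as a temporary table, can then be sketched as follows (assuming `sourceDf` is the DataFrame loaded above and `spark` is the active SparkSession in the shell):

```scala
// Confirm the DataFrame by showing the schema of the table (step 3),
// then register it as a temporary view for SQL queries (step 4).
// Assumes `sourceDf` was loaded from MySQL and `spark` is the active SparkSession.
sourceDf.printSchema()
sourceDf.createOrReplaceTempView("employee")
spark.sql("SELECT ename, sal FROM employee WHERE deptno = 20").show()
```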


Spark Mysql: table in spark shell


This article gave a comprehensive guide to Spark MySQL integration, with a detailed example to guide you through the process.

Spark and MySQL are trusted tools that many companies use for their many benefits, but transferring data from them into a data warehouse is a hectic task. An automated data pipeline helps solve this issue, and this is where Hevo comes into the picture. Hevo Data is a No-code Data Pipeline with 150+ pre-built integrations that you can choose from.

Visit our website to explore Hevo.

Hevo can help you Integrate your data from numerous sources and load them into a destination to Analyze real-time data with a BI tool such as Tableau. It will make your life easier and data migration hassle-free. It is user-friendly, reliable, and secure.

SIGN UP for a 14-day free trial and see the difference!

Share your experience of learning about Spark MySQL integration in the comments section below.

Arsalan Mohammed
Research Analyst, Hevo Data

Arsalan is a research analyst at Hevo and a data science enthusiast with over two years of experience in the field. He completed his degree in computer science with a specialization in Artificial Intelligence and finds joy in sharing his knowledge with data practitioners. His interest in data analysis and architecture has driven him to write nearly a hundred articles on various topics related to the data industry.

No-code Data Pipeline For Your Data Warehouse