Organizations today leverage distributed frameworks to process real-time data at scale. Apache Spark is one such open-source framework. It processes data in memory (RAM), which makes it significantly faster than disk-based engines. Besides real-time data processing, Spark also lets users build data models using its Machine Learning and Deep Learning APIs. One such standardized Machine Learning API is MLlib, which is used to develop Machine Learning data models that can perform basic Statistics, Classification, and more on your data.

In this tutorial, you will learn to work with the Spark Data Model by building a Machine Learning model with MLlib.

Prerequisites

  • Basic understanding of Data Analysis.

What is Apache Spark?


Developed in 2009, Apache Spark is an open-source cluster computing framework for real-time Data Processing. With cluster computing, you can install Spark on multiple machines and group them into a cluster to perform tasks together. Apache Spark was developed to replace Hadoop's data processing technique. Hadoop is an Apache open-source framework for Big Data processing implemented in Java. It uses a data processing technique called MapReduce, which processes data on disk, resulting in slow performance. In Apache Spark, data processing is faster because of in-memory computation: Spark takes your data into RAM, processes it, and quickly produces results. Therefore, Spark is considered one of the fastest engines for large-scale Data Processing.

Spark is also defined as a unified engine for large-scale data processing, which means it can handle almost everything in the Data Science or Machine Learning workflow. It can be used to run SQL queries, create data pipelines, execute Machine Learning algorithms, work with graphs or data streams, and more.

Spark architecture consists of:

1) RDD (Resilient Distributed Dataset)

RDD is the fundamental data structure of Spark. It is an immutable, distributed collection of objects. Each dataset in an RDD is divided into logical partitions that can be computed on different nodes of the cluster. Internally, Spark distributes the data in an RDD across nodes to achieve parallelism. RDDs support two kinds of operations: transformations and actions. Transformations, such as filter, map, and flatMap, build a new RDD from an existing one, while actions, such as collect, reduce, takeSample, and take, return a result to the driver.
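
For illustration, here is a minimal PySpark sketch of RDD transformations and actions (the numbers are arbitrary sample data, not part of this tutorial's dataset):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('RDD Example').getOrCreate()
sc = spark.sparkContext

# Create an RDD from a small in-memory collection
numbers = sc.parallelize([1, 2, 3, 4, 5, 6])

# Transformations are lazy; nothing is computed yet
evens = numbers.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# Actions trigger the computation and return results to the driver
print(squares.collect())                    # [4, 16, 36]
print(squares.reduce(lambda a, b: a + b))   # 56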

If a partition of an RDD is lost, for example due to a node or network failure, Spark's fault-tolerance feature recomputes it by replaying the recorded transformations (the RDD's lineage) before executing the actions.

2) DAG (Directed Acyclic Graph)

A DAG is used to create a logical execution plan for RDDs. It is a set of vertices and edges, where the vertices represent the RDDs and the edges represent the operations applied to them. The DAG gives a visual representation of the RDDs and their operations, and Spark uses it to schedule the stages of a job.
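
To see the lineage behind this plan, you can print an RDD's debug string; below is a minimal sketch that reuses the SparkContext sc from the earlier example:

# Chain a few transformations, then inspect the lineage Spark turns into a DAG of stages
rdd = sc.parallelize(range(10)).map(lambda x: x + 1).filter(lambda x: x > 5)

# toDebugString() returns bytes in PySpark, hence the decode()
print(rdd.toDebugString().decode())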

Key Features of Apache Spark

Some of the key features of Apache Spark are as follows:

1) Fast Processing Speed

Big Data processing involves working with large volumes of complex data, so organizations look for frameworks that can process such massive data quickly. Apache Spark can run workloads up to 100x faster in memory and up to 10x faster on disk than Hadoop MapReduce. It relies on Resilient Distributed Datasets (RDDs), which allow Spark to transparently store data in memory.

2) Multiple Languages and Libraries

Spark enables developers to write scalable applications in Java, Scala, Python, and R. It offers over 80 high-level operators that make it easy to build parallel applications, and it can be queried interactively from the Scala, Python, R, and SQL shells. Spark also provides a stack of libraries, including MLlib for Machine Learning, Spark SQL and DataFrames, GraphX, and Spark Streaming.

3) Flexibility

Spark can run independently in standalone cluster mode or on Hadoop YARN, Apache Mesos, Kubernetes, and more. It can also read data from Hadoop data sources such as HBase, HDFS, Hive, and Cassandra.
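
As a minimal sketch (the HDFS host and path below are hypothetical, and an active SparkSession named spark is assumed), reading a file stored in HDFS into a DataFrame looks like this:

# Hypothetical HDFS location; replace the namenode and path with your own
df_hdfs = spark.read.csv("hdfs://namenode:9000/data/train.csv", header=True)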

Simplify your Data Transformation Process Using Hevo’s No-Code Data Pipeline

Are you looking for an ETL tool to migrate your data? Migrating your data can become seamless with Hevo’s no-code, intuitive platform. With Hevo, you can:

  • Automate Data Extraction: Effortlessly pull data from various sources and destinations with 150+ pre-built connectors.
  • Transform Data effortlessly: Use Hevo’s drag-and-drop feature to transform data with just a few clicks.
  • Seamless Data Loading: Quickly load your transformed data into your desired destinations, such as BigQuery.
  • Transparent Pricing: Hevo offers transparent pricing with no hidden fees, allowing you to budget effectively while scaling your data integration needs.

Try Hevo and join a growing community of 2000+ data professionals who rely on us for seamless and efficient migrations.


Getting started with the Apache Spark Data Model 

You can build a basic ML-based Spark Data Model using Apache Spark and Python. For this, you need PySpark, the Python API for Apache Spark, which exposes the framework and its libraries for real-time, large-scale data processing.

Follow the below steps to install PySpark on your machine.

  • Install Java 8 or a higher version on your computer.
  • Install Python 3.6 or later (for example, from Anaconda).
  • Install PySpark using the below command in the terminal.
pip3 install pyspark
  • To test your installation, open Python from the terminal and run the below command.
import pyspark
  • Check the installed version of Spark using the below command.
pyspark.__version__

1) Creating Apache Spark Machine learning (ML) Data Model

Spark ships with MLlib, a Machine Learning library for building Machine Learning data models. To use MLlib to create an ML-based Spark Data Model, you should know the following MLlib terminology.

  1. DataFrame: A dataset organized into columns. MLlib uses the DataFrame from Spark SQL as its ML dataset, which can hold a variety of data types.
  2. Transformer: An algorithm that transforms one DataFrame into another. For instance, an ML model is a Transformer that transforms a DataFrame of features into a DataFrame with predictions.
  3. Estimator: An algorithm that can be fit on a DataFrame to produce a Transformer. For example, a learning algorithm is an Estimator that trains on a DataFrame and produces a model.
  4. Pipeline: Chains multiple Transformers and Estimators together to specify an ML workflow.
  5. Parameter: All Transformers and Estimators share a common API for specifying parameters.

You can read more about these APIs in the Spark MLlib documentation. The sketch below shows how these pieces fit together.
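
As a small, purely illustrative sketch (the toy data and column names below are made up for the example and are not part of the Titanic dataset), an Estimator such as StringIndexer can be chained with a classifier inside a Pipeline:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName('Pipeline Sketch').getOrCreate()

# Toy DataFrame: a categorical column, a numeric column, and a label
toy = spark.createDataFrame(
    [('a', 1.0, 0.0), ('b', 2.0, 1.0), ('a', 3.0, 1.0)],
    ['category', 'value', 'label'])

indexer = StringIndexer(inputCol='category', outputCol='category_idx')    # Estimator
assembler = VectorAssembler(inputCols=['category_idx', 'value'],
                            outputCol='features')                         # Transformer
lr = LogisticRegression(labelCol='label', featuresCol='features')         # Estimator

# A Pipeline chains the stages; fitting it produces a PipelineModel (a Transformer)
model = Pipeline(stages=[indexer, assembler, lr]).fit(toy)
model.transform(toy).select('features', 'label', 'prediction').show()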

In this example, we will use the very popular Titanic dataset for working with Spark.

2) Load the dataset into Spark

To load the dataset into Spark, use the Spark DataFrames. Start the Spark Session using the below command.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('Titanic Data') \
    .getOrCreate()
  • SparkSession.builder: This initializes a builder for creating a new SparkSession.
  • .appName('Titanic Data'): Sets the name of the Spark application to “Titanic Data”. This name will appear in the Spark UI and logs, helping to identify the application.
  • .getOrCreate(): This method either retrieves an existing Spark session or creates a new one if none exists.

When you evaluate the spark object in your notebook, you will see the details below.

Dataset loaded into Spark

The above details show that you are using Spark locally with version 3.0.0. The name of the session is ‘Titanic Data.’

Use the below command to read the .csv file of the dataset.

df = (spark.read
         .format("csv")
         .option('header', 'true')
         .load("train.csv"))
  • spark.read: This accesses the DataFrameReader from the Spark session, which is used to read data into a DataFrame.
  • .format("csv"): Specifies that the data format being read is CSV.
  • .option('header', 'true'): Indicates that the first row of the CSV file contains column headers, allowing Spark to use them as the DataFrame’s column names.
  • .load("train.csv"): Loads the CSV file named train.csv into a DataFrame named df.

To see the internals of the dataset, use the below command.

df.show(5)

It will show the top 5 rows of the dataset.

Top 5 rows of dataset

You can check the types of columns in the dataset using df.dtypes.

Output:

Types of columns

Notice that most of the columns in the dataset are of string type. However, the Spark ML library works only on numeric data, so you can cast the relevant columns to numeric types using the command below.

from pyspark.sql.functions import col

dataset = df.select(col('Survived').cast('float'),
                    col('Pclass').cast('float'),
                    col('Sex'),
                    col('Age').cast('float'),
                    col('Fare').cast('float'),
                    col('Embarked')
                   )

dataset.show()
  • from pyspark.sql.functions import col: Imports the col function, which is used to reference DataFrame columns.
  • df.select(...): Selects specific columns from the DataFrame df.
  • col('ColumnName').cast('DataType'): Converts the specified columns to the given data type (in this case, float for numerical columns).
  • Columns Selected:
    • Survived: Cast to float
    • Pclass: Cast to float
    • Sex: Retained as is
    • Age: Cast to float
    • Fare: Cast to float
    • Embarked: Retained as is
  • dataset.show(): Displays the contents of the newly created DataFrame dataset.

Output: 

Spark data model

To eliminate the null values in the dataset, use the below command.


dataset = dataset.replace('?', None) \
        .dropna(how='any')

Since the ‘Sex’ and ‘Embarked’ columns in the Titanic dataset contain non-numeric data, you will have to encode them. Encode the Sex and Embarked columns using StringIndexer.

from pyspark.ml.feature import StringIndexer

dataset = StringIndexer(
    inputCol='Sex',
    outputCol='Gender',
    handleInvalid='keep').fit(dataset).transform(dataset)

dataset = StringIndexer(
    inputCol='Embarked',
    outputCol='Boarded',
    handleInvalid='keep').fit(dataset).transform(dataset)

dataset.show()
  • from pyspark.ml.feature import StringIndexer: Imports the StringIndexer class used for encoding categorical variables.
  • First StringIndexer for ‘Sex’:
    • inputCol='Sex': Specifies the column to be indexed.
    • outputCol='Gender': The new column to hold the indexed values.
    • handleInvalid='keep': Assigns invalid or unseen labels to an extra index instead of throwing an error.
    • .fit(dataset).transform(dataset): Fits the indexer to the dataset and transforms it, adding the new ‘Gender’ column.
  • Second StringIndexer for ‘Embarked’:
    • Similar to the first, it indexes the ‘Embarked’ column to create a ‘Boarded’ column.
  • dataset.show(): Displays the resulting DataFrame after transformations.

Output:

Output

The two new columns, ‘Gender’ and ‘Boarded,’ contain the same information as the ‘Sex’ and ‘Embarked’ columns, but in numeric form. Delete the old ‘Sex’ and ‘Embarked’ columns, as you no longer need them.

dataset = dataset.drop('Sex')
dataset = dataset.drop('Embarked')

dataset.show()
  • dataset = dataset.drop('Sex'): This line removes the ‘Sex’ column from the dataset DataFrame.
  • dataset = dataset.drop('Embarked'): This line removes the ‘Embarked’ column from the dataset DataFrame.
  • dataset.show(): Displays the resulting DataFrame after the specified columns have been dropped.

Output: 

Output

Spark performs prediction on a single column in which all of the dataset’s attributes are stored together as a list (a feature vector).

Dataset attributes

To predict the class label ‘Survived’, you combine the attributes Pclass, Age, Fare, Gender, and Boarded into one column, as shown below.

Dataset attributes

You can combine all of these attributes into a single column with VectorAssembler, using the below command.

required_features = ['Pclass',
                     'Age',
                     'Fare',
                     'Gender',
                     'Boarded'
                    ]

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=required_features, outputCol='features')

transformed_data = assembler.transform(dataset)
  • required_features: A list that specifies the columns from the dataset that will be used as features for the machine learning model.
  • VectorAssembler: This class is used to combine multiple feature columns into a single vector column named ‘features’.
  • assembler.transform(dataset): Applies the transformation to the dataset, resulting in a new DataFrame (transformed_data) that includes the ‘features’ column containing the combined values from the specified feature columns.

Output:

Output

3) Data Modelling

Split the dataset into training and testing using the below command.

(training_data, test_data) = transformed_data.randomSplit([0.8,0.2])

In the Data Modelling process, you build an ML model and fit it to the Titanic dataset. Here, you will use a Random Forest Classifier to perform the classification.

Create the Random Forest Classifier using the below command.

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol='Survived',
                            featuresCol='features',
                            maxDepth=5)
  • RandomForestClassifier: This is a machine learning model that uses an ensemble of decision trees to make predictions. It is particularly useful for classification tasks.
  • labelCol='Survived': Specifies that the target variable for classification is in the column named ‘Survived’. This column contains the actual labels (0 or 1) that the model will try to predict.
  • featuresCol='features': Indicates that the features used for making predictions are stored in the column named ‘features’, which should contain the feature vector created earlier.
  • maxDepth=5: This parameter sets the maximum depth of each tree in the Random Forest. Limiting the depth can help prevent overfitting.

4) Fit the Model

model = rf.fit(training_data)

Fitting returns a trained model (a Transformer) that can be used to make predictions on the test dataset.

predictions = model.transform(test_data)
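
Optionally, you can compare the predicted labels with the true labels; a minimal sketch:

# Show actual vs. predicted survival for a few test rows
predictions.select('Survived', 'prediction').show(5)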

Check the accuracy of your Spark data model.

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(
    labelCol='Survived',
    predictionCol='prediction',
    metricName='accuracy')
  • MulticlassClassificationEvaluator: This is a class used to evaluate the performance of classification models on multiclass classification tasks.
  • labelCol='Survived': Specifies the column in the dataset that contains the true labels (0 or 1) for the classification task.
  • predictionCol='prediction': Indicates the column that contains the predictions made by the model.
  • metricName='accuracy': Sets the metric used to evaluate the model’s performance, in this case accuracy (the fraction of correct predictions).

Compute and print the test accuracy using the below command.

accuracy = evaluator.evaluate(predictions)
print('Test Accuracy = ', accuracy)

Output:

Test Accuracy =  0.843

This is how you can start working with the Spark Data Model!

What Makes Hevo’s Data Transformation Process Best-In-Class

Transforming data can be a mammoth task without the right set of tools. Hevo’s automated platform empowers you with everything you need to have a smooth Data Collection, Processing, and Integration experience. Our platform has the following in store for you!

  • Exceptional Security: A Fault-tolerant Architecture that ensures Zero Data Loss.
  • Built to Scale: Exceptional Horizontal Scalability with Minimal Latency for Modern-data Needs.
  • Built-in Connectors: Support for 100+ Data Sources, including Databases, SaaS Platforms, Files & More. Native Webhooks & REST API Connector available for Custom Sources.
  • Blazing-fast Setup: Straightforward interface for new customers to work on, with minimal setup time.
  • Data Transformations: Best-in-class & Native Support for Complex Data Transformation at fingertips. Code & No-code Flexibility is designed for everyone.
  • Live Support: The Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
  • Auto Schema Mapping: Hevo takes away the tedious task of schema management & automatically detects the format of incoming data and replicates it to the destination schema.
Sign up here for a 14-Day Free Trial!

Conclusion

In this tutorial, you learned to work with the Spark Data Model using a Machine Learning algorithm, Random Forest, on the Titanic dataset. You can explore other machine learning algorithms, like decision trees and linear regression, with Apache Spark. Apache Spark is mainly used by applications or businesses that produce extensive data, such as banking, e-commerce, healthcare, and entertainment.

In case you want to export data from a source of your choice into your desired Database/destination, Hevo Data is the right choice for you!

Hevo Data, a No-code Data Pipeline, provides you with a consistent and reliable solution to manage data transfer between a variety of sources and a wide variety of desired destinations with a few clicks. With its strong integration with 150+ sources (including 60+ free sources), Hevo allows you to not only export data from your desired data sources and load it to the destination of your choice, but also transform and enrich your data to make it analysis-ready, so you can focus on your key business needs and perform insightful analysis using BI tools.

Want to take Hevo for a spin?

Sign up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You can also have a look at the unbeatable pricing that will help you choose the right plan for your business needs.

Share your experience of learning about Spark Data Model. Let us know in the comments section below!

FAQs

1. What is data modeling in Spark?

Data modeling in Spark involves designing schemas and organizing data structures to efficiently process and analyze large datasets. Using Spark’s distributed computing framework, data can be modeled in various formats like structured tables (via DataFrames), semi-structured data (using Datasets), or unstructured data, enabling scalable and optimized data processing.
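
For example, here is a minimal sketch of defining an explicit schema for a DataFrame (the column names and values are illustrative, and an active SparkSession named spark is assumed):

from pyspark.sql.types import StructType, StructField, StringType, FloatType

# Explicit schema: column names, data types, and nullability
schema = StructType([
    StructField('name', StringType(), True),
    StructField('fare', FloatType(), True)])

df = spark.createDataFrame([('Alice', 72.5)], schema=schema)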

2. What is the Spark model?

The Spark model refers to Apache Spark’s distributed computing framework, which processes large-scale data across a cluster of machines. It uses a resilient distributed dataset (RDD) as its core abstraction and supports high-level APIs for DataFrames and Datasets, enabling scalable, in-memory data processing for batch and real-time analytics.

3. Is Spark an ETL?

Apache Spark is not exclusively an ETL tool but is widely used for ETL processes due to its powerful data processing capabilities. It can extract, transform, and load large-scale data efficiently using its distributed computing framework, supporting batch and real-time ETL workflows.
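
As a minimal, illustrative ETL sketch (the file paths and column names are hypothetical, and an active SparkSession named spark is assumed):

# Extract: read raw CSV data
raw = spark.read.csv('raw_events.csv', header=True)

# Transform: keep only successful events and the columns of interest
clean = raw.filter(raw['status'] == 'ok').select('user_id', 'event_time')

# Load: write the result out as Parquet
clean.write.mode('overwrite').parquet('clean_events.parquet')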

Manjiri Gaikwad
Technical Content Writer, Hevo Data

Manjiri is a proficient technical writer and a data science enthusiast. She holds an M.Tech degree and leverages the knowledge acquired through that to write insightful content on AI, ML, and data engineering concepts. She enjoys breaking down the complex topics of data integration and other challenges in data engineering to help data professionals solve their everyday problems.