Businesses around the globe have realised the importance of collecting and analysing data to make informed and data-driven decisions for better growth. Each second a colossal amount of data is generated whose real-time analysis can give you an edge over your competitors. To streamline this process, you can opt for Apache Spark. It provides an Easy-to-Use, Fast, Scalable, and Unified Engine for large-scale Data Analytics.
Apache Spark has especially been a popular choice among developers as it allows them to build applications in various languages such as Java, Scala, Python, and R. You can easily submit Spark REST API Jobs and enjoy the processing capabilities of the analytics engine powered by Apache Spark. Using the Spark REST API, you can run an application that can perform various jobs associated with Data Science, Data Transformation, or Machine Learning.
In this article you will learn how to effectively submit a Spark REST API job, check the job status, and delete it.
Introduction to Apache Spark
Apache Spark is an Open-Source, lightning-fast Distributed Data Processing System for Big Data and Machine Learning. It began as a research project at UC Berkeley's AMPLab in 2009 and reached its 1.0 release as an Apache project in 2014. Used by big enterprises such as Netflix, eBay, and Yahoo, Apache Spark processes and analyses Petabytes of data on clusters of over 8000 nodes. Utilizing Memory Caching and Optimal Query Execution, Spark can take on multiple workloads such as Batch Processing, Interactive Queries, Real-Time Analytics, Machine Learning, and Graph Processing.
Spark was built to overcome the challenges developers faced with MapReduce, the disk-based computational engine at the core of early Hadoop clusters. Unlike MapReduce, which writes intermediate results to disk between steps, Spark keeps the working dataset in memory until the job is completed, avoiding much of that expensive I/O. It has become a favorite among developers for its concise APIs, which let them write applications in Scala, Python, Java, and R. With Built-in parallelism and Fault Tolerance, Spark has helped businesses deliver some of the most cutting-edge Big Data and AI use cases.
Key Features of Apache Spark
Over the years Spark has evolved and provided rich features to make Data Analytics a seamless process. Some of its salient features are as follows:
- Accelerated Processing Capabilities: Spark processes data as Resilient Distributed Datasets (RDDs) and performs far fewer disk I/O operations than MapReduce. It can run workloads up to 100x faster in memory and up to 10x faster on disk. It has also been used to sort 100 TB of data 3 times faster than Hadoop MapReduce on one-tenth of the machines.
- Ease of Use: Spark supports a wide variety of programming languages for writing scalable applications. It also provides over 80 high-level operators that make it easier to build parallel apps. You can also reuse the same code for batch processing, joining streams against historical data, or running ad-hoc queries on stream state.
- Advanced Analytics: Spark can assist in performing complex analytics including Machine Learning and Graph processing. Spark’s brilliant libraries such as SQL & DataFrames and MLlib (for ML), GraphX, and Spark Streaming have seamlessly helped businesses tackle sophisticated problems. You also get better speed for analytics as Spark stores data in the RAM of the servers which is easily accessible.
- Fault Tolerance: Owing to Spark’s RDDs, Apache Spark can handle worker node failures in your cluster without losing data. Each RDD records the sequence of transformations used to build it, so lost partitions can be recomputed by rerunning those steps after a failure.
- Real-Time Processing: Unlike MapReduce where you could only process data present in the Hadoop Clusters, Spark’s language-integrated API allows you to process and manipulate data in real-time.
Spark has an ever-growing community of developers from 300+ companies that constantly contributes new features and improves Apache Spark’s performance. To reach out to them, you can visit the Spark Community page.
Introduction to REST APIs
An Application Programming Interface (API) is a set of rules that define how devices or applications connect to and communicate with each other. The API describes the correct way for you to write a program that requests services from an operating system or another application.
A REST API is an API that follows the design rules of REST (Representational State Transfer Architectural Style). This is generally employed for Web APIs that use HTTP requests to access and use data.
REST APIs can be built with various programming languages such as JavaScript, Python, etc. Unlike SOAP APIs, which exchange XML, REST APIs can return XML, JSON, YAML, or any other format depending on what the client requests. REST APIs also tend to consume less bandwidth, making them a faster and more lightweight choice for Internet usage. Spark’s REST submission API, used in the steps below, exchanges JSON.
Prerequisites
- Knowledge of APIs.
- Knowledge of Apache Spark.
Steps to set up Spark REST API job
You can run a Spark REST API job by using Spark’s Standalone mode. A core REST convention is that requesting a URL (endpoint) returns a piece of data (resource). You can send such requests with the command-line utility cURL. The four most commonly used request methods can be set in curl by writing -X or --request, followed by the request method:
- GET: It performs a READ operation, where the server searches for the data you requested and sends it back to you.
- POST: This performs a CREATE operation, where the server creates a new entry in the database and informs you if the process was successful.
- PUT or PATCH: These requests execute an UPDATE operation, where the server updates a data entry in the database and informs you if the update was done correctly.
- DELETE: This request performs a DELETE operation, where the server deletes a data entry in the database and notifies you if the job was successful.
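As a rough illustration of how these methods map onto curl, the sketch below wraps curl in a small helper. The function name rest_call and the DRY_RUN switch are made up for this example, and the example.com URLs are placeholders rather than real endpoints.

```shell
# rest_call METHOD URL [JSON_BODY]
# Hypothetical helper: sends one HTTP request with curl.
# With DRY_RUN=1 it only prints the command it would run
# (arguments are shown unquoted, so the output is for reading, not pasting).
rest_call() {
  method="$1"
  url="$2"
  body="${3:-}"

  # Build the curl argument list: -X selects the request method.
  set -- curl -s -X "$method" "$url"

  # Attach a JSON body only when one was given (typical for POST, PUT, PATCH).
  if [ -n "$body" ]; then
    set -- "$@" --header "Content-Type: application/json" --data "$body"
  fi

  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

For instance, running DRY_RUN=1 rest_call DELETE http://example.com/items/1 prints curl -s -X DELETE http://example.com/items/1 without contacting any server, which is a convenient way to check a request before sending it.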
Working with the Spark REST API involves the following 3 steps:
Step 1: Submit a Spark REST API Job
By following the easy steps given below you can run a Spark REST API Job:
- Step 1: First, enable the REST API service by adding the following configuration to the spark-defaults.conf file.
spark.master.rest.enabled true
- Step 2: Restart the master and worker services so that the change takes effect.
./sbin/start-master.sh
./sbin/start-slave.sh spark://192.168.1.1:7077
- Step 3: To make sure the cluster is running, open the master web UI at the URL below, replacing the IP address with that of your own master node.
http://192.168.1.1:8080
- Step 4: Now, to submit your application to the cluster, you can use the REST API endpoint /v1/submissions/create. In the request body, set mainClass to the class you want to run, appArgs to any command-line arguments, and appResource to the location of the JAR file. Using the curl -X POST command, you can send the create request to the server.
curl -X POST http://192.168.1.1:6066/v1/submissions/create \
  --header "Content-Type:application/json;charset=UTF-8" \
  --data '{
    "appResource": "/home/hduser/sparkbatchapp.jar",
    "sparkProperties": {
      "spark.executor.memory": "8g",
      "spark.master": "spark://192.168.1.1:7077",
      "spark.driver.memory": "8g",
      "spark.driver.cores": "2",
      "spark.eventLog.enabled": "false",
      "spark.app.name": "Spark REST API - PI",
      "spark.submit.deployMode": "cluster",
      "spark.jars": "/home/user/spark-examples_versionxx.jar",
      "spark.driver.supervise": "true"
    },
    "clientSparkVersion": "2.4.0",
    "mainClass": "org.apache.spark.examples.SparkPi",
    "environmentVariables": {
      "SPARK_ENV_LOADED": "1"
    },
    "action": "CreateSubmissionRequest",
    "appArgs": [
      "80"
    ]
  }'
After the request is submitted successfully, the server returns a response containing the submission ID (the submissionId field), which you will need for the status and kill requests.
{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20200923223841-0001",
  "serverSparkVersion" : "2.4.0",
  "submissionId" : "driver-20200923223841-0001",
  "success" : true
}
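If you want to script the submission, a minimal sketch could look like the following. It assumes the JSON request body shown above has been saved to a file (request.json in the usage line is a hypothetical name), and the function names submit_job and submission_id are invented for this example. The sed pattern is a simplification that avoids a dependency on a JSON tool; a real JSON parser such as jq would be more robust.

```shell
# Assumed address of the standalone master's REST server (same as in the example above).
SPARK_REST_URL="${SPARK_REST_URL:-http://192.168.1.1:6066}"

# submit_job FILE: POST the JSON request stored in FILE to the create endpoint.
submit_job() {
  curl -s -X POST "$SPARK_REST_URL/v1/submissions/create" \
    --header "Content-Type:application/json;charset=UTF-8" \
    --data @"$1"
}

# submission_id: read a CreateSubmissionResponse on stdin and print the
# value of its "submissionId" field (prints nothing if the field is absent).
submission_id() {
  sed -n 's/.*"submissionId"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

You could then capture the ID for later requests with id="$(submit_job request.json | submission_id)".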
Step 2: Check the Spark REST API Job Status
You can check the status of your Spark REST API job by using curl to send a GET request to the /v1/submissions/status/ endpoint, with your submission ID appended to the URL.
The driverState field of the response reports the state of the job, which may be one of the following:
- Waiting: The Job has been created and is waiting for resources to be allocated (the API reports this state as SUBMITTED).
- Running: The Job is up and running.
- Failed: The request was submitted successfully but the application execution failed.
- Finished: The request was submitted and the application was successfully executed.
- Unknown: Job submission was successful but an error occurred while getting the state of the application.
{
  "action" : "SubmissionStatusResponse",
  "driverState" : "FINISHED",
  "serverSparkVersion" : "2.4.0",
  "submissionId" : "driver-20200923223841-0001",
  "success" : true,
  "workerHostPort" : "192.168.1.1:38451",
  "workerId" : "worker-20200923223841-192.168.1.2-34469"
}
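The status check can be sketched as a few shell functions. This is only an illustration: the function names are invented here, the master address is the one used in the earlier example, and the polling loop assumes that SUBMITTED and RUNNING are the states that mean the job is still in progress.

```shell
# Assumed address of the standalone master's REST server.
SPARK_REST_URL="${SPARK_REST_URL:-http://192.168.1.1:6066}"

# status_url ID: print the status endpoint for a submission ID.
status_url() {
  printf '%s/v1/submissions/status/%s\n' "$SPARK_REST_URL" "$1"
}

# job_status ID: fetch the SubmissionStatusResponse with a GET request.
job_status() {
  curl -s -X GET "$(status_url "$1")"
}

# driver_state: read a status response on stdin and print its driverState value.
driver_state() {
  sed -n 's/.*"driverState"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# wait_for_job ID [SECONDS]: poll every SECONDS (default 5) while the job
# looks in progress, then print the last state seen. An empty state (for
# example, when the server is unreachable) also ends the loop.
wait_for_job() {
  while :; do
    state="$(job_status "$1" | driver_state)"
    case "$state" in
      SUBMITTED|RUNNING) sleep "${2:-5}" ;;
      *) printf '%s\n' "$state"; return 0 ;;
    esac
  done
}
```

Calling wait_for_job driver-20200923223841-0001 10 would then check the job every 10 seconds until it reports a state other than SUBMITTED or RUNNING.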
Step 3: Delete a Spark REST API Job
Sometimes you may need to kill a Spark REST API job. To do so, use curl to send a POST request to the /v1/submissions/kill/ endpoint, with your submission ID appended to the URL.
You may get the following response:
{
  "action" : "KillSubmissionResponse",
  "message" : "Kill request for driver-20200923223841-0001 submitted",
  "serverSparkVersion" : "2.4.0",
  "submissionId" : "driver-20200923223841-0001",
  "success" : true
}
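A kill request can be scripted in the same style. The sketch below is an assumption-laden example rather than a definitive recipe: the function names are invented, the master address is the one from the earlier example, and the success check relies on a simple text match instead of a JSON parser.

```shell
# Assumed address of the standalone master's REST server.
SPARK_REST_URL="${SPARK_REST_URL:-http://192.168.1.1:6066}"

# kill_url ID: print the kill endpoint for a submission ID.
kill_url() {
  printf '%s/v1/submissions/kill/%s\n' "$SPARK_REST_URL" "$1"
}

# kill_job ID: send the kill request as a POST with an empty body.
kill_job() {
  curl -s -X POST "$(kill_url "$1")"
}

# kill_accepted: read a KillSubmissionResponse on stdin and return exit
# status 0 if its "success" field is true, non-zero otherwise.
kill_accepted() {
  grep -q '"success"[[:space:]]*:[[:space:]]*true'
}
```

For example, kill_job driver-20200923223841-0001 | kill_accepted && echo "kill request accepted". Note that, going by the message in the response above, success here means the kill request was submitted, so it is worth checking the job status afterwards to confirm the driver has actually stopped.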
Conclusion
In this article, you have learned how to set up a Spark REST API Job, check the job status and delete it. Spark provides a unified platform for all your Data Science, Machine Learning, and Streaming Data Real-Time analysis requirements. Its Fault-Tolerant Architecture, High Processing Speed, and User Friendliness are the core features that attract developers to build their applications. Spark’s standalone cluster mode can be effectively used to run all your Spark API Jobs comfortably.
To get a clear understanding of the Spark data model, check out our comprehensive guide on the Spark data model. It covers essential concepts and practical insights.
Today, businesses use a minimum of 10 SaaS applications across various departments. Given the exponential rate at which data is generated, efficiently handling this massive volume of data from all these sources by manually connecting to these ever-evolving individual APIs can be a challenging task. You would need to invest a portion of your Engineering bandwidth to efficiently Integrate, Clean, Transform and Load your data into a Data Warehouse for further business analysis. All of this can be comfortably automated by a Cloud-Based ETL Tool like Hevo Data.
Hevo Data, a No-code Data Pipeline can seamlessly transfer your data from a vast sea of sources like REST APIs into your Data Warehouse or a destination of your choice for free. It is a reliable, secure, and fully automated service that does not require you to write any code!
If you are using REST APIs in your business and looking for a stress-free alternative to Manual Data Integration, then Hevo can effortlessly automate this for you. Hevo provides a Native REST API connector that allows loading data from non-native or custom sources for free, without writing any code. Hevo with its strong integration with 100+ Sources & BI tools (Including 40+ Free Sources such as REST APIs), allows you to not only export & load Data but also transform & enrich your Data & make it analysis-ready.
Want to take Hevo for a ride? Sign Up for a 14-day free trial and simplify your Data Integration process. Check out the pricing details to get a better understanding of which plan suits you the most.
Frequently Asked Questions
1. What is the Spark REST API?
The Spark REST API provides a way to interact with a Spark cluster programmatically.
2. How do you call a REST API from PySpark?
Use Python libraries like requests to make HTTP requests and process the responses within a PySpark application.
3. What is meant by Spark API?
The Spark API refers to the set of interfaces and libraries provided by Apache Spark for interacting with its features and performing data processing tasks.
Sanchit Agarwal is an Engineer turned Data Analyst with a passion for data, software architecture and AI. He leverages his diverse technical background and 2+ years of experience to write content. He has penned over 200 articles on data integration and infrastructures, driven by a desire to empower data practitioners with practical solutions for their everyday challenges.