Working with the WebHDFS REST API

Apache Spark, API, Big Data, REST API, Tutorials • October 18th, 2021

Apache Hadoop ships with native libraries for accessing HDFS. They work well for applications running inside the Hadoop cluster. External applications, however, may also need to interact with HDFS: to create directories, write files to them, or read the contents of stored files. Hadoop’s native libraries are a poor fit for such clients, because they require Hadoop to be installed wherever the application runs. That is where the WebHDFS REST API comes in. It lets external applications work with HDFS data over HTTP.

Web services are popular today as they facilitate the exchange of data across applications. They are becoming even more important as applications generate more data. A number of APIs (Application Programming Interfaces) have been developed to expose web services. REST (REpresentational State Transfer) is the most widely used architectural style for building such APIs.

HDFS (Hadoop Distributed File System) is one of the major components of Hadoop. It is a distributed file system designed to run on commodity hardware. HDFS lets a single Hadoop cluster scale to hundreds or thousands of nodes, which speeds up the processing of large data sets.

In this article, we will be discussing how to use this REST API. You will learn what it is and how to use it to perform different operations. 

Prerequisites

This is what you need for this article:

  • Apache Hadoop.

Part 1: Understanding the WebHDFS REST API

Hadoop has a native Java API that supports file system operations such as creating, deleting, and renaming files and directories; opening, reading, and writing files; setting permissions; and more. This API serves applications that run inside the Hadoop cluster well. External applications that need to perform the same operations, such as creating a directory and writing files to it, need a different kind of API. Hortonworks built one that exposes these features through standard REST conventions.

WebHDFS is a REST API that supports HTTP operations like GET, POST, PUT, and DELETE. It allows client applications to access HDFS data and execute HDFS operations via HTTP or HTTPS. It offers the following features:

  • Read and write access- This REST API supports all HDFS operations including granting permissions, accessing block location, configuring replication factor, and more. 
  • HDFS parameters- It supports all HDFS parameters and their default values.
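  • Authentication- When security is turned off, requests identify the caller through the user.name query parameter. When security is turned on, this REST API authenticates requests with Kerberos (SPNEGO) or Hadoop delegation tokens.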
  • Multiple languages- It allows clients to access HDFS using different languages without the need to install Hadoop. It can also be used together with tools like wget and curl to access HDFS. 
  • Open-source- It is part of the open-source Apache Hadoop project, so it is free to use.

In the next section, we will be discussing how to accomplish various tasks in this REST API. 

Part 2: How to Work with the WebHDFS REST API

In this section, we will be discussing how to use this REST API. 

Common operations such as creating, listing, and deleting directories or opening and deleting files are straightforward. Each one maps to a value of the op=<operation_type> query parameter in the WebHDFS URL, sent with the matching HTTP method.
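The URL pattern described above can be sketched as a small shell helper. This is only an illustration: the function name webhdfs_url is made up for this example, and the host, port, and user name are the sample values used in the commands later in this article, not required settings.

```shell
# Illustrative values taken from the examples in this article (assumptions).
WEBHDFS_HOST="localhost:50070"
WEBHDFS_USER="ncsam"

# Hypothetical helper: build a WebHDFS URL from an HDFS path, an operation
# name, and an optional string of extra query parameters.
webhdfs_url() {
  local path="$1" op="$2" extra="${3:-}"
  local url="http://${WEBHDFS_HOST}/webhdfs/v1${path}?user.name=${WEBHDFS_USER}&op=${op}"
  if [ -n "$extra" ]; then
    url="${url}&${extra}"
  fi
  printf '%s\n' "$url"
}

# Print two example URLs; nothing is sent over the network here.
webhdfs_url /tmp GETFILESTATUS
webhdfs_url /tmp/webhdfs RENAME "destination=/tmp/new-webhdfs"
```

The printed URL can then be passed to curl with the HTTP method that the operation expects, for example GET for GETFILESTATUS and PUT for RENAME.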

Enabling this REST API

To start using this REST API, you should first enable it. The good news is that only a simple configuration is needed to enable it. You only have to add the following property to the hdfs-site.xml file:

<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>

Checking Directory Status

You can invoke this REST API with the curl tool. The following command checks the status of the /tmp directory. It targets port 50070, the default NameNode HTTP port in Hadoop 2.x (Hadoop 3.x uses 9870 by default):

curl -i "http://localhost:50070/webhdfs/v1/tmp?user.name=ncsam&op=GETFILESTATUS"

If everything is okay, the request returns the HTTP status code 200 together with a JSON FileStatus object that describes the directory.
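A GETFILESTATUS reply carries its details in a JSON FileStatus object, and a script usually wants just one or two fields from it. The sketch below pulls string fields out of a sample body with sed. The sample values are illustrative rather than captured from a real cluster, and json_string_field is a hypothetical helper; a real JSON parser such as jq would be more robust than this pattern match.

```shell
# Sample body shaped like a GETFILESTATUS response (values are illustrative).
sample='{"FileStatus":{"accessTime":0,"blockSize":0,"group":"supergroup","length":0,"modificationTime":1320173277227,"owner":"ncsam","pathSuffix":"","permission":"777","replication":0,"type":"DIRECTORY"}}'

# Hypothetical helper: print the value of one string-valued field.
# $1 is the field name, $2 is the JSON text.
json_string_field() {
  printf '%s' "$2" | sed -n "s/.*\"$1\":\"\([^\"]*\)\".*/\1/p"
}

json_string_field type "$sample"    # prints DIRECTORY
json_string_field owner "$sample"   # prints ncsam
```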

Creating a Directory

You can create a new directory by sending a PUT request with op=MKDIRS. The following command demonstrates this:

curl -i -X PUT "http://localhost:50070/webhdfs/v1/tmp/webhdfs?user.name=ncsam&op=MKDIRS"

If the directory is created successfully, you will get the status code 200 and the JSON body {"boolean": true}.

Creating a File

You can also use this REST API to create a new file in HDFS. This takes two steps: first send a request to the namenode, which does not accept the file data itself, and then send the data to the datanode that the namenode points you to.

Let’s run the first command:

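curl -i -X PUT "http://localhost:50070/webhdfs/v1/tmp/webhdfs/myfile.txt?user.name=ncsam&op=CREATE"

The namenode answers with a 307 Temporary Redirect whose Location header contains the URL of a datanode. The second command sends the contents of the local file webhdfs-test.txt to that URL (the -T option uploads with PUT), which creates the file: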

curl -i -T webhdfs-test.txt "http://ncsam-pc:50075/webhdfs/v1/tmp/webhdfs/myfile.txt?op=CREATE&user.name=ncsam&overwrite=false"

The command creates a new file named myfile.txt in HDFS and returns the status code 201 (Created). To confirm that the file exists, we can run the Hadoop filesystem command shown below:

bin/hadoop fs -ls /tmp/webhdfs

You should find the file myfile.txt in the directory. 
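When scripting the two-step file creation, the datanode URL does not have to be copied by hand: it can be read from the Location header of the first response. The following is a minimal sketch of that idea. location_from_headers is a hypothetical helper, and the sample headers are illustrative, shaped like the namenode's redirect reply rather than captured from a real cluster.

```shell
# Hypothetical helper: read HTTP response headers on stdin and print the
# value of the first Location header, with carriage returns removed.
location_from_headers() {
  tr -d '\r' | sed -n 's/^[Ll]ocation: *//p' | head -n 1
}

# Illustrative headers shaped like the namenode's reply to op=CREATE.
sample_headers='HTTP/1.1 307 TEMPORARY_REDIRECT
Location: http://ncsam-pc:50075/webhdfs/v1/tmp/webhdfs/myfile.txt?op=CREATE&user.name=ncsam&overwrite=false
Content-Length: 0'

# Prints only the datanode URL from the sample headers.
printf '%s\n' "$sample_headers" | location_from_headers

# Against a live cluster the two steps could be chained like this (not run here):
#   url=$(curl -s -i -X PUT "http://localhost:50070/webhdfs/v1/tmp/webhdfs/myfile.txt?user.name=ncsam&op=CREATE" | location_from_headers)
#   curl -i -X PUT -T webhdfs-test.txt "$url"
```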

Opening and Reading a File

To open and read an HDFS file using this API, we should use the curl command with the -L option to follow the temporary HTTP redirect URL. The following command demonstrates this:

curl -i -L "http://localhost:50070/webhdfs/v1/tmp/webhdfs/myfile.txt?op=OPEN&user.name=ncsam"

Because of the -i option, the output includes the HTTP response headers (for both the redirect and the final response), followed by the contents of the file.

Renaming a Directory

To rename an HDFS directory, send a PUT request with op=RENAME and pass the new path in the destination parameter. The following request demonstrates this:

curl -i -X PUT "http://localhost:50070/webhdfs/v1/tmp/webhdfs?op=RENAME&user.name=ncsam&destination=/tmp/new-webhdfs"

You can confirm whether the renaming was done successfully by running the following Hadoop filesystem command:

bin/hadoop fs -ls /tmp

You will see that the directory /tmp/webhdfs has been renamed to /tmp/new-webhdfs. 
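Operations such as RENAME and MKDIRS report their outcome as a small JSON body of the form {"boolean": true}. An HTTP status of 200 may still come with a false value, so a script can check the body as well. Here is a minimal sketch of such a check, using a hypothetical helper named webhdfs_succeeded and hard-coded sample bodies instead of a live response.

```shell
# Hypothetical helper: succeed only if the response body reports
# {"boolean":true}. Spaces and newlines are stripped before matching.
webhdfs_succeeded() {
  printf '%s' "$1" | tr -d ' \n' | grep -q '"boolean":true'
}

# Sample bodies stand in for the output of a real curl call.
if webhdfs_succeeded '{"boolean": true}'; then
  echo "rename ok"
fi

if ! webhdfs_succeeded '{"boolean": false}'; then
  echo "rename failed"
fi
```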

Deleting a Directory

By default, the DELETE operation does not remove a non-empty directory: its recursive parameter defaults to false, so a request against a directory that still has contents fails with an exception and nothing is deleted.

Thus, we should first delete the file in the directory, after which we can delete the directory itself. Let’s first delete the file:

curl -i -X DELETE "http://localhost:50070/webhdfs/v1/tmp/new-webhdfs/myfile.txt?op=DELETE&user.name=ncsam"

The above command will delete the file named myfile.txt. Next, let’s issue the following request to delete the directory:

curl -i -X DELETE "http://localhost:50070/webhdfs/v1/tmp/new-webhdfs?op=DELETE&user.name=ncsam"
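WebHDFS's DELETE operation also accepts an optional recursive parameter, which defaults to false. Setting it to true removes a directory together with its contents in a single request. The sketch below only builds and prints such a URL. delete_url is a hypothetical helper, and the host, port, and user name are the sample values from this article.

```shell
# Hypothetical helper: build a DELETE URL for an HDFS path.
# The second argument sets recursive (defaults to false).
delete_url() {
  local path="$1" recursive="${2:-false}"
  printf 'http://localhost:50070/webhdfs/v1%s?op=DELETE&user.name=ncsam&recursive=%s\n' "$path" "$recursive"
}

# Print the URL for a recursive delete; nothing is deleted here.
delete_url /tmp/new-webhdfs true

# Against a live cluster (not run here, and destructive):
#   curl -i -X DELETE "$(delete_url /tmp/new-webhdfs true)"
```

A recursive delete cannot be undone through this API, so it is worth printing and checking the URL before sending it.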

That is how to perform different operations on HDFS using this REST API.

Conclusion

In this article, you learned what WebHDFS is and how to use it to check, create, read, rename, and delete files and directories in HDFS.
