Hadoop Real-Time Processing & Streaming Simplified 101

on Big Data, Data Processing, Data Streaming, Distributed System, Hadoop, Java, Python, Ubuntu • May 27th, 2022

If you run a large business enterprise, you’ll need to make timely and effective business decisions to avoid losses. To do that, you have to analyze big data as quickly as possible. This is where real-time processing comes in. 

Real-time processing refers to analyzing data immediately after it arrives in the database. Among the many platforms used for this kind of analysis, Hadoop is one of the most popular options. Hadoop provides a distributed file system with very large capacity and uses the MapReduce framework to analyze data.

In this article, you’ll learn how to set up Hadoop real-time streaming and data processing. Before that, you’ll go through the steps to install Hadoop and set up a Hadoop cluster.

What is Hadoop? 

Hadoop is a collection of free, open-source software utilities that process large amounts of data using a cluster of computers. It performs data analysis using the MapReduce framework. MapReduce processes data in three stages: a map stage turns each input record into key-value pairs, a shuffle stage sorts and groups the pairs by key, and a reduce stage combines the values for each key into the final output.

Using the MapReduce method, Hadoop splits the data stored in the Hadoop Distributed File System (HDFS) across the nodes in the cluster. It then sends the processing code to these nodes, so each node processes the data it already holds. Since each node focuses on processing the data allocated to it, the system can process several gigabytes of data in a few minutes.

Because HDFS spreads storage across every node in the cluster, the system can hold terabytes of data at once. This is why Hadoop is well suited to big data analysis. The utility in Hadoop that facilitates this kind of data processing is Hadoop Streaming.

Although Hadoop itself is written in Java, Hadoop Streaming lets you write map and reduce programs in other programming languages, as long as they read from standard input and write to standard output.
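
To make the map, shuffle, and reduce stages concrete, here is a minimal in-memory sketch in Python. It is not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase) and the sample lines are assumptions made for illustration, and everything runs on a single machine.

```python
from collections import defaultdict


def map_phase(records):
    """Map: turn each input line into (word, 1) pairs."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical sample input; on a real cluster the lines would come from HDFS.
lines = ["big data needs big clusters", "data moves to the cluster"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```

On a real cluster, Hadoop runs many copies of the map and reduce steps in parallel on different nodes and performs the shuffle between them for you.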

Key Features of Hadoop

Hadoop offers various benefits that make it superior to other streaming and processing options for big data. Some of these benefits are:

  • Low Probability of Failure: Your data processing is less likely to fail if you use Hadoop. Each node processes only a fraction of the data, but HDFS stores copies of every data block on several nodes. So, when one node fails to analyze its allocated data, a copy can be retrieved and re-processed by another node.
  • Flexibility: Hadoop supports different data types ranging from social media data to website data. Hadoop Streaming also supports various programming languages like Java, PHP, C++, etc.
  • Scalable: Hadoop is suitable for both small and large business enterprises. If you decide to expand your business, you can add nodes to your Hadoop cluster to accommodate the increase in data usage.
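
The fault-tolerance idea can be sketched with a toy simulation. HDFS keeps several copies of each data block on different nodes, so losing one node does not lose data. The round-robin placement below is an assumption chosen for simplicity, not HDFS’s actual placement policy, and the node names are hypothetical.

```python
def place_blocks(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin (a toy policy)."""
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one copy on a live node."""
    return [
        block
        for block, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    ]


# Hypothetical four-node cluster holding six blocks, two copies of each.
nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(num_blocks=6, nodes=nodes, replication=2)
print(len(readable_blocks(placement, failed_nodes={"node2"})))
```

With two copies of each block, every block stays readable after one node fails. With only one copy, the blocks stored on the failed node would be lost.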

Hadoop Real-Time Data Processing/Streaming

Before you can stream data with Hadoop real-time processing, you must do the following:

  • Install Apache Hadoop on your system.
  • Set up a Hadoop Cluster.

How to Install Apache Hadoop

Follow these steps to install Hadoop on your system:

Step 1: Install Ubuntu Desktop

Open the web browser on your computer and search for Download Ubuntu to reach the Ubuntu download page. After downloading Ubuntu Desktop, install it and open a terminal.

Step 2: Download Java

Since Hadoop is written in Java, you’ll need to install Java on your computer. Enter the commands below in your terminal to install Java:

sudo apt-get update
sudo apt-get install default-jdk

The default-jdk package installs Ubuntu’s default version of Java. Check that this version is supported by the Hadoop release you plan to install.

Step 3: Create a New Dedicated User

This guide runs Hadoop under a dedicated user, which owns Hadoop’s data folders and executables. Write the following commands to create a new dedicated user for Hadoop:

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser

The first command creates a group named hadoop, while the second command creates a new dedicated user, hduser, and assigns the user to the hadoop group.

Step 4: Disable IPv6 on Your Computer

IPv6 is enabled by default on most computers. However, Hadoop has historically had trouble binding to IPv6 addresses, so this guide disables IPv6 before installing Hadoop.

  • First, open the sysctl configuration file in an editor:
sudo nano /etc/sysctl.conf
  • Now, add the following lines at the end of the file to disable IPv6:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

After adding these lines, press Ctrl+X to exit nano, press Y to confirm that you want to save the changes, and then press Enter. Reboot, or run sudo sysctl -p, so that the new settings take effect.

Now, input the following command to confirm that you have successfully disabled IPv6:

cat /proc/sys/net/ipv6/conf/all/disable_ipv6

If IPv6 has been disabled, the command will return 1 as the output.
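
If you prefer to run this check from a script, the following is a minimal Python sketch. It assumes the flag lives at /proc/sys/net/ipv6/conf/all/disable_ipv6, as on typical Linux systems, and the helper name ipv6_disabled is hypothetical.

```python
from pathlib import Path

# Assumed location of the kernel flag on Linux; pass another path to try the helper.
IPV6_FLAG = Path("/proc/sys/net/ipv6/conf/all/disable_ipv6")


def ipv6_disabled(flag_path=IPV6_FLAG):
    """Return True if the flag file contains 1, False otherwise, None if it is missing."""
    path = Path(flag_path)
    if not path.exists():
        # The flag file is absent, so the state cannot be determined here.
        return None
    return path.read_text().strip() == "1"


print(ipv6_disabled())
```

A result of True matches the 1 returned by the cat command.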

Step 5: Download SSH and Set up SSH Certificate

Hadoop uses SSH to manage its nodes.

  • To use SSH, you need its two components:
    • ssh: The client, which you use to connect to other machines.
    • sshd: The server (daemon), which lets other machines connect to yours.

Ubuntu ships with the ssh client, so you only need to install and enable sshd. The following command installs the ssh package, which includes the server:
sudo apt-get install ssh

  • After you’ve enabled sshd, set up a passwordless SSH key. First switch to the dedicated user, ‘hduser’, and then generate the key:
su hduser
ssh-keygen -t rsa -P ""

The first command switches to the dedicated Hadoop user. The second generates an RSA key pair with an empty passphrase, so that Hadoop can open SSH connections without prompting for a password.

  • Finally, authorize the new key for logins to this machine.

Enter the following command to append the public key to the list of authorized keys:

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Step 6: Install Apache Hadoop

  • The next step is to download Apache Hadoop. You can download a Hadoop release from www.apache.org/dyn/closer.cgi/hadoop/core in your browser, and then extract the downloaded archive.
  • Input the following command into the terminal to go to /usr/local, the directory that will hold the Hadoop folder:
cd /usr/local
  • Then, move the extracted Hadoop folder into this directory and name it hadoop. In the command below, <extracted-hadoop-folder> is a placeholder for the path of the folder you extracted:

sudo mv <extracted-hadoop-folder> /usr/local/hadoop

  • Next, change the ownership of the folder so that the dedicated user and the hadoop group can access it:
sudo chown -R hduser:hadoop hadoop
  • Open the bash configuration file of the dedicated user:
su - hduser
nano $HOME/.bashrc
  • Finally, add the following environment variables at the end of the file, so that you can refer to the Hadoop and Java directories without typing long paths. Adjust JAVA_HOME if your JDK is installed in a different directory:
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

Step 7: Configure Hadoop to Your System

  • Go to the /usr/local/hadoop/etc/hadoop/ folder and open the hadoop-env.sh file.
  • Next, set the Java variable for Hadoop by setting JAVA_HOME in this file to the path of your JDK:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

Press Ctrl+X, then Y and Enter, to save the file after making the change.

  • Edit the core-site.xml file in the same folder. The core-site.xml file tells Hadoop where the NameNode is located within the cluster.

Add the following configuration to core-site.xml:

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:54310</value>
</property>
</configuration>
  • Then, configure the hdfs-site.xml file. The purpose of this file is to indicate the directories that the NameNode and DataNode will use for storage, and how many copies of each block HDFS keeps. The NameNode manages the file system metadata in the Hadoop Distributed File System (HDFS), while the DataNodes store the data based on the instructions of the NameNode.

Add the following configuration to hdfs-site.xml:

<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/hdfs/datanode</value>
</property>
</configuration>
  • Finally, rename and modify the mapred-site.xml.template file in the Hadoop configuration folder (/usr/local/hadoop/etc/hadoop). This file tells Hadoop where to track your MapReduce jobs. However, Hadoop only reads it under the name ‘mapred-site.xml’, so you must rename it first.

You can configure the file after renaming. Add the following configuration to the mapred-site.xml file:

<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value>localhost:54311</value>
</property>
</configuration>
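
A single unclosed tag in one of these XML files can stop Hadoop from starting, so it can help to parse a file before you use it. The sketch below relies only on Python’s standard xml.etree.ElementTree module. The helper name read_hadoop_properties is hypothetical, and the embedded sample mirrors the core-site.xml settings used in this guide.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Return the <name>/<value> pairs of a Hadoop *-site.xml document as a dict."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    if root.tag != "configuration":
        raise ValueError("root element must be <configuration>")
    properties = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is None or value is None:
            raise ValueError("every <property> needs a <name> and a <value>")
        properties[name.strip()] = value.strip()
    return properties


# Sample text for illustration; in practice you would read the file from disk.
CORE_SITE = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
"""

print(read_hadoop_properties(CORE_SITE))
```

If the file has a mistyped or unclosed tag, the function raises an error instead of returning the properties, which points you to the problem before Hadoop does.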

How to Set Up a Hadoop Cluster

After installing and configuring Hadoop, you still need to set up a Hadoop cluster before you can start Hadoop data streaming. A cluster is a group of computers that are connected to one another and run Hadoop together.

But what if you don’t have several connected computers? You can use Virtual Machines to simulate multiple computers on a single system. Virtualization software like VirtualBox works well for this. You’ll also need to download a guest operating system, like Lubuntu.

Take the following steps to set up a Hadoop cluster:

Step 1: Download VirtualBox

You can download VirtualBox from www.virtualbox.org and install it on your system.

Step 2: Set Up Your Virtual Machine

  • Open the VirtualBox software and create your Virtual Machine. When setting up the machine, allocate about 15GB of storage and 2GB of memory to it.
  • Next, download Lubuntu as the guest operating system for the Virtual Machine. Lubuntu is available for download on Lubuntu.net. Your Virtual Machine will be ready to use after you’ve installed Lubuntu in it.

Step 3: Download Guest Additions

Using a Virtual Machine means that two operating systems, the host and the guest, co-exist on a single computer. To help the two work together smoothly, you’ll need to install the VirtualBox Guest Additions in the guest. Among other things, the Guest Additions enable shared folders between the host and the guest.

Do the following to install the Guest Additions in your Virtual Machine:

  • Install DKMS to accommodate the Guest Additions by typing the following command into your terminal:
sudo apt-get install dkms
  • Insert the VBox Guest Additions file (VBoxGuestAdditions.iso) into the Linux guest’s CD-ROM drive.
  • Now, go to the mounted CD-ROM directory and enter the following command to install the Guest Additions:
sudo sh ./VBoxLinuxAdditions.run

Step 4: Configure Your Virtual Machine’s Network Settings

The next step is to configure your Virtual Machine so that it can access information on your server. 

Here’s how to do that:

  • Click on the VirtualBox menu and select Preferences.
  • Then, choose Network.
  • You’ll see a list of options under Network. Click on Host-only Networks and select Add Driver. A device driver will automatically be added to the Virtual Machine.
  • Next, double-click on the driver when it appears. The system will now prompt you to enter your server details.
  • Input your server details to continue. Then, click on Enable Server Settings.
  • Once the Virtual Machine connects to your server, go back to the VirtualBox Manager.
  • Then, right-click on your Virtual Machine and click on Settings.
  • The Settings tab will appear. Click on Network and choose Adapter 2.
  • On the Adapter 2 tab, mark the box beside Enable Network Adapter. Then, click on the dropdown menu beside Attached to, and select Host-Only Adapter.
  • Now, go to the Name section and click on the dropdown menu. A list of adapters will be shown.

Remember the driver you added to your VM network? That driver will be listed as one of the adapters. Click on the driver’s name. Doing this will link your Virtual Machine to this network.

Step 5: Clone Your Virtual Machine

A Hadoop cluster needs more than one machine. You will need to make one or more clones of your Virtual Machine to build the cluster.

Follow these steps to clone your Virtual Machine:

  • Go to the VirtualBox menu and right-click on your Virtual Machine. The option Clone will appear.
  • Click on Clone and change the name of the new Virtual Machine to Hadoop2.
  • Next, check Reinitialize the MAC address of all the network cards. This will ensure that the clone’s MAC address is different from that of the original Virtual Machine.
  • Click on Continue.
  • The system will now ask you to choose your clone type. Select Full Clone. Then, go to the bottom of the page and click on Clone to finish the process.

Step 6: Assign Static IP Addresses to Your Virtual Machines

By default, Virtual Machines receive their IP addresses dynamically, so the addresses can change from time to time. However, Hadoop nodes locate one another through fixed host names and IP addresses, so a changed address breaks the cluster. For this reason, you need to ensure that the IP addresses of your Virtual Machines always stay the same.

Here’s how to keep your VM’s IP Address static: 

  • Go to the network configuration directory on each Virtual Machine. Then, open the ‘interfaces’ file in an editor. Here are the commands for these actions:
cd /etc/network

sudo nano interfaces

  • Then, add the following lines at the end of the interfaces file:

For hadoop1:

auto eth1
iface eth1 inet static
address 192.168.57.121
netmask 255.255.255.0
network 192.168.57.0

For hadoop2:

auto eth2
iface eth2 inet static
address 192.168.57.122
netmask 255.255.255.0
network 192.168.57.0
  • Also, add the broadcast address to each block of lines. You can find your broadcast address by running the ifconfig command in a terminal:
    ifconfig

This command prints the settings of each network interface. Look for the value labeled broadcast (Bcast on older versions) and record it. Then, add it to the interfaces file as a broadcast line.

For instance, 

broadcast 192.168.57.255
  • Next, open the hosts file in an editor:
sudo nano /etc/hosts
  • Then, add the Virtual Machines as hosts using their static IP addresses.
192.168.57.121 hadoop1
192.168.57.122 hadoop2
  • Finally, reboot the machines and run Hadoop.
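
The network and broadcast values follow directly from the address and the netmask, so you can compute them instead of looking them up. Here is a minimal sketch using Python’s standard ipaddress module, assuming a 255.255.255.0 netmask for the 192.168.57.x addresses above. The helper name describe_interface is hypothetical.

```python
import ipaddress


def describe_interface(address, netmask):
    """Derive the network and broadcast addresses for a static IPv4 setup."""
    # ip_interface raises a ValueError subclass if the netmask is not valid.
    iface = ipaddress.ip_interface(f"{address}/{netmask}")
    return {
        "address": str(iface.ip),
        "netmask": str(iface.network.netmask),
        "network": str(iface.network.network_address),
        "broadcast": str(iface.network.broadcast_address),
    }


for host, address in [("hadoop1", "192.168.57.121"), ("hadoop2", "192.168.57.122")]:
    print(host, describe_interface(address, "255.255.255.0"))
```

Both machines share the network 192.168.57.0 and the broadcast address 192.168.57.255. The function also rejects values that are not valid netmasks, which catches typing mistakes before they reach the interfaces file.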

Importing Data into Hadoop

After you’ve installed and configured Hadoop, the next step is to import your data into the system.

Follow the steps below to load data into HDFS:

  • Log in to the NameNode machine as hduser and open a terminal. HDFS is reached through the address set in core-site.xml (hdfs://localhost:54310 in this guide), which Hadoop commands use automatically.
  • Format your HDFS file system by entering the following command on the NameNode. Do this only once, because formatting erases any data already in HDFS:
$ $HADOOP_HOME/bin/hadoop namenode -format
  • Start HDFS with the command below:
$ $HADOOP_HOME/sbin/start-dfs.sh
  • Create an input directory in HDFS using the command below:
$ $HADOOP_HOME/bin/hadoop fs -mkdir -p /user/input
  • Now, copy your data file from your local directory into HDFS. For example:
$ $HADOOP_HOME/bin/hadoop fs -put /home/file /user/input

Your data file will now appear in HDFS under /user/input, where MapReduce jobs can read it as input.

Streaming on Hadoop

Now that you’ve completed the pre-requisites, you can start Hadoop real-time data streaming.

Before then, let’s explain how Hadoop Real-Time Streaming works.

Hadoop Streaming processes data using the MapReduce framework. This framework involves 2 parts: the mapper, which takes the raw input, and the reducer, which produces the final output. Both are ordinary executables. When a job runs, Hadoop Streaming feeds the input to the mapper line by line through stdin (standard input). The mapper processes each line and writes its results as lines to stdout (standard output).

Next, Hadoop Streaming collects the lines from the mapper’s stdout, converts them into key value pairs, sorts the pairs by key, and sends them to the reducer.

The reducer receives the key value pairs as lines through stdin and processes them. Finally, Hadoop Streaming converts the lines that the reducer writes to stdout back into key value pairs and stores them as the final output.

Please note that Hadoop Streaming treats the part of each line up to the first tab character as the key and the rest of the line as the value. If a line contains no tab, the whole line becomes the key and the value is empty.
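
The tab rule is easy to try out on its own. The following Python sketch mimics how a line is split into a key and a value at the first tab. The helper name split_key_value is hypothetical, and this is an illustration of the rule rather than Hadoop’s own code.

```python
def split_key_value(line, separator="\t"):
    """Split one output line into (key, value) at the first separator."""
    line = line.rstrip("\n")
    key, found, value = line.partition(separator)
    # Without a separator, the whole line is the key and there is no value.
    return (key, value if found else None)


print(split_key_value("hadoop\t1"))
print(split_key_value("a\tb\tc"))
print(split_key_value("no-tab-here"))
```

Only the first tab matters: in the second example the key is a, and the value keeps its own tab.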

Now, let’s show you how this works in practice. We’ll be using the Wordcount program and Python to do this:

  • The mapper collects data through stdin, and changes the lines to key value pairs. These key value pairs consist of each individual word as the key, and 1 as the value. To implement this function, type the following code into an editor:
#!/usr/bin/env python

import sys

# Read the input from stdin, one line at a time
for line in sys.stdin:
    # Remove leading and trailing whitespace
    line = line.strip()
    # Split the line into individual words
    words = line.split()
    # Write each word with a count of 1 to stdout, separated by a tab
    for word in words:
        print('%s\t%s' % (word, 1))

After typing the code, save it in the file mapper.py. Ensure that the file can be read and executed, for example by running chmod +x mapper.py.

  • The reducer receives the key value output from the mapper as lines, sorted by key. It splits each line back into a word and a count, and adds up the number of times each word occurs throughout the dataset. Then it writes key value pairs where each word is the key, and the number of occurrences is the value.

The Python code for the reducer is as follows:

#!/usr/bin/env python

import sys

current_word = None
current_count = 0

# Collect the input from the mapper, one line at a time
for line in sys.stdin:
    line = line.strip()
    try:
        # Split the line back into the word and its count
        word, count = line.split('\t', 1)
        # Convert the count from string to int
        count = int(count)
    except ValueError:
        # Skip lines that are not in the word<TAB>count format
        continue

    # The input is sorted by key, so equal words arrive one after another
    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            # Release the total for the previous word
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Release the total for the last word
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))

Save this code as reducer.py and make it executable as well. When you run the job with Hadoop Streaming, naming mapper.py as the mapper and reducer.py as the reducer, Hadoop writes the word counts to the job’s output directory in HDFS.
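
Before submitting a job to the cluster, it can be useful to check the logic locally. The sketch below imitates what Hadoop Streaming does around the two scripts: it runs the map step, sorts the lines by key, and runs the reduce step. It is a stand-in written for this article, not part of Hadoop, and the function names and sample lines are assumptions.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as mapper.py writes to stdout."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share a key, as reducer.py does."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Stand-in for the streaming job: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


# Hypothetical sample input.
sample = ["to be or not to be", "that is the question"]
for output_line in run_locally(sample):
    print(output_line)
```

You can run the same kind of check on the real scripts from a terminal by piping a small text file through mapper.py, sort, and reducer.py in that order.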

Conclusion

In this article, you learned about Hadoop, how to install Hadoop on Ubuntu, and how to set up a Hadoop cluster. You also learned how to import data into HDFS and process it with Hadoop Streaming. Once you complete the prerequisites, you should be able to stream your data through Hadoop without hassle. Master the setup first, and soon, you’ll be streaming data like a pro.
