Hadoop Cluster on AWS

Akash Pandey
Oct 11, 2020 · 4 min read
Apache Hadoop

What is Hadoop?

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

Hadoop’s four core elements are:

  • Hadoop Distributed File System (HDFS)
  • Hadoop MapReduce
  • Hadoop YARN
  • Hadoop Common

What is a Hadoop Cluster?

A Hadoop cluster is a collection of commodity computers interconnected and working together as a single unit. It is designed to store, manage, and analyze petabytes of data with remarkable agility, which makes it well suited to big-data workloads.

Hadoop Cluster

What is Big Data, and Why is It Important? Why Hadoop?

Advantages of a Hadoop Cluster

  • Scalable
  • Cost-effective
  • Flexible
  • Fast
  • Resilient to Failure

Here, I set up a Hadoop cluster on the AWS cloud. Since I don’t have to provision my own hardware, I can use storage and compute from AWS and pay only for what I use, which makes the setup both cost-effective and scalable.

How to Launch Instances on the AWS Cloud?
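
If you prefer the command line to the AWS web console, here is a minimal sketch using the AWS CLI. The AMI ID, key pair, and security group below are placeholders, not the ones used in this setup:

```bash
# Launch EC2 instances for the cluster (1 NameNode, 8 DataNodes, 1 client).
# The AMI ID, key name, and security group are placeholders — use your own.
aws ec2 run-instances \
    --image-id ami-0abcdef1234567890 \
    --instance-type t2.micro \
    --count 10 \
    --key-name my-hadoop-key \
    --security-group-ids sg-0123456789abcdef0 \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=hadoop-node}]'
```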

How to Set Up the Hadoop Cluster?
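
A rough sketch of the configuration on each node, assuming a Hadoop 1.x-style setup (Hadoop 2+/3 uses fs.defaultFS and `hdfs --daemon start` instead). The NameNode IP 1.2.3.4 and the /nn directory are placeholders:

```bash
# core-site.xml on every node — points at the NameNode (placeholder IP 1.2.3.4).
cat > $HADOOP_HOME/conf/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://1.2.3.4:9001</value>
  </property>
</configuration>
EOF

# hdfs-site.xml on the NameNode — metadata directory (placeholder /nn).
# On each DataNode, set dfs.data.dir to a local storage directory instead.
cat > $HADOOP_HOME/conf/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
EOF

# Format the NameNode once, then start the daemons.
hadoop namenode -format          # on the NameNode only
hadoop-daemon.sh start namenode  # on the NameNode
hadoop-daemon.sh start datanode  # on each DataNode
```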

Here, I proved two things:

1st

✴️ Suppose a client uploads a file (for example, f.txt) of size 50 MB, and the replication factor is 3.

✴️ Does the client send the entire file to the master, or does the master provide the IP addresses of the DataNodes so that the client can upload the file to them directly?

✴️ Who actually uploads the file?

✴️ Answer: The client gets the DataNode IPs from the master and uploads the file to the DataNodes itself.
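
A quick way to exercise this write path from the client (assuming the cluster above is running and f.txt exists locally):

```bash
# Upload from the client. The NameNode only hands back block locations;
# the client streams the data straight to the DataNodes.
hadoop fs -put f.txt /f.txt
```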

2nd

✴️ Does the client go through the master to read a file stored on a slave, or does the client go to the slave directly and read the data?

✴️ Answer: The client goes to the slave directly and reads the data stored on it.
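
And the corresponding read from the client (same hypothetical file):

```bash
# Read the file back. After fetching block locations from the NameNode,
# the client pulls the bytes directly from the DataNodes.
hadoop fs -cat /f.txt
```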

Here, I set up the Hadoop cluster with 8 DataNodes, 1 client, and 1 NameNode.

Here, the client uploads a file larger than 50 MB; HDFS stores it on three different DataNodes, creating three replicas by default, as shown in the figure.
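
One way to confirm the block placement and replica count (using the hypothetical /f.txt from above; on Hadoop 2+/3 the command is `hdfs fsck`):

```bash
# List each block of the file, its replica count, and which DataNodes hold it.
hadoop fsck /f.txt -files -blocks -locations
```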

Here, when the client reads the file, it goes directly to the slave node and reads the data.

When a DataNode that holds the client’s data crashes, the data is not lost. Hadoop keeps replicas of everything the clients store, which makes the data durable and available. Even if every DataNode except one crashes, the client’s data remains accessible from the single surviving DataNode.

This shows that only one DataNode is available, and all the others have crashed.

Still, the data is available. Thanks to replication, the data is not lost even when the machines it was previously stored on go down.
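
You can check the number of live DataNodes yourself (on Hadoop 2+/3 the command is `hdfs dfsadmin -report`):

```bash
# Cluster health report: live vs. dead DataNodes and remaining capacity.
hadoop dfsadmin -report
```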

Thank You :)


Akash Pandey

I am a Computer Science undergraduate seeking an opportunity to work in a challenging environment.