Hadoop Cluster on AWS

Akash Pandey
Oct 11, 2020 · 4 min read
Apache Hadoop

What is Hadoop?

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

Hadoop’s four core elements are:

  • Hadoop Distributed File System (HDFS)
  • Hadoop MapReduce
  • Hadoop YARN
  • Hadoop Common

What is a Hadoop Cluster?

A Hadoop cluster is a collection of commodity computers interconnected and working together as a single unit. It is designed to store, manage, and analyze petabytes of data with remarkable agility, which makes it well suited to big-data workloads.

Hadoop Cluster

What is Big Data, and Why is It Important? Why Hadoop?

Advantages of a Hadoop Cluster

  • Scalable
  • Cost-effective
  • Flexible
  • Fast
  • Resilient to Failure

Here, I set up a Hadoop cluster on the AWS cloud. Since I don’t have to provision my own hardware, I can use storage and compute from AWS and pay only for what I use, which makes the setup both cost-effective and scalable.

How to Launch Instances on the AWS Cloud?
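
If you prefer the command line to the AWS web console, here is a minimal sketch using the AWS CLI. The AMI ID, key pair, and security group below are placeholders, not the ones used in this setup:

```bash
# Launch EC2 instances for the cluster (1 NameNode, 8 DataNodes, 1 client).
# The AMI ID, key name, and security group are placeholders — use your own.
aws ec2 run-instances \
    --image-id ami-0abcdef1234567890 \
    --instance-type t2.micro \
    --count 10 \
    --key-name my-hadoop-key \
    --security-group-ids sg-0123456789abcdef0 \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=hadoop-node}]'
```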

How to Set Up the Hadoop Cluster?
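
A rough sketch of the configuration on each node, assuming a Hadoop 1.x-style setup (Hadoop 2+/3 uses fs.defaultFS and `hdfs --daemon start` instead). The NameNode IP 1.2.3.4 and the /nn directory are placeholders:

```bash
# core-site.xml on every node — points at the NameNode (placeholder IP 1.2.3.4).
cat > $HADOOP_HOME/conf/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://1.2.3.4:9001</value>
  </property>
</configuration>
EOF

# hdfs-site.xml on the NameNode — metadata directory (placeholder /nn).
# On each DataNode, set dfs.data.dir to a local storage directory instead.
cat > $HADOOP_HOME/conf/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
EOF

# Format the NameNode once, then start the daemons.
hadoop namenode -format          # on the NameNode only
hadoop-daemon.sh start namenode  # on the NameNode
hadoop-daemon.sh start datanode  # on each DataNode
```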

Here, I proved two things:

1st

✴️ Suppose a client uploads a file (for example, f.txt) of size 50 MB, and the replication factor is 3.

✴️ Does the client send the entire file to the master, or does the master provide the IP addresses of the DataNodes so that the client can upload the file to them directly?

✴️ Who actually uploads the file?

✴️ Answer: The client gets the DataNode IPs from the master and uploads the file to the DataNodes itself.
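
A quick way to exercise this write path from the client (assuming the cluster above is running and f.txt exists locally):

```bash
# Upload from the client. The NameNode only hands back block locations;
# the client streams the data straight to the DataNodes.
hadoop fs -put f.txt /f.txt
```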

2nd

✴️ Does the client go through the master to read a file stored on a slave, or does the client go to the slave directly and read the data?

✴️ Answer: The client goes to the slave directly and reads the data stored on it.
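
And the corresponding read from the client (same hypothetical file):

```bash
# Read the file back. After fetching block locations from the NameNode,
# the client pulls the bytes directly from the DataNodes.
hadoop fs -cat /f.txt
```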

Here, I set up the Hadoop cluster with 8 DataNodes, 1 client, and 1 NameNode.

Here, the client uploads a file larger than 50 MB; HDFS stores it on three different DataNodes, creating three replicas by default, as shown in the figure.
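
One way to confirm the block placement and replica count (using the hypothetical /f.txt from above; on Hadoop 2+/3 the command is `hdfs fsck`):

```bash
# List each block of the file, its replica count, and which DataNodes hold it.
hadoop fsck /f.txt -files -blocks -locations
```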

Here, when the client reads the file, it goes directly to the slave node and reads the data.

When a DataNode that holds the client’s data crashes, the data is not lost. Hadoop keeps replicas of everything the clients store, which makes the data durable and available. Even if every DataNode except one crashes, the client’s data remains accessible from the single surviving DataNode.

This shows that only one DataNode is available, and all the others have crashed.

Still, the data is available. Thanks to replication, the data is not lost even when the machines it was previously stored on go down.
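
You can check the number of live DataNodes yourself (on Hadoop 2+/3 the command is `hdfs dfsadmin -report`):

```bash
# Cluster health report: live vs. dead DataNodes and remaining capacity.
hadoop dfsadmin -report
```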

Thank You :)


Akash Pandey

I am a Computer Science undergraduate seeking an opportunity to work in a challenging environment.