Hadoop Cluster:
Normally, any set of loosely or tightly connected computers that work together as a single system is called a cluster. In simple words, a computer cluster used for Hadoop is called a Hadoop cluster.
A Hadoop cluster is a special type of computational cluster designed for storing and analyzing vast amounts of unstructured data in a distributed computing environment. These clusters run on low-cost commodity computers.
Hadoop clusters are often referred to as "shared nothing" systems because the only thing that is shared between nodes is the network that connects them.
Large Hadoop clusters are arranged in several racks. Network traffic between nodes in the same rack is much more desirable than traffic across racks, since intra-rack bandwidth is typically higher and involves fewer network hops.
Core Components of Hadoop Cluster:
A Hadoop cluster has three kinds of components:
- Client
- Master
- Slave
Let's try to understand these components one by one:
1 Client:
The client is neither master nor slave. Its role is to load data into the cluster, submit MapReduce jobs describing how that data should be processed, and then retrieve the results after job completion, as sketched below.
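For illustration only, here is a minimal sketch of client-side job submission, assuming the classic Hadoop 1.x (MRv1) API that the JobTracker in this answer implies; the class name, HDFS paths, and the WordCountMapper/WordCountReducer classes (sketched in the JobTracker section below) are hypothetical:

```java
// Hypothetical client sketch using the Hadoop 1.x (MRv1) API.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitJob.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Mapper/Reducer classes are sketched in the JobTracker section below.
        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);

        // Load data from / write results to HDFS (hypothetical paths).
        FileInputFormat.setInputPaths(conf, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/demo/output"));

        // Submits the job to the JobTracker and blocks until it finishes.
        JobClient.runJob(conf);
    }
}
```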
2 Masters:
The masters consist of three components: the NameNode, the Secondary NameNode, and the JobTracker.
2.1 NameNode:
The NameNode does NOT store the files themselves, only the files' metadata. In a later section we will see that it is actually the DataNodes which store the file data.
The NameNode keeps track of all filesystem-related information, such as the following (a short sketch of reading this metadata follows the list):
- Which section (block) of a file is saved on which part of the cluster
- The last access time for each file
- User permissions, i.e., which users have access to a file
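As a minimal sketch of the kind of metadata the NameNode serves, the following reads a file's owner, permissions, last access time, and block locations through the HDFS FileSystem API; the file path is made up:

```java
// Hypothetical sketch: reading NameNode-tracked metadata via HDFS.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataPeek {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/user/demo/data.txt"));

        // Permissions and last access time come from NameNode metadata.
        System.out.println("owner: " + st.getOwner()
                + ", permissions: " + st.getPermission()
                + ", last access: " + st.getAccessTime());

        // Which section (block) of the file lives on which DataNodes.
        for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println("offset " + loc.getOffset()
                    + ", length " + loc.getLength()
                    + ", hosts " + String.join(",", loc.getHosts()));
        }
    }
}
```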
2.2 JobTracker:
The JobTracker coordinates the parallel processing of data using MapReduce: it splits each submitted job into map and reduce tasks and schedules those tasks on the slave nodes, as sketched below.
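To make this concrete, here is a minimal word-count sketch (again assuming the Hadoop 1.x MRv1 API) defining the hypothetical WordCountMapper and WordCountReducer referenced in the client sketch above; the JobTracker would schedule many parallel copies of these tasks across the slave nodes:

```java
// Hypothetical map and reduce tasks (classic MRv1 API).
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            out.collect(word, ONE); // emit one (word, 1) pair per token
        }
    }
}

class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get(); // add up the 1s for this word
        }
        out.collect(key, new IntWritable(sum));
    }
}
```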
2.3 Secondary NameNode:
The job of the Secondary NameNode is to contact the NameNode periodically, after a certain time interval (by default 1 hour).
The NameNode keeps all filesystem metadata in RAM and has no mechanism of its own to compact that metadata into a fresh checkpoint on disk. So if the NameNode crashes, everything held in RAM is lost and there is no up-to-date image of the filesystem. What the Secondary NameNode does is contact the NameNode every hour and pull a copy of the metadata out of it. It then shuffles and merges this information into a clean, checkpointed image, sends it back to the NameNode, and keeps a copy for itself. Hence the Secondary NameNode is not a backup; rather, it does the job of housekeeping.
In case of NameNode failure, this saved metadata can be used to rebuild the NameNode easily.
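A small supporting sketch, assuming Hadoop 1.x: the checkpoint interval is controlled by the fs.checkpoint.period configuration property (in seconds), whose 3600-second default is the one-hour figure mentioned above:

```java
// Sketch, assuming Hadoop 1.x: fs.checkpoint.period governs how often
// the Secondary NameNode checkpoints; 3600 s (1 hour) is the default.
import org.apache.hadoop.conf.Configuration;

public class CheckpointPeriod {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        long period = conf.getLong("fs.checkpoint.period", 3600);
        System.out.println("Secondary NameNode checkpoints every "
                + period + " seconds");
    }
}
```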
3 Slaves:
Slave nodes make up the majority of machines in a Hadoop cluster and are responsible for:
- Storing the data
- Processing the computations
Each slave runs both a DataNode and a TaskTracker daemon, which communicate with their respective masters: the TaskTracker daemon is a slave to the JobTracker, and the DataNode daemon is a slave to the NameNode.