Hadoop Internals



Hadoop Architecture Overview


Apache Hadoop is an open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware. This runtime environment has five main building blocks (from bottom to top):

Hadoop Architecture Overview

The YARN infrastructure and the HDFS federation are completely decoupled and independent: the former provides the resources for running an application, while the latter provides storage. The MapReduce framework is only one of many possible frameworks that can run on top of YARN (although it is currently the only one implemented).

YARN: Application Startup

YARN Architecture

In YARN, there are at least three actors:

The application startup process is the following:

  1. a client submits an application to the Resource Manager
  2. the Resource Manager allocates a container
  3. the Resource Manager contacts the related Node Manager
  4. the Node Manager launches the container
  5. the Container executes the Application Master
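
The five steps above can be sketched as a small in-memory simulation. This is only an illustration, not the YARN API: the classes `Container`, `NodeManager`, and `ResourceManager`, the `submit` and `launch` methods, and the "least loaded node" allocation rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Container:
    container_id: int
    node_name: str
    program: Optional[str] = None  # what the container is executing, if anything


class NodeManager:
    """Toy stand-in for the per-node agent (not the real YARN class)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.containers: List[Container] = []

    def launch(self, container: Container, program: str) -> Container:
        # Step 4: the Node Manager launches the container.
        self.containers.append(container)
        # Step 5: the container executes the program it was given.
        container.program = program
        return container


class ResourceManager:
    """Toy stand-in for the cluster-wide master (not the real YARN class)."""

    def __init__(self, node_managers: List[NodeManager]) -> None:
        self.node_managers = node_managers
        self.next_container_id = 0

    def submit(self, app_name: str) -> Container:
        # Step 1: a client submits an application by calling submit().
        # Step 2: allocate a container. Assumption: choose the node that
        # currently hosts the fewest containers.
        node = min(self.node_managers, key=lambda nm: len(nm.containers))
        container = Container(self.next_container_id, node.name)
        self.next_container_id += 1
        # Step 3: contact the Node Manager responsible for that node.
        return node.launch(container, f"ApplicationMaster({app_name})")


if __name__ == "__main__":
    rm = ResourceManager([NodeManager("node-1"), NodeManager("node-2")])
    print(rm.submit("wordcount"))
```

In this sketch the Resource Manager never runs application code itself; it only picks a node and hands the launch over to that node's Node Manager, which mirrors the division of labour described in the steps.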

YARN: Application Startup

The Application Master is responsible for the execution of a single application. It requests containers from the Resource Scheduler (part of the Resource Manager) and executes specific programs (e.g., the main of a Java class) on the containers it obtains. Because the Application Master knows the application logic, it is framework-specific. The MapReduce framework provides its own implementation of an Application Master.
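
To make the Application Master's role more concrete, here is a minimal sketch of the request-and-execute loop. It is a toy model under stated assumptions, not the real YARN protocol: `ResourceScheduler`, `ApplicationMaster`, `allocate`, `release`, and `run` are hypothetical names, tasks are plain Python callables, and a "container" is just a string identifier.

```python
from typing import Callable, List


class ResourceScheduler:
    """Hypothetical scheduler that hands out a limited number of containers."""

    def __init__(self, free_containers: int) -> None:
        self.free = free_containers
        self.next_id = 0

    def allocate(self, requested: int) -> List[str]:
        # Grant as many containers as requested, up to what is free.
        granted = min(requested, self.free)
        self.free -= granted
        ids = [f"container-{self.next_id + i}" for i in range(granted)]
        self.next_id += granted
        return ids

    def release(self, containers: List[str]) -> None:
        # Return finished containers to the free pool.
        self.free += len(containers)


class ApplicationMaster:
    """Hypothetical master for one application; the tasks carry the logic."""

    def __init__(self, scheduler: ResourceScheduler) -> None:
        self.scheduler = scheduler

    def run(self, tasks: List[Callable[[], object]]) -> List[object]:
        pending = list(tasks)
        results: List[object] = []
        while pending:
            # Ask the scheduler for one container per pending task.
            containers = self.scheduler.allocate(len(pending))
            if not containers:
                raise RuntimeError("no containers available")
            batch = pending[:len(containers)]
            pending = pending[len(containers):]
            # "Execute" each task on its container (here: just call it).
            for _container, task in zip(containers, batch):
                results.append(task())
            self.scheduler.release(containers)
        return results


if __name__ == "__main__":
    am = ApplicationMaster(ResourceScheduler(free_containers=2))
    print(am.run([lambda: "map-0", lambda: "map-1", lambda: "reduce-0"]))
```

The scheduler here knows nothing about what the tasks do. Only the Application Master decides which programs to run and in what order, which is why each framework supplies its own.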

The Resource Manager is a single point of failure in YARN. By using Application Masters, YARN spreads the metadata of running applications across the cluster. This reduces the load on the Resource Manager and makes it quick to recover.