Hadoop, and why I need it

I graduated in computer science with an Artificial Intelligence specialization; however, during my academic path I was never taught how to use the most common frameworks and environments. When I started looking for my first job, I immediately went for machine learning and data analysis junior positions. I was sure I was qualified for them, but the truth is that I'm probably not (yet). Most positions are strongly related to the so-called Industry 4.0, a new way of managing industrial machines with IoT and Big Data technologies. This is when my attention focused on a couple of names I had heard maybe once or twice so far: Hadoop and Spark.

Big Data does not necessarily involve machine learning, but you usually end up running prediction and learning algorithms on it. Conducting Big Data analysis is almost never feasible on a single machine and instead requires many of them, and this is where Hadoop makes its entrance.


Hadoop is an Apache Software Foundation framework for distributed storage and distributed processing.

It provides data reliability, availability and consistency through a distributed file system called HDFS (Hadoop Distributed FileSystem). HDFS works by splitting, replicating and distributing the data across the cluster. It is also smart: thanks to its “rack-awareness” it can avoid excessive network traffic by processing the data “near” where it is stored. This is possible because Hadoop understands the cluster topology. While HDFS is the suggested file system, Hadoop also runs on other file systems such as Amazon S3.
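To make the splitting and replication concrete, here is a minimal sketch in plain Python (not real HDFS code; the function names and the fixed remote-rack choice are mine). It splits a file into fixed-size blocks and places each block's replicas in the spirit of the default HDFS policy: one copy on the writer's rack, the remaining copies together on a remote rack.

```python
# Illustrative model of HDFS block splitting and rack-aware replica placement.
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MiB
REPLICATION = 3                 # HDFS default replication factor

def split_into_blocks(file_size):
    """Return the size in bytes of each block a file occupies."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(BLOCK_SIZE, remaining))
        remaining -= BLOCK_SIZE
    return blocks

def place_replicas(racks, local_rack):
    """First replica on the writer's rack, the others together on one remote rack."""
    remote = [r for r in racks if r != local_rack][0]  # real HDFS picks randomly
    return [local_rack, remote, remote][:REPLICATION]

# A 300 MiB file becomes three blocks: 128 MiB, 128 MiB and 44 MiB.
blocks = split_into_blocks(300 * 1024 * 1024)
print([b // (1024 * 1024) for b in blocks])            # [128, 128, 44]
print(place_replicas(["rack1", "rack2", "rack3"], "rack1"))  # ['rack1', 'rack2', 'rack2']
```

Keeping two replicas on the same remote rack is the trade-off HDFS makes between fault tolerance (surviving a whole-rack failure) and inter-rack write traffic.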

Another key component is the resource manager, which schedules the user applications that run on top of Hadoop itself (such as Spark applications) by assigning cluster resources.

The programming model proposed by Hadoop is MapReduce, a model that defines a parallel and distributed way to manage large amounts of data (I’ll probably focus on it in a future post).
In the first Hadoop version, MapReduce was part of the resource manager, on top of HDFS. However, Hadoop 2 introduced a more modular approach. The resource manager, now called YARN, sits on top of HDFS, while MapReduce sits on top of YARN. YARN is thus the resource manager that schedules all the tasks in Hadoop 2 (and in the upcoming third version).
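The MapReduce model itself can be sketched without any Hadoop at all. This is a word count, the classic example, in plain Python: map emits (key, value) pairs, a shuffle step groups values by key (the framework does this between the two phases), and reduce aggregates each group. The function names are mine, not Hadoop API calls.

```python
# Minimal model of the MapReduce programming model: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group all emitted values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate the group for one key: here, sum the counts.
    return (key, sum(values))

lines = ["Hadoop runs MapReduce", "MapReduce runs on YARN"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'hadoop': 1, 'runs': 2, 'mapreduce': 2, 'on': 1, 'yarn': 1}
```

What makes the model scale is that `map_phase` calls are independent of each other, and so are `reduce_phase` calls once the shuffle has grouped the keys, so both phases can run in parallel across the cluster's slave nodes.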

Other good points to consider about Hadoop:

  • FOSS, covered by the Apache License 2.0.
  • widely adopted, so you can find plenty of documentation and books.
  • it runs on commodity hardware, so you can easily scale processing power as well as physical storage.
  • it is the base layer for many other distributed frameworks: several DBMSs are built on top of Hadoop, as well as machine learning and Big Data frameworks.
  • the Hadoop libraries are mainly written in Java, a widespread object-oriented language.

Cluster overview

In a Hadoop cluster we can identify three different node types:

  • Master Nodes, responsible for managing cluster operations. A number of services run on these nodes; the most important are:
    • NameNode (HDFS layer): manages the HDFS distributed file system and keeps track of its metadata
    • ResourceManager (YARN layer): the heart of YARN, which schedules the cluster resources
    • JobTracker (MapReduce layer): coordinates MapReduce parallel processing. It exists in Hadoop 1 clusters but is deprecated in Hadoop 2, where its role has been absorbed by YARN’s ResourceManager.

    Since these services are vital to cluster operations, there are usually additional Master nodes that back up the main one.

  • Slave Nodes, which store and process data. They mainly provide two services:
    • DataNode (HDFS layer): responsible for locally storing data under the supervision of the Master’s NameNode.
    • NodeManager (YARN layer): supervises the slave node’s operations and local resource allocation.
  • Client Nodes, from which users can submit new tasks and access HDFS.

Hadoop 2 Cluster overview
In the picture you can see two other key entities:

  • ApplicationMaster (YARN layer): the ResourceManager spawns an ApplicationMaster instance for a single task or a set of tasks. The AM is responsible for negotiating resources with the RM. This process runs on the slave nodes and works together with the NodeManager.
  • Container (YARN layer): represents a resource allocation and code execution context, granted after an ApplicationMaster request has been accepted by the ResourceManager.

When the user requests the execution of a task, the ResourceManager spawns an ApplicationMaster and registers it, leaving it the responsibility of ensuring the correct execution of the task. When needed, the ApplicationMaster asks the ResourceManager for resources, so that the NodeManagers can reserve a Container in which to execute. When the task ends, the ApplicationMaster is terminated and its resources are freed.
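The lifecycle above can be sketched as a toy simulation in plain Python. To be clear, none of these class or method names come from YARN's actual API; they are my own stand-ins for the roles involved: the RM spawns and registers an AM, the AM negotiates a Container, and everything is freed when the task ends.

```python
# Toy simulation of the YARN task lifecycle: submit -> allocate -> run -> free.
class Container:
    def __init__(self, memory):
        self.memory = memory

class ResourceManager:
    def __init__(self, total_memory):
        self.free_memory = total_memory
        self.application_masters = []

    def submit(self, task):
        # Spawn an ApplicationMaster for the task and register it.
        am = ApplicationMaster(task)
        self.application_masters.append(am)
        return am

    def allocate(self, memory):
        # Grant a Container if enough cluster resources remain.
        if memory > self.free_memory:
            return None
        self.free_memory -= memory
        return Container(memory)

    def release(self, container):
        self.free_memory += container.memory

class ApplicationMaster:
    def __init__(self, task):
        self.task = task

    def run(self, rm, memory):
        container = rm.allocate(memory)      # negotiate resources with the RM
        if container is None:
            return None                      # not enough resources
        result = self.task()                 # a NodeManager would run this in the Container
        rm.release(container)                # task done: free the Container...
        rm.application_masters.remove(self)  # ...and terminate the AM
        return result

rm = ResourceManager(total_memory=8192)
am = rm.submit(lambda: "task output")
print(am.run(rm, memory=1024))  # task output
print(rm.free_memory)           # 8192 again: all resources were freed
```

The point of the indirection is visible even in the toy version: the ResourceManager only hands out resources, while the per-task logic (retrying, requesting more Containers) lives in the ApplicationMaster, so a failing task cannot take the whole scheduler down with it.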
Here you can find a more in-depth analysis of how YARN works that I found quite useful.
This sequence happens at the YARN level, which relies on the HDFS layer for physical storage. A Hadoop MapReduce program usually starts by reading its input from HDFS and ends by writing its output to HDFS.

At the time of writing, I have already deployed my first (virtualized) Hadoop cluster on my machine. The next Hadoop post will show how I did it, as an example.
