Hadoop Node Manager Status Test

The Hadoop YARN NodeManager is the per-machine/per-node framework agent that is responsible for containers, monitoring their resource usage, and reporting the same to the ResourceManager. The NodeManager also tracks the health of the node on which it is running, and controls auxiliary services that different YARN applications may use at any point in time. The NodeManager can execute any computation that makes sense to the ApplicationMaster, simply by creating a container for each task.

The NodeManager runs services to determine the health of the node it is executing on. The services perform checks on the disk as well as any user specified tests. If any health check fails, the NodeManager marks the node as unhealthy and communicates this to the ResourceManager, which then stops assigning containers to the node. Communication of the node status is done as part of the heartbeat between the NodeManager and the ResourceManager. Based on the status reports received from the NodeManager, the ResourceManager schedules jobs and allocates resources to the nodes.

Administrators need to be able to quickly spot unhealthy NodeManagers, so they can dig deep and figure out which health check failed and why. Administrators also need to ensure that the communication between the NodeManager and ResourceManager is alive at all times, as a break or delay in transmission of heartbeats can severely impair the ResourceManager's operations. This is where the Hadoop Node Manager Status test helps!

This test monitors the NodeManagers running in a cluster and reports the count of managers in different states. The administrator is notified if even one manager is unhealthy, inactive, or incommunicado. The test further reveals the count of managers that have been and/or are being decommissioned, so that administrators can keep track of the progress of a cluster down-scaling exercise that they may have triggered.
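To illustrate where such counts can come from: the YARN ResourceManager's REST API exposes a cluster metrics resource (/ws/v1/cluster/metrics) whose clusterMetrics object carries per-state node counts. The sketch below maps a response of that shape to the measures of this test. It is not the eG agent's implementation; the helper name node_manager_counts and the sample payload are assumptions made for illustration only.

```python
import json

# Maps each measure of this test to the matching field of the "clusterMetrics"
# object returned by the ResourceManager's /ws/v1/cluster/metrics resource.
MEASURE_FIELDS = {
    "Total node managers": "totalNodes",
    "Active node managers": "activeNodes",
    "Unhealthy node managers": "unhealthyNodes",
    "Lost node managers": "lostNodes",
    "Rebooted node managers": "rebootedNodes",
    "Decommissioning node managers": "decommissioningNodes",
    "Decommissioned node managers": "decommissionedNodes",
}


def node_manager_counts(response_text):
    """Hypothetical helper: extract NodeManager counts from a metrics response."""
    metrics = json.loads(response_text)["clusterMetrics"]
    # Some Hadoop releases may not report every field, so a missing field
    # yields None instead of raising an error.
    return {measure: metrics.get(field) for measure, field in MEASURE_FIELDS.items()}


# Illustrative payload; the values are invented for this example.
SAMPLE_RESPONSE = """
{"clusterMetrics": {"totalNodes": 6, "activeNodes": 3, "unhealthyNodes": 1,
                    "lostNodes": 1, "rebootedNodes": 0,
                    "decommissioningNodes": 0, "decommissionedNodes": 1}}
"""

if __name__ == "__main__":
    for measure, value in node_manager_counts(SAMPLE_RESPONSE).items():
        print(f"{measure}: {value}")
```

Keeping the measure-to-field mapping in one dictionary makes it easy to see which measure depends on which field, and to tolerate clusters that do not report all of them.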

Target of the test : A Hadoop cluster

Agent deploying the test : A remote agent

Outputs of the test : One set of the results for the Hadoop cluster being monitored

Configurable parameters for the test
Parameter Description

Test Period

How often the test should be executed.

Host

The IP address of the NameNode that processes client connections to the cluster. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients.

Port

The port at which the NameNode accepts client connections. By default, the NameNode's client connection port is 8020.

Name Node Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. To run API commands on the NameNode and pull metrics, the eG agent needs access to the NameNode's web port.

To determine the correct web port of the NameNode, do the following:

  • Open the hdfs-default.xml file in the hadoop/conf/app directory.
  • Look for the dfs.namenode.http-address parameter in the file.
  • This parameter is configured with the IP address and base port on which the DFS NameNode web user interface listens. The format of this configuration is: <IP_Address>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:50070

Configure the <Port_Number> in the specification as the Name Node Web Port. In the case of the above sample configuration, this will be 50070.
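The lookup described above can also be scripted. The following is a minimal sketch, assuming the configuration file uses the standard Hadoop <configuration>/<property>/<name>/<value> layout. The helper names property_value and web_port, and the inline sample file contents, are hypothetical and used only for illustration.

```python
import xml.etree.ElementTree as ET


def property_value(xml_text, name):
    """Return the <value> of the named <property> in a Hadoop config file, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def web_port(address):
    """Return the port from a '<IP_Address>:<Port_Number>' specification."""
    _, _, port = address.rpartition(":")
    return int(port)


# Sample file contents matching the sample configuration shown above.
SAMPLE_CONFIG = """<configuration>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>192.168.10.100:50070</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    address = property_value(SAMPLE_CONFIG, "dfs.namenode.http-address")
    print("Name Node Web Port:", web_port(address))
```

To run this against a real file, read its contents into a string and pass that string to property_value in place of SAMPLE_CONFIG.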

Name Node User Name

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager.

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the NameNode. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, then leave this parameter at its default value, none.
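With Hadoop's simple authentication, WebHDFS requests identify the caller through the user.name query parameter. The sketch below shows how a request URL could be built with or without that parameter, depending on whether a user name is configured. The function name webhdfs_url, the GETFILESTATUS operation, and the user name hdfs are assumptions chosen for illustration; they are not taken from the eG agent.

```python
from urllib.parse import urlencode


def webhdfs_url(host, web_port, path="/", op="GETFILESTATUS", user_name="none"):
    """Build a WebHDFS URL, adding user.name only when a user is configured."""
    params = {"op": op}
    # The value "none" means that no simple authentication user is required.
    if user_name and user_name.lower() != "none":
        params["user.name"] = user_name
    return f"http://{host}:{web_port}/webhdfs/v1{path}?{urlencode(params)}"


if __name__ == "__main__":
    # No user configured: the URL carries only the operation.
    print(webhdfs_url("192.168.10.100", 50070))
    # Hypothetical simple authentication user "hdfs".
    print(webhdfs_url("192.168.10.100", 50070, user_name="hdfs"))
```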

Resource Manager IP and Resource Manager Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions.

To pull metrics from the resource manager, the eG agent first needs to connect to the resource manager. For this, you need to configure this test with the IP address/host name of the resource manager and its web port. Use the Resource Manager IP and Resource Manager Web Port parameters to configure these details.

To determine the IP/host name and web port of the resource manager, do the following:

  • Open the yarn-site.xml file in the /opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop directory.
  • Look for the yarn.resourcemanager.webapp.address parameter in the file.
  • This parameter is configured with the IP address/host name and web port of the resource manager. The format of this configuration is: <IP_Address_or_Host_Name>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:8080

Configure the <IP_Address_or_Host_Name> in the specification as the Resource Manager IP, and the <Port_Number> as the Resource Manager Web Port. In the case of the above sample configuration, the Resource Manager IP will be 192.168.10.100 and the Resource Manager Web Port will be 8080.
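As a quick check of the two values, the address can be split programmatically and used to form a ResourceManager REST URL. The sketch below assumes the <IP_Address_or_Host_Name>:<Port_Number> format described above and uses the ResourceManager's cluster metrics resource (/ws/v1/cluster/metrics) as the example target. The function names are hypothetical.

```python
def split_rm_address(address):
    """Split '<IP_Address_or_Host_Name>:<Port_Number>' into (host, port)."""
    host, sep, port = address.strip().rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected <host>:<port>, got {address!r}")
    return host, int(port)


def cluster_metrics_url(address):
    """Form the cluster metrics URL for the given resource manager address."""
    host, port = split_rm_address(address)
    return f"http://{host}:{port}/ws/v1/cluster/metrics"


if __name__ == "__main__":
    rm_ip, rm_web_port = split_rm_address("192.168.10.100:8080")
    print("Resource Manager IP:", rm_ip)
    print("Resource Manager Web Port:", rm_web_port)
    print(cluster_metrics_url("192.168.10.100:8080"))
```

Opening the resulting URL in a browser, or with any HTTP client, is one way to confirm that the configured IP address and web port actually reach the resource manager.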

Resource Manager Username

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions.

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the resource manager. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, then leave this parameter at its default value, none.

Measurements made by the test
Measurement Description Measurement Unit Interpretation

Total node managers

Indicates the total number of NodeManagers in the cluster.

Number

 

Active node managers

Indicates the number of NodeManagers in the cluster that are currently active.

Number

Ideally, this value should be close to the value of the Total node managers measure.

Unhealthy node managers

Indicates the number of NodeManagers for which one/more health checks failed.

Number

Ideally, the value of this measure should be 0. A non-zero value implies that one/more nodes in the cluster are unhealthy, and hence unavailable to run containers. To ensure uninterrupted processing, you may have to identify the node that is unhealthy and diagnose the reason for its poor health. Use the Hadoop Node Manager Health test to identify the unhealthy node.

Lost node managers

Indicates the number of NodeManagers in the cluster that are currently lost.

Number

If the NodeManager on a node has not sent heartbeats to the ResourceManager for longer than a configured period of time, then that node/NodeManager is considered lost.

Ideally, the value of this measure should be 0.

Rebooted node managers

Indicates the number of NodeManagers that were rebooted.

Number

 

Decommissioning node managers

Indicates the number of NodeManagers that are being decommissioned.

Number

This refers to NodeManagers for which decommissioning is in progress.

Typically, lost nodes are decommissioned. Decommissioning is also performed as part of a regular cluster down-sizing procedure.

Decommissioned node managers

Indicates the number of NodeManagers that have been decommissioned.

Number
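The interpretation guidelines above can be summarized as a few simple checks. The following is a minimal sketch, assuming the counts are already available in a dictionary. The function name, the dictionary keys, and the message wording are hypothetical, and the checks are not the alerting logic of the eG agent.

```python
def interpret(counts):
    """Return findings for NodeManager counts, following the guidelines above."""
    findings = []
    # Ideally, the Unhealthy and Lost measures should both be 0.
    if counts["unhealthy"] > 0:
        findings.append(f"{counts['unhealthy']} NodeManager(s) failed a health check")
    if counts["lost"] > 0:
        findings.append(f"{counts['lost']} NodeManager(s) stopped sending heartbeats")
    # Ideally, Active should be close to Total; report the gap if there is one.
    gap = counts["total"] - counts["active"]
    if gap > 0:
        findings.append(f"{gap} of {counts['total']} NodeManager(s) are not active")
    # Decommissioning in progress is informational, e.g. during a down-scaling.
    if counts["decommissioning"] > 0:
        findings.append(f"{counts['decommissioning']} NodeManager(s) are being decommissioned")
    return findings


if __name__ == "__main__":
    sample = {"total": 6, "active": 3, "unhealthy": 1, "lost": 1, "decommissioning": 0}
    for finding in interpret(sample):
        print(finding)
```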