Hadoop Node Manager Resources Test

A container in YARN is where a unit of work happens in the form of a task. A job/application is split into tasks, and each task is executed in one container that has a specific amount of allocated resources.

A container can be understood as a logical reservation of resources (memory and vCores) that will be utilized by the task running in that container.

For example, say your cluster has 4 nodes, each with 4 GB RAM and 4 CPU cores, making a total of 16 GB RAM and 16 CPU cores; of these, let's say 12 GB and 14 cores are available for use by YARN. Now, if you submit a map-only MapReduce job that spawns 8 map tasks, each requiring 1 GB memory and 1 core, then the NodeManagers will spawn 8 containers, each with reserved resources of 1 GB and 1 core.

There can be one or more containers on a given slave node, depending upon the size of the requested resources.
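
The container count a node can accommodate follows directly from this arithmetic: it is bounded by both the memory and the vCores available to YARN on that node. The Python sketch below illustrates the calculation using the hypothetical per-node figures from the example above; it is only an illustration, not how YARN's scheduler is implemented.

    # Hypothetical resources available to YARN on one node (roughly a quarter
    # of the 12 GB / 14 cores assumed in the example above).
    node_memory_gb = 3
    node_vcores = 3

    # Resources requested per container by each map task.
    container_memory_gb = 1
    container_vcores = 1

    # A node can host only as many containers as both limits allow.
    containers_per_node = min(node_memory_gb // container_memory_gb,
                              node_vcores // container_vcores)
    print(f"Containers this node can host: {containers_per_node}")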

When a container fails or dies, the NodeManager detects the failure event, launches a new container to replace the failed one, and restarts the task execution in the new container. An administrator needs to promptly capture container failures on each node and determine the cause of such failures, so they can make sure that such anomalies do not recur. One of the common causes of frequent container failures on a node is resource contention on that node. By tracking resource usage per node, an administrator can proactively detect a potential resource crunch and initiate measures to avert it and the container failures it may cause.

Besides failures, administrators should also be concerned about container launch delays. If the NodeManager on a node is slow every time it attempts to launch containers, task execution on that node slows down, which in turn affects application/job processing. It is therefore imperative that administrators promptly identify the nodes on which such delays occur frequently and investigate the reason for them.

Using the Hadoop Node Manager Resources test, administrators can focus on both container failures and launch delays. For each node in a cluster, this test reports the rate at which containers on that node failed and the average time taken to launch containers on it. Additionally, the test also measures how each node utilizes the allocated memory, CPU, and disk resources, and proactively alerts administrators to probable resource shortages on a node.

Target of the test : A Hadoop cluster

Agent deploying the test : A remote agent

Outputs of the test : One set of results for each node in the Hadoop cluster being monitored

Configurable parameters for the test
Parameter Description

Test Period

How often should the test be executed?

Host

The IP address of the NameNode that processes client connections to the cluster. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients.

Port

The port at which the NameNode accepts client connections. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients. By default, the NameNode's client connection port is 8020.

Name Node Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients. To run API commands on the NameNode and pull metrics, the eG agent needs access to the NameNode's web port.

To determine the correct web port of the NameNode, do the following:

  • Open the hdfs-default.xml file in the hadoop/conf/app directory.
  • Look for the dfs.namenode.http-address parameter in the file.
  • This parameter is configured with the IP address and base port on which the DFS NameNode web user interface listens. The format of this configuration is: <IP_Address>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:50070

Configure the <Port_Number> in the specification as the Name Node Web Port. In the case of the above sample configuration, this will be 50070.
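
If you prefer to extract this value programmatically, the short Python sketch below reads the configuration file and prints the port. The file path is the one mentioned above and is an assumption; adjust it to match your Hadoop installation.

    import xml.etree.ElementTree as ET

    # Assumed location of the configuration file; adjust for your installation.
    CONF_FILE = "/hadoop/conf/app/hdfs-default.xml"

    root = ET.parse(CONF_FILE).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == "dfs.namenode.http-address":
            # Value is of the form <IP_Address>:<Port_Number>, e.g. 192.168.10.100:50070
            web_port = prop.findtext("value").rsplit(":", 1)[1]
            print("Name Node Web Port:", web_port)
            break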

Name Node User Name

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients.

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the NameNode. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, then do not disturb the default value of this parameter, which is none.
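
As an illustration of how this simple authentication user is used, WebHDFS accepts the user name as a user.name query parameter on its REST calls. The Python sketch below lists the HDFS root directory this way; the host, port, and user name shown are placeholder assumptions, and the snippet is not how the eG agent itself collects metrics.

    import requests

    # Placeholder values; substitute the NameNode host, Name Node Web Port and
    # simple authentication user configured for this test.
    NAMENODE_HOST = "192.168.10.100"
    NAMENODE_WEB_PORT = 50070
    USER_NAME = "hdfs"

    # List the HDFS root via the WebHDFS REST API, passing the simple
    # authentication user as the user.name query parameter.
    url = f"http://{NAMENODE_HOST}:{NAMENODE_WEB_PORT}/webhdfs/v1/?op=LISTSTATUS"
    response = requests.get(url, params={"user.name": USER_NAME}, timeout=30)
    response.raise_for_status()
    print(response.json()["FileStatuses"]["FileStatus"])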

Resource Manager IP and Resource Manager Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions.

To pull metrics from the resource manager, the eG agent first needs to connect to the resource manager. For this, you need to configure this test with the IP address/host name of the resource manager and its web port. Use the Resource Manager IP and Resource Manager Web Port parameters to configure these details.

To determine the IP/host name and web port of the resource manager, do the following:

  • Open the yarn-site.xml file in the /opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop directory.
  • Look for the yarn.resourcemanager.webapp.address parameter in the file.
  • This parameter is configured with the IP address/host name and web port of the resource manager. The format of this configuration is: <IP_Address_or_Host_Name>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:8080

Configure the <IP_Address_or_Host_Name> in the specification as the Resource Manager IP, and the <Port_Number> as the Resource Manager Web Port. In the case of the above sample configuration, this will be 8080.
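
Once the resource manager's IP and web port are known, per-node container and resource figures of the kind this test reports can be retrieved from the ResourceManager's cluster nodes REST endpoint. The Python sketch below is a minimal illustration using the sample address above; it is not the eG agent's actual implementation, and the printed fields are those documented for the YARN ResourceManager REST API.

    import requests

    # Sample values from the configuration above; replace with your own.
    RM_IP = "192.168.10.100"
    RM_WEB_PORT = 8080

    # The cluster nodes endpoint returns one entry per NodeManager, including
    # running containers and memory/vCore usage on that node.
    url = f"http://{RM_IP}:{RM_WEB_PORT}/ws/v1/cluster/nodes"
    nodes = requests.get(url, timeout=30).json()["nodes"]["node"]

    for node in nodes:
        print(node["nodeHostName"],
              "containers:", node["numContainers"],
              "used MB:", node["usedMemoryMB"],
              "avail MB:", node["availMemoryMB"])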

Resource Manager Username

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions.

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the resource manager. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, then do not disturb the default value of this parameter, which is none.

Measurements made by the test
Measurement Description Measurement Unit Interpretation

Launched rate

Indicates the rate at which containers were launched on this node.

Containers/Sec

A high value is desired for this measure. A steady drop in this value could indicate launching delays. You may want to time-correlate the value of this measure with that of the Launch duration measure to figure out if a node is consistently slow in launching containers.

Successfully completed rate

Indicates the rate at which containers were successfully completed on this node.

Containers/Sec

Ideally, the value of this measure should be high.

Failed containers

Indicates the rate at which containers failed on this node.

Containers/Sec

A high value is indicative of frequent container failures on a node. This is a cause for concern, as it means that tasks running on that node are failing regularly. This in turn can adversely impact application/job processing.

You may want to check if adequate resources have been allocated to that node, so you can determine whether/not a resource shortage contributed to the container failures.

Killed rate

Indicates the rate at which containers on this node were killed.

Containers/Sec

Typically, YARN automatically kills containers that are using more memory than they are allowed to. A high value for this measure therefore indicates that many containers are utilizing memory excessively.

Initiated

Indicates the number of containers on this node that are currently initializing.

Number

 

Running

Indicates the number of running containers on this node.

Number

 

Allocated containers

Indicates the number of containers allocated to this node.

Number

 

Launch duration

Indicates the average time the NodeManager on this node took to launch containers.

Milliseconds

A high value for this measure is indicative of a delay in launching containers on the node. Compare the value of this measure across nodes to identify the node that is the slowest in launching containers.

Allocated memory

Indicates the total memory allocated to the containers on this node.

GB

 

Available memory

Indicates the amount of free memory currently available on this node for the use of containers.

GB

A high value is desired for this measure. A low value indicates excessive utilization of memory by the containers on the node. If the drop persists, it heralds a memory contention on the node, which may eventually cause containers on the node to crash. You may want to allocate more memory to the node to avoid this. Alternatively, you can identify the containers that are draining memory and resize the containers using the yarn.app.mapreduce.am.resource.mb property in the yarn-site.xml file.

Allocated vcores

Indicates the number of CPU cores allocated to this node.

Number

 

Available vcores

Indicates the number of CPU cores that are currently unused on this node.

Number

A high value is desired for this measure. A low value implies that the containers on the node are utilizing the allocated CPU resources excessively. If the drop persists, it heralds a CPU contention on the node, which may eventually cause containers on the node to crash. You may want to allocate more CPU cores to the node to avoid this.

Bad local directories

Indicates the number of bad local directories on this node.

Number

Ideally, the value of this measure should be 0. A non-zero value indicates that the node has run out of disk space or is rapidly running out of it. This is because, most often, bad local directories are observed when the disk utilization on the node exceeds YARN's max-disk-utilization-per-disk-percentage property value (default: 90.0%) in the yarn-site.xml file.

To resolve this issue, either clean up the disk that the unhealthy node is running on, or increase the threshold in the yarn-site.xml file.
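
To see the kind of check that marks a directory bad, the Python sketch below compares the utilization of the disk hosting a local directory against a 90% threshold. The directory path is a placeholder assumption; it should be one of the directories configured against yarn.nodemanager.local-dirs on the node, and the snippet only approximates the NodeManager's own disk health checker.

    import shutil

    # Placeholder path; use a directory configured against yarn.nodemanager.local-dirs.
    LOCAL_DIR = "/data/yarn/local"
    MAX_DISK_UTILIZATION_PERCENT = 90.0  # mirrors the default threshold

    usage = shutil.disk_usage(LOCAL_DIR)
    used_percent = usage.used / usage.total * 100

    if used_percent > MAX_DISK_UTILIZATION_PERCENT:
        print(f"{LOCAL_DIR}: {used_percent:.1f}% used - would be marked bad")
    else:
        print(f"{LOCAL_DIR}: {used_percent:.1f}% used - healthy")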

Bad log directories

Indicates the number of bad log directories on this node.

Number

Ideally, the value of this measure should be 0. If this measure reports a non-zero value, it could mean that a wrong log directory location has been configured against the yarn.nodemanager.log-dirs property in the yarn-site.xml file. Typically, bad log directories are reported if the configured log directory does not exist or has the wrong permissions set.

Disk space utilization across good local directories

Indicates the percentage of disk space used by the good local directories on this node.

Percent

If the Bad local directories measure reports a non-zero value for any node, then compare the value of this measure for that node with the value of the Disk space utilization across good log directories measure to figure out whether local directories or log directories are hogging disk space on that node.

Disk space utilization across good log directories

Indicates the percentage of disk space used by the good log directories on this node.

Percent

If the Bad local directories measure reports a non-zero value for any node, then compare the value of this measure for that node with the value of the Disk space utilization across good local directories measure to figure out whether local directories or log directories are hogging disk space on that node.