Hadoop Data Node Operations Test

A DataNode identifies the block replicas in its possession to the NameNode by sending a block report. A block report contains the block ID, the generation stamp, and the length of each block replica the server hosts. The first block report is sent immediately after DataNode registration. This is a full block report, a complete catalog of all of the node's blocks, and full block reports are re-sent periodically thereafter. In between full reports, the DataNode sends incremental block reports, which notify the NameNode of block replicas that were added/deleted recently and keep its view of where block replicas are located on the cluster up to date.

Like block reports, a DataNode also sends cache reports to the NameNode. A cache report contains the list of cached blocks on a DataNode. After receiving this report, the NameNode determines the cache state on the DataNode and responds by sending caching instructions for the DataNode to execute.

If block and cache reporting operations occur too frequently or take too long to complete, they can cause the Hadoop cluster to malfunction and leave it unable to serve end-user requests efficiently. To avoid such an outcome, administrators should closely observe the block report and cache report operations per DataNode, rapidly identify DataNodes where too many of these operations are occurring or are taking too long to complete, deduce the cause, and resolve it well before end users notice any latency. This is where the Hadoop Data Node Operations test helps!

This test monitors the block reporting and cache reporting operations of every DataNode. In the process, the test measures the count of such operations and the time taken by each DataNode to perform them. This way, the test warns administrators of reporting abnormalities, which, if left unresolved, can affect the end-user experience with the Hadoop cluster. With the help of the pointers offered by this test, administrators can, if required, change the block and cache reporting configuration, so as to assure users of a superior experience.

Target of the test : A Hadoop cluster

Agent deploying the test : A remote agent

Outputs of the test : One set of results for each DataNode in the target Hadoop cluster

Configurable parameters for the test
Parameter | Description

Test Period

How often should the test be executed.

Host

The IP address of the NameNode that processes client connections to the cluster. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients.

Port

The port at which the NameNode accepts client connections. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients. By default, the NameNode's client connection port is 8020.

Name Node Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients. To run API commands on the NameNode and pull metrics, the eG agent needs access to the NameNode's web port.

To determine the correct web port of the NameNode, do the following:

  • Open the hdfs-default.xml file in the hadoop/conf/app directory.
  • Look for the dfs.namenode.http-address parameter in the file.
  • This parameter is configured with the IP address and base port on which the DFS NameNode web user interface listens. The format of this configuration is: <IP_Address>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:50070

Configure the <Port_Number> in the specification as the Name Node Web Port. In the case of the above sample configuration, this will be 50070.
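If you prefer to extract the port programmatically, the steps above can be sketched as follows. This is an illustrative sketch only: the inline XML sample mirrors the configuration shown above and is not taken from a live cluster.

```python
# Hypothetical sketch: extract the NameNode web port from the
# dfs.namenode.http-address property of an HDFS configuration file.
# The XML sample below is illustrative, not from a real cluster.
import xml.etree.ElementTree as ET

SAMPLE_HDFS_CONF = """
<configuration>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>192.168.10.100:50070</value>
  </property>
</configuration>
"""

def web_port(conf_xml: str, prop: str = "dfs.namenode.http-address") -> str:
    """Return the port portion of an <IP_Address>:<Port_Number> value."""
    root = ET.fromstring(conf_xml)
    for p in root.findall("property"):
        if p.findtext("name") == prop:
            # Split from the right so IPv6-style values are not mangled
            return p.findtext("value").rsplit(":", 1)[1]
    raise KeyError(prop)

print(web_port(SAMPLE_HDFS_CONF))  # -> 50070
```

For the sample configuration above, this yields 50070, which is the value to set as the Name Node Web Port.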

Name Node User Name

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients.

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the NameNode. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, then do not disturb the default value of none for this parameter.

Resource Manager IP and Resource Manager Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions.

To pull metrics from the resource manager, the eG agent first needs to connect to the resource manager. For this, you need to configure this test with the IP address/host name of the resource manager and its web port. Use the Resource Manager IP and Resource Manager Web Port parameters to configure these details.

To determine the IP/host name and web port of the resource manager, do the following:

  • Open the yarn-site.xml file in the /opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop directory.
  • Look for the yarn.resourcemanager.webapp.address parameter in the file.
  • This parameter is configured with the IP address/host name and web port of the resource manager. The format of this configuration is: <IP_Address_or_Host_Name>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:8080

Configure the <IP_Address_or_Host_Name> in the specification as the Resource Manager IP, and the <Port_Number> as the Resource Manager Web Port. In the case of the above sample configuration, this will be 8080.

Resource Manager Username

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions.

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the resource manager. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, then do not disturb the default value of none for this parameter.

Measurements made by the test
Measurement | Description | Measurement Unit | Interpretation

Block report operations

Indicates the number of block report operations that this DataNode has performed.

Number

A steady increase in the value of this measure could indicate that incremental block reports are being sent frequently by a DataNode. You may want to take a look at the value of the Incremental block report operations measure to confirm this. Too many incremental block report operations are unhealthy for the cluster.

Block report operation rate

Indicates the rate at which this DataNode performs block reporting.

Operations/Sec

A high value of this measure may indicate frequent changes to the blocks on a DataNode - e.g., a change in the location of block replicas, the addition of many new blocks, etc. You may want to check the value of the Incremental block report operation rate measure to confirm this. Too many incremental block report operations occurring in quick succession may be unhealthy for the cluster.

Average block report operation time

Indicates the average time that this DataNode takes to perform block report operations.

Milliseconds

Ideally, the value of this measure should be low. A high value or a consistent increase in the value of this measure could be indicative of latency in block reporting.
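The underlying block-report timings are also exposed in the DataNode's JMX output (served at http://<datanode>:<web_port>/jmx) under the DataNodeActivity bean, whose attributes include BlockReportsNumOps and BlockReportsAvgTime. The sketch below parses a trimmed, made-up sample of such a response; the bean name and values are illustrative assumptions, not output from a real cluster.

```python
# Hypothetical sketch: read BlockReportsAvgTime from a DataNode /jmx response.
# SAMPLE_JMX is a trimmed, made-up sample of that JSON output.
import json

SAMPLE_JMX = """
{
  "beans": [
    {
      "name": "Hadoop:service=DataNode,name=DataNodeActivity-dn1-9866",
      "BlockReportsNumOps": 24,
      "BlockReportsAvgTime": 112.5
    }
  ]
}
"""

def block_report_avg_time(jmx_json: str) -> float:
    """Pull BlockReportsAvgTime (milliseconds) from a /jmx response."""
    for bean in json.loads(jmx_json)["beans"]:
        if "DataNodeActivity" in bean.get("name", ""):
            return bean["BlockReportsAvgTime"]
    raise LookupError("DataNodeActivity bean not found")

print(block_report_avg_time(SAMPLE_JMX))  # -> 112.5
```

A steadily climbing average here corroborates the latency signal this measure reports.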

Incremental block report operations

Indicates the number of incremental block report operations performed by this DataNode.

Number

Few, well-spaced-out incremental block reporting operations are healthy for a cluster. In other words, a low value is ideal for both this measure and the Incremental block report operation rate measure.

Frequent incremental block reports from DataNodes can block RPC handler threads that are waiting to acquire a lock. Eventually, all available RPC handler threads may be consumed while waiting to acquire the lock. In extreme cases, frequent incremental block reporting can even block HA NameNode failover, because no RPC handler thread is available to handle the failover request. Even if HA failover succeeds, the cluster may still be left with many dead DataNodes, because their heartbeats could not be processed.

To prevent such disasters, you may want to consider increasing the interval between two block report operations. For this, you can use the dfs.blockreport.intervalMsec parameter in the hdfs-site.xml file.
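As a sketch, such an override would take the following form in hdfs-site.xml; the interval shown here is illustrative, and the right value depends on your cluster's size and churn.

```xml
<!-- Sketch of an hdfs-site.xml override; choose an interval suited to your cluster -->
<property>
  <name>dfs.blockreport.intervalMsec</name>
  <value>21600000</value> <!-- block report interval in milliseconds (here, 6 hours) -->
</property>
```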

You can also consider batching multiple block receipt events in a single RPC message.

Incremental block report operation rate

Indicates the rate at which this DataNode performs incremental block report operations.

Operations/Sec

Average incremental block report operation time

Indicates the average time that this DataNode takes to perform incremental block report operations.

Milliseconds

Ideally, the value of this measure should be low. A high value or a consistent increase in the value of this measure could be indicative of latency in incremental block reporting.

Cache report operations

Indicates the count of cache report operations performed by this DataNode.

Number

A steady increase in the value of this measure could indicate that cache reports are being sent frequently by a DataNode. Too many cache report operations can be unhealthy for the cluster.

Cache report operation rate

Indicates the rate at which cache report operations are performed by this DataNode.

Operations/Sec

A high value of this measure may indicate frequent changes in cache state, which caused the DataNode to keep sending cache reports to the NameNode. Too many cache report operations occurring in quick succession may be unhealthy for the cluster.

You may want to consider modifying the interval between two cache report operations. For this, use the dfs.cachereport.intervalMsec parameter in the hdfs-site.xml file.
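As a sketch, such an override would look like the following in hdfs-site.xml; the 10-second interval shown is illustrative only, and the appropriate value depends on how volatile your cache state is.

```xml
<!-- Sketch of an hdfs-site.xml override; pick an interval suited to your workload -->
<property>
  <name>dfs.cachereport.intervalMsec</name>
  <value>10000</value> <!-- cache report interval in milliseconds (here, 10 seconds) -->
</property>
```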

Average cache report operation time

Indicates the average time that this DataNode took to perform cache report operations.

Milliseconds

Ideally, the value of this measure should be low. A high value or a consistent increase in the value of this measure could be indicative of latency in cache reporting.