Hadoop Data Node Heartbeats Test

A 'heartbeat' is a signal that a DataNode sends periodically to the NameNode as a sign of vitality. If the NameNode stops receiving these signals, it is understood that the DataNode or the TaskTracker has run into health issues or technical problems.

The default heartbeat interval is 3 seconds. If the NameNode does not receive any heartbeats from a DataNode for a period of 10 minutes, then a 'Heartbeat Lost' condition occurs and the corresponding DataNode is deemed to be dead/unavailable.
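Both the heartbeat interval and the recheck interval that the NameNode uses when deciding whether a DataNode is dead are configurable in hdfs-site.xml. The fragment below shows the standard HDFS property names with their default values; it is a reference sketch, not a recommended configuration:

```xml
<!-- hdfs-site.xml: heartbeat-related settings (defaults shown) -->
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>  <!-- seconds between DataNode heartbeats -->
</property>
<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>300000</value>  <!-- milliseconds; used by the NameNode when checking for stale/dead DataNodes -->
</property>
```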

To avoid the loss of heartbeats and the consequent failure of a DataNode, administrators must keep a close watch on the heartbeats sent by each DataNode to the NameNode, detect issues in the transmission of heartbeats, and clear the bottlenecks well before the configured timeout period expires and the DataNode is declared dead. This can be achieved using the Hadoop Data Node Heartbeats test!

This test monitors the heartbeats that each DataNode sends to the NameNode. In the process, the test reports the count of heartbeats that every DataNode sent during the last measure period, the rate at which the heartbeats were sent, and the average time taken for the transmission. Alerts are promptly sent if a DataNode fails to send heartbeats or takes too long to do so. This way, administrators can proactively detect problems in heartbeat communication and resolve them before DataNodes die.

Target of the test : A Hadoop cluster

Agent deploying the test : A remote agent

Outputs of the test : One set of results for each DataNode in the target Hadoop cluster

Configurable parameters for the test
Parameter Description

Test Period

How often should the test be executed.

Host

The IP address of the NameNode that processes client connections to the cluster. The NameNode is the master node in the Apache Hadoop HDFS architecture; it maintains and manages the blocks present on the DataNodes (slave nodes), manages the file system namespace, and controls client access to files. It is designed to be a highly available server.

Port

The port at which the NameNode accepts client connections. By default, the NameNode's client connection port is 8020.

Name Node Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. To run API commands on the NameNode and pull metrics, the eG agent needs access to the NameNode's web port.

To determine the correct web port of the NameNode, do the following:

  • Open the hdfs-default.xml file in the hadoop/conf/app directory.
  • Look for the dfs.namenode.http-address parameter in the file.
  • This parameter is configured with the IP address and base port on which the DFS NameNode web user interface listens. The format of this configuration is: <IP_Address>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:50070

Configure the <Port_Number> in the specification as the Name Node Web Port. In the case of the above sample configuration, this will be 50070.
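The steps above can also be scripted. The sketch below parses a configuration file of the kind described above and extracts the port portion of dfs.namenode.http-address; the sample XML content is illustrative only:

```python
import xml.etree.ElementTree as ET

# Illustrative hdfs configuration content, shaped like the file described above.
sample_xml = """
<configuration>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>192.168.10.100:50070</value>
  </property>
</configuration>
"""

def namenode_web_port(xml_text):
    """Return the port portion of the dfs.namenode.http-address property."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == "dfs.namenode.http-address":
            address = prop.findtext("value")   # e.g. "192.168.10.100:50070"
            return address.rsplit(":", 1)[1]   # the port after the last colon
    raise ValueError("dfs.namenode.http-address not found")

print(namenode_web_port(sample_xml))  # -> 50070
```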

Name Node User Name

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the NameNode. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, leave this parameter with its default value, none.
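In Hadoop's simple authentication mode, the user name is typically passed to WebHDFS endpoints as a user.name query parameter. The sketch below only builds such a request URL; the host, port, and user name are placeholder values, not values from this document:

```python
from urllib.parse import urlencode

def webhdfs_url(host, port, path, user="none"):
    """Build a WebHDFS LISTSTATUS request URL, appending user.name
    only when a simple-authentication user is configured."""
    params = {"op": "LISTSTATUS"}
    if user and user != "none":            # "none" mirrors the test's default, meaning no user
        params["user.name"] = user
    return "http://{}:{}/webhdfs/v1{}?{}".format(host, port, path, urlencode(params))

print(webhdfs_url("192.168.10.100", 50070, "/", "hdfs"))
# -> http://192.168.10.100:50070/webhdfs/v1/?op=LISTSTATUS&user.name=hdfs
print(webhdfs_url("192.168.10.100", 50070, "/"))
# -> http://192.168.10.100:50070/webhdfs/v1/?op=LISTSTATUS
```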

Resource Manager IP and Resource Manager Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions.

To pull metrics from the resource manager, the eG agent first needs to connect to the resource manager. For this, you need to configure this test with the IP address/host name of the resource manager and its web port. Use the Resource Manager IP and Resource Manager Web Port parameters to configure these details.

To determine the IP/host name and web port of the resource manager, do the following:

  • Open the yarn-site.xml file in the /opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop directory.
  • Look for the yarn.resourcemanager.webapp.address parameter in the file.
  • This parameter is configured with the IP address/host name and web port of the resource manager. The format of this configuration is: <IP_Address_or_Host_Name>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:8080

Configure the <IP_Address_or_Host_Name> in the specification as the Resource Manager IP, and the <Port_Number> as the Resource Manager Web Port. In the case of the above sample configuration, this will be 8080.
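For reference, the relevant entry in yarn-site.xml has the following shape; the values shown are the sample ones from above, not defaults to copy:

```xml
<!-- yarn-site.xml: resource manager web UI address (sample value) -->
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>192.168.10.100:8080</value>
</property>
```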

Resource Manager Username

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the resource manager. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, leave this parameter with its default value, none.

Measurements made by the test
Measurement Description Measurement Unit Interpretation

Heart beats

Indicates the count of heart beats sent by this DataNode to the NameNode during the last measurement period.

Number

By default, heartbeats are sent every 3 seconds, and the default frequency of a test is 5 minutes. In the default scenario, therefore, 100 heartbeats (300 / 3) should have been sent in a single measure period. If fewer or no heartbeats were sent during a measure period, it could imply a problem with the DataNode. If the heartbeat loss occurred owing to a disk failure on the DataNode, then you may have to replace a disk on the DataNode host or perform a disk hot swap for DataNodes. If a DataNode could not send heartbeats for any other reason, then you may have to recommission that DataNode to add it back to the cluster.
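The expected heartbeat count for any combination of test period and heartbeat interval follows from the same arithmetic; the 5-minute and 3-second figures below are the defaults mentioned above:

```python
def expected_heartbeats(test_period_seconds, heartbeat_interval_seconds=3):
    """Number of heartbeats a healthy DataNode should send in one measure period."""
    return test_period_seconds // heartbeat_interval_seconds

# Default test period of 5 minutes with the default 3-second heartbeat interval:
print(expected_heartbeats(5 * 60))  # -> 100
```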

Heart beat rate

Indicates the rate at which this DataNode sent heartbeats to the NameNode.

Heartbeats/Sec

 

Average heart beat time

Indicates the average time to send a heartbeat from this DataNode to the NameNode.

Milliseconds

A high value or a consistent increase in the value of this measure is a cause for concern, as it means that the DataNode is sending heartbeats slowly to the NameNode. A bad network connection between the DataNode and NameNode is one of the common causes for slow transmission of heartbeats.