Hadoop Name Node Uptime Test

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds the successful requests by returning a list of relevant DataNode servers where the data lives.

The NameNode is a Single Point of Failure for the HDFS Cluster. When the NameNode goes down, the file system goes offline.

High uptime of the NameNode is therefore imperative for the high availability of the file system. To make sure that the file system is always accessible to users, administrators need to continuously measure the uptime of the NameNode, promptly detect unscheduled reboots of the NameNode, investigate the root-cause for such reboots, and fix it before users notice. Similarly, they also need to keep an eye on changes to operational status of the NameNode and file system, so they can instantly detect if the NameNode is not active or if the file system is in the safe mode. The Hadoop Name Node Uptime test helps administrators achieve all of the above!

This test continuously tracks and reports the uptime of the NameNode, thus revealing whether/not a scheduled reboot occurred. Likewise, the test also alerts administrators to unexpected reboots, so they can quickly investigate and determine the reason for the same and eliminate it. This way, the test helps administrators ensure the high uptime of the NameNode. Additionally, the test periodically checks the status of the NameNode and the file system and brings abnormalities to light.

Target of the test : A Hadoop cluster

Agent deploying the test : A remote agent

Outputs of the test : One set of the results for the target Hadoop cluster

Configurable parameters for the test
Parameter Description

Test Period

How often should the test be executed.

Host

The IP address of the NameNode that processes client connections to the cluster. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients.

Port

The port at which the NameNode accepts client connections. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients. By default, the NameNode's client connection port is 8020.

Name Node Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients. To run API commands on the NameNode and pull metrics, the eG agent needs access to the NameNode's web port.

To determine the correct web port of the NameNode, do the following:

  • Open the hdfs-default.xml file in the hadoop/conf/app directory.
  • Look for the dfs.namenode.http-address parameter in the file.
  • This parameter is configured with the IP address and base port where the DFS NameNode web user interface listens on. The format of this configuration is: <IP_Address>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:50070

Configure the <Port_Number> in the specification as the Name Node Web Port. In the case of the above sample configuration, this will be 50070.

Name Node User Name

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients.

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the NameNode. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, then do not disturb the default value none of this parameter.

Resource  Manager IP and Resource Manager Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions.

To pull metrics from the resource manager, the eG agents first needs to connect to the resource manager. For this, you need to configure this test with the IP address/host name of the resource manager and its web port. Use the Resource Manager IP and Resource Manager Web Port parameters to configure these details.

To determine the IP/host name and web port of the resource manager, do the following:

  • Open the yarn-site.xml file in the /opt/mapr/hadoop/hadoop-2. x.x/etc/hadoop directory.
  • Look for the yarn.resourcemanager.webapp.address parameter in the file.
  • This parameter is configured with the IP address/host name and web port of the resource manager. The format of this configuration is: <IP_Address_or_Host_Name>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:8080

Configure the <IP_Address_or_Host_Name> in the specification as the Resource Manager IP, and the <Port_Number> as the Resource Manager Web Port. In the case of the above sample configuration, this will be 8080.

Resource Manager Username

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions.

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the resource manager. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, then do not disturb the default value none of this parameter.

Report Manager Time

By default, this flag is set to Yes, indicating that, by default, the detailed diagnosis of this test, if enabled, will report the shutdown and reboot times of the resource manager in the manager’s time zone. If this flag is set to No, then the shutdown and reboot times are shown in the time zone of the system where the agent is running(i.e., the system on which the remote agent is running).

Detailed Diagnosis

To make diagnosis more efficient and accurate, the eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option. The option to selectively enabled/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

  • The eG manager license should allow the detailed diagnosis capability
  • Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
Measurements made by the test
Measurement Description Measurement Unit Interpretation

Has the system been rebooted?

Indicates whether the NameNode has been rebooted during the last measurement period or not.

 

If this measure shows 1, it means that the NameNode was rebooted during the last measurement period. By checking the time periods when this metric changes from 0 to 1, an administrator can determine the times when the NameNode was rebooted. 

The detailed diagnosis of this measure, if enabled, displays when the NameNodewas shutdown, for how long it was shutdown, when it was restarted, and whether/not the NameNode is in maintenance.

Uptime during the last measure period

Indicates the time period that the NameNode has been up since the last time this test ran.

Seconds

If the NameNode has not been rebooted during the last measurement period and the agent has been running continuously, this value will be equal to the measurement period. If the NameNode was rebooted during the last measurement period, this value will be less than the measurement period of the test. For example, if the measurement period is 300 secs, and if the NameNode was rebooted 120 secs back, this metric will report a value of 120 seconds.  The accuracy of this metric is dependent on the measurement period - the smaller the measurement period, greater the accuracy.

Total uptime of the system

Indicates the total time that the NameNode has been up since its last reboot.

 

This measure displays the number of years, months, days, hours, minutes and seconds since the last reboot. Administrators may wish to be alerted if a server has been running without a reboot for a very long period. Setting a threshold for this metric allows administrators to determine such conditions.

Status

Indicates the current status of the NameNode.

 

The values that this measure reports and their corresponding numeric values are listed in the table below:

Measure Value Numeric Value
Standby or Stopping 0
Initializing 1

Active

2

Note:

By default, this measure reports the Measure Values listed in the table above to indicate the status of the NameNode. The graph of this measure however, represents the same using the numeric equivalents only.

Current state of the file system

Indicates the current state of the filesystem.

 

The values that this measure reports and their corresponding numeric values are listed in the table below:

Measure Value Numeric Value
Safemode 0
Operational 1

Note:

By default, this measure reports the Measure Values listed in the table above to indicate the status of the file system. The graph of this measure however, represents the same using the numeric equivalents only.