Hadoop Name Node Journal Transactions Test
In a typical HA cluster, two or more separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the others are in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standbys are simply acting as workers, maintaining enough state to provide a fast failover if necessary.
In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called “JournalNodes” (JNs). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby node sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.
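The "majority of these JNs" rule can be illustrated with a little arithmetic. The sketch below is only an illustration of majority quorums, not code taken from Hadoop; the function names are hypothetical.

```python
def quorum_size(journal_nodes: int) -> int:
    """Smallest number of JournalNodes that forms a majority."""
    if journal_nodes < 1:
        raise ValueError("at least one JournalNode is required")
    return journal_nodes // 2 + 1


def tolerated_failures(journal_nodes: int) -> int:
    """How many JournalNodes can be down while a majority is still reachable."""
    return journal_nodes - quorum_size(journal_nodes)


def edit_is_durable(acks: int, journal_nodes: int) -> bool:
    """An edit counts as durably logged once a majority has acknowledged it."""
    return acks >= quorum_size(journal_nodes)
```

With three JournalNodes, two acknowledgements are enough and one node may fail. Adding a fourth node raises the quorum to three without tolerating any more failures, which is why odd counts are the usual choice.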
If the Active node takes too long to record changes in the JournalNode machines, or if the Standby node reads edits from the JournalNode machines very slowly, the Standby node will not be able to synchronize its namespace state with that of the Active node. If the Active node then fails, the Standby node will not be able to take over from it, rendering the cluster unavailable to end users.
To ensure the high availability of the cluster, it is imperative that administrators keep tabs on how quickly edits are written to and read from JournalNode machines by the Active and Standby nodes (respectively) in the cluster, detect slowness (if any), and determine where the bottleneck is. This is what the Hadoop Name Node Journal Transactions test does!
This test measures the time taken by the Active node in a cluster to log changes to the JournalNodes. It also tracks how quickly the Standby nodes read edits from the JournalNodes to synchronize with the Active node. This way, the test warns administrators of slowness in Journal transactions that can impact cluster availability.
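To make the two kinds of figures concrete (a rate and an average time), here is a minimal sketch of how they could be derived from two samples of cumulative counters. It does not describe how the eG agent computes its measures; the `JournalSample` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JournalSample:
    timestamp: float         # seconds at which the sample was taken
    transactions: int        # cumulative number of journal transactions
    transaction_time: float  # cumulative seconds spent on those transactions


def interval_metrics(prev: JournalSample, curr: JournalSample) -> tuple[float, float]:
    """Return (transactions per second, average seconds per transaction)."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    done = curr.transactions - prev.transactions
    if done < 0:
        raise ValueError("counter went backwards (process restart?)")
    spent = curr.transaction_time - prev.transaction_time
    rate = done / elapsed
    # With no transactions in the interval there is no meaningful average.
    average = spent / done if done else 0.0
    return rate, average


# Usage: 600 transactions taking 3 seconds in total, over a 60-second interval.
rate, average = interval_metrics(
    JournalSample(0.0, 1000, 2.0),
    JournalSample(60.0, 1600, 5.0),
)
```

The same calculation applies to syncs if the counters track sync operations instead of transactions.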
Target of the test : A Hadoop cluster
Agent deploying the test : A remote agent
Outputs of the test : One set of results for the Hadoop cluster being monitored
Parameter | Description |
---|---|
Test Period | How often the test should be executed. |
Host | The IP address of the NameNode that processes client connections to the cluster. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients. |
Port | The port at which the NameNode accepts client connections. By default, the NameNode's client connection port is 8020. |
Name Node Web Port | The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. To run API commands on the NameNode and pull metrics, the eG agent needs access to the NameNode's web port. To determine the correct web port of the NameNode, do the following: Configure the <Port_Number> in the specification as the Name Node Web Port. In the case of the above sample configuration, this will be 50070. |
Name Node User Name | In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the NameNode. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available or required, leave this parameter at its default value, none. |
Resource Manager IP and Resource Manager Web Port | Some of the API calls that the eG agent issues get metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions. To pull metrics from the resource manager, the eG agent first needs to connect to the resource manager. For this, you need to configure this test with the IP address/host name of the resource manager and its web port. Use the Resource Manager IP and Resource Manager Web Port parameters to configure these details. To determine the IP/host name and web port of the resource manager, do the following: Configure the <IP_Address_or_Host_Name> in the specification as the Resource Manager IP, and the <Port_Number> as the Resource Manager Web Port. In the case of the above sample configuration, this will be 8080. |
Resource Manager Username | In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the resource manager. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available or required, leave this parameter at its default value, none. |
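
As an illustration of how the NameNode parameters above fit together, the sketch below assembles a WebHDFS request URL from a host, a web port, and an optional simple authentication user. It is a sketch under assumptions, not the eG agent's implementation: the helper name `webhdfs_url`, the plain `http` scheme, the choice of the `GETFILESTATUS` operation, and the sample address are all illustrative. The `/webhdfs/v1` prefix, the `op` parameter, and the `user.name` parameter are standard WebHDFS conventions.

```python
from urllib.parse import urlencode


def webhdfs_url(host: str, web_port: int, path: str = "/",
                op: str = "GETFILESTATUS", user_name: str = "none") -> str:
    """Build a WebHDFS URL against the NameNode's web port."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    params = {"op": op}
    # "none" is the parameter's default and means no user name is sent.
    if user_name and user_name.lower() != "none":
        params["user.name"] = user_name
    return f"http://{host}:{web_port}/webhdfs/v1{path}?{urlencode(params)}"


# Usage with placeholder values (the address is not taken from a real cluster):
anonymous = webhdfs_url("192.168.10.100", 50070)
as_user = webhdfs_url("192.168.10.100", 50070, user_name="hdfs")
```

Treating `none` as "send no user" mirrors the default described for the user name parameters; a cluster that serves its web interface over HTTPS would need a different scheme.
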
Measurement | Description | Measurement Unit | Interpretation |
---|---|---|---|
Journal transactions | Indicates the rate at which Journal transactions were executed. | Transactions/Sec | This represents the rate at which the Active node logged changes to its namespace in the JournalNode machines. A sudden or consistent drop in the value of this measure can indicate one of the following: If the slowness is owing to the latter, it is a cause for concern and hence warrants an investigation. |
Average time of journal transactions | Indicates the average time taken to execute Journal transactions. | Seconds | A sudden or steady increase in the value of this measure is worrisome, as it implies that the Active node is not logging changes to the JournalNode machines as quickly as it should. Such a delay can occur if: |
Journal syncs rate | Indicates the rate at which the Standby nodes read edits from the JournalNode machines and synchronized with the Active node. | Syncs/Sec | A sudden or consistent drop in the value of this measure can indicate one of the following: If the slowness is owing to the latter, it is a cause for concern and hence warrants an investigation. |
Average time of journal syncs | Indicates the average time taken by Standby nodes to read edits from the JournalNode machines and synchronize with the Active node. | Seconds | A sudden or steady increase in the value of this measure is worrisome, as it implies that the Standby node is not synchronizing with the Active node as quickly as it should. Such a delay can occur if: |
Journal transactions batched in sync | Indicates the rate at which Journal transactions were batched in sync. | Transactions/Sec | |
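
The interpretations above speak of a "sudden or consistent drop" in a rate and a "sudden or steady increase" in an average time. One simple way to flag such changes is to compare the latest value with the mean of recent values. The sketch below assumes that approach; the function names and the 50% thresholds are arbitrary illustrations, not eG defaults.

```python
from statistics import mean


def relative_change(history: list[float], latest: float) -> float:
    """Change of `latest` relative to the mean of `history` (0.2 means +20%)."""
    baseline = mean(history)  # raises StatisticsError if history is empty
    if baseline == 0:
        return 0.0 if latest == 0 else float("inf")
    return (latest - baseline) / baseline


def rate_dropped(history: list[float], latest: float, max_drop: float = 0.5) -> bool:
    """True if a rate measure fell by at least `max_drop` against its baseline."""
    return relative_change(history, latest) <= -max_drop


def latency_rose(history: list[float], latest: float, max_rise: float = 0.5) -> bool:
    """True if an average-time measure rose by at least `max_rise`."""
    return relative_change(history, latest) >= max_rise
```

For example, with recent transaction rates of 100, 110, and 90 per second, a reading of 40 is a 60% drop and would be flagged, while a reading of 80 would not. A longer history window smooths out brief dips and is better suited to spotting a consistent decline.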