Hadoop Name Node Retry Cache Test

The NameNode uses a retry cache to prevent retries of non-idempotent operations (e.g., create, append, and delete) from failing. If retry caching is enabled on the NameNode, the NameNode tracks previously received non-idempotent requests based on the ClientId and CallId of the requests, and stores the corresponding responses in the cache. Sometimes, a request may be completed, but the client may not receive the response (e.g., owing to a network failure). If such a client retries the request, the response is served from the retry cache instead of the operation being executed again.

Administrators can allocate heap memory to the cache for storing the responses. To optimize memory usage, administrators can also configure how long entries 'live' in the cache before they expire. If these settings are not chosen prudently, the retry cache may be unable to serve client retry requests, thus causing such requests to fail.

To avoid this, administrators must continuously monitor the usage of the retry cache, promptly spot any sudden or steady decrease in cache hits, and quickly determine the cause of the poor cache usage. This is where the Hadoop Name Node Retry Cache test helps!

This test tracks retry requests to the NameNode and reports the rate at which these requests are served by the retry cache on the node. In the process, administrators can figure out whether or not the cache is being used optimally. The test also tracks the count of entries cleared from the cache from time to time; with the help of this metric, administrators can determine whether the expiry period of cache entries has been set correctly.
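For reference, the NameNode exposes these retry-cache counters through its metrics system, so they can also be sanity-checked manually against the NameNode's JMX servlet. The sketch below assumes a NameNode web port of 50070 and the stock metrics bean name RetryCache.NameNodeRetryCache; both are assumptions, so adjust them for your cluster:

    # Query the NameNode's JMX servlet for the retry-cache counters
    # (CacheHit, CacheCleared, CacheUpdated). The host, port, and bean
    # name below are sample values; substitute your own.
    curl "http://192.168.10.100:50070/jmx?qry=Hadoop:service=NameNode,name=RetryCache.NameNodeRetryCache"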

Target of the test : A Hadoop cluster

Agent deploying the test : A remote agent

Outputs of the test : One set of results for the Hadoop cluster being monitored

Configurable parameters for the test

Test Period

How often should the test be executed?

Host

The IP address of the NameNode that processes client connections to the cluster. The NameNode is the master node in the Apache Hadoop HDFS architecture: a highly available server that maintains and manages the blocks present on the DataNodes (slave nodes), manages the file system namespace, and controls client access to files.

Port

The port at which the NameNode accepts client connections. By default, this is port 8020.

Name Node Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, others pull metrics from the resource manager. To run API commands on the NameNode and pull metrics, the eG agent needs access to the NameNode's web port.

To determine the correct web port of the NameNode, do the following:

  • Open the hdfs-default.xml file in the hadoop/conf/app directory.
  • Look for the dfs.namenode.http-address parameter in the file.
  • This parameter is configured with the IP address and base port on which the DFS NameNode web user interface listens. The format of this configuration is <IP_Address>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:50070

Configure the <Port_Number> in the specification as the Name Node Web Port. In the case of the above sample configuration, this will be 50070.
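
For illustration, the property definition inside the file would look something like the snippet below; the address and port shown are sample values only:

    <!-- Sample only: IP address and base port of the DFS NameNode web UI -->
    <property>
      <name>dfs.namenode.http-address</name>
      <value>192.168.10.100:50070</value>
    </property>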

Name Node User Name

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the NameNode. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, leave this parameter set to its default value, none.
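
For instance, under simple authentication, WebHDFS REST calls carry the user name in the user.name query parameter. A manual call along the lines of what the agent issues might look like this; the host, port, and user name are sample values, not the agent's exact request:

    # List the HDFS root directory via WebHDFS as the 'hdfs' user;
    # all values here are illustrative samples
    curl "http://192.168.10.100:50070/webhdfs/v1/?op=LISTSTATUS&user.name=hdfs"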

Resource Manager IP and Resource Manager Web Port

Some of the eG agent's WebHDFS REST API calls pull metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions.

To pull metrics from the resource manager, the eG agent first needs to connect to it. For this, you need to configure this test with the IP address/host name of the resource manager and its web port, using the Resource Manager IP and Resource Manager Web Port parameters.

To determine the IP/host name and web port of the resource manager, do the following:

  • Open the yarn-site.xml file in the /opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop directory.
  • Look for the yarn.resourcemanager.webapp.address parameter in the file.
  • This parameter is configured with the IP address/host name and web port of the resource manager. The format of this configuration is: <IP_Address_or_Host_Name>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:8080

Configure the <IP_Address_or_Host_Name> in the specification as the Resource Manager IP, and the <Port_Number> as the Resource Manager Web Port. In the case of the above sample configuration, this will be 8080.
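
For illustration, the property definition inside yarn-site.xml would look something like the following; the address and port are sample values only:

    <!-- Sample only: IP address/host name and web port of the resource manager -->
    <property>
      <name>yarn.resourcemanager.webapp.address</name>
      <value>192.168.10.100:8080</value>
    </property>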

Resource Manager Username

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the resource manager. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, leave this parameter set to its default value, none.

Measurements made by the test

Retry cache cleared

Indicates the number of entries that were cleared from the retry cache during the last measurement period.

Measurement unit: Number

A consistently high value for this measure is a cause for concern, as it indicates that many entries are being cleared from the cache in each measurement period. When responses are cleared from the cache at frequent intervals, the likelihood of a retry request not finding a match in the cache is high; the resulting cache misses can in turn cause retry requests to fail.

To avoid this, you may want to revisit how long entries are configured to be retained in the cache. If the retention period is very short, entries will naturally be cleared from the cache frequently, and even the latest responses may be evicted. In that case, a retried request may not find its corresponding response in the cache, causing the request to fail.

To prevent this, you can increase the retention period of cache entries via the dfs.namenode.retrycache.expirytime.millis parameter (its default is defined in hdfs-default.xml; site-specific overrides are conventionally placed in hdfs-site.xml).
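
As a sketch, an override in hdfs-site.xml that doubles the retention period from the shipped default of 600000 milliseconds (10 minutes) might read as follows; the value chosen is an example, not a recommendation:

    <!-- Sample override: retain retry cache entries for 20 minutes
         (the shipped default is 600000 ms, i.e., 10 minutes) -->
    <property>
      <name>dfs.namenode.retrycache.expirytime.millis</name>
      <value>1200000</value>
    </property>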

Retry cache hit

Indicates the rate at which client retries are serviced by the retry cache.

Measurement unit: Hits/Sec

A high value is desired for this measure, as it is indicative of a healthy retry cache and optimal cache usage. If this value steadily decreases, it could imply that the cache is unable to service many of the retry requests. The probable reasons for this and their possible resolutions are detailed below:

  • The cache is poorly sized: If adequate heap memory is not allocated to the cache, it cannot hold many entries and therefore cannot service retry requests effectively. In such a situation, you may want to allocate more heap memory to the retry cache by increasing the percentage value configured against dfs.namenode.retrycache.heap.percent.
  • The cache is configured with a short retention period: As explained for the Retry cache cleared measure above, a short retention period causes entries, including the latest responses, to be cleared from the cache frequently, so a retried request may not find its response there and may fail. To prevent this, increase the retention period via the dfs.namenode.retrycache.expirytime.millis parameter; a sample override covering both properties is shown after this list.
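
A minimal sketch of such an override in hdfs-site.xml is shown below; both values are illustrative only (the shipped default for dfs.namenode.retrycache.heap.percent is 0.03):

    <!-- Sample overrides only; size these to your workload -->
    <property>
      <name>dfs.namenode.retrycache.heap.percent</name>
      <value>0.05</value>
    </property>
    <property>
      <name>dfs.namenode.retrycache.expirytime.millis</name>
      <value>1200000</value>
    </property>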

Retry cache updated

Indicates the rate at which the retry cache was updated.

Measurement unit: Updates/Sec