Hadoop Resource Manager RPC Activity Test

ResourceManager (RM) is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. It works together with the per-node NodeManagers (NMs) and the per-application ApplicationMasters (AMs).

Clients communicate with the RM via RPC to submit applications, terminate applications, obtain queue information, and retrieve cluster statistics. Nodes in the cluster interact with the RM over RPC for registration, for submitting resource requests, and for routing heartbeats to the YARN scheduler. ApplicationMasters also communicate with the RM via RPC for registration and for submitting termination / unregister / container-allocation / container-deallocation requests to the YARN scheduler. Additionally, the RM manages the secret keys used to authenticate/authorize requests on the various RPC interfaces.

Typically, upon receipt of RPC calls, the RM places them in a queue for execution. If the RM cannot process the queued requests quickly enough, the queue length keeps increasing. If this processing bottleneck is not resolved rapidly, the RM may become overloaded with RPC requests, and may eventually choke and stop responding. Administrators therefore need to keep an eye on the RPC operations performed via each RPC interface on the RM, so that they can promptly detect overload conditions and slow RPC processing.
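
The "queue length keeps increasing" condition described above can be illustrated with a minimal sketch. The function name is_queue_growing and the min_rises threshold are hypothetical and are not part of the test; the sketch only assumes that queue length samples are collected at regular intervals, oldest first.

```python
from typing import Sequence


def is_queue_growing(samples: Sequence[int], min_rises: int = 3) -> bool:
    """Return True if each of the last `min_rises` samples exceeds the one before it.

    `samples` holds queue-length readings, oldest first. `min_rises` is an
    illustrative threshold, not a value taken from the product.
    """
    if len(samples) < min_rises + 1:
        return False  # not enough history to judge a trend
    recent = samples[-(min_rises + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

A single spike does not trip this check; only a sustained rise across consecutive measurement periods does, which is closer to the bottleneck condition the paragraph describes.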

Also, since the RM manages the authentication/authorization requests, administrators need to be able to rapidly capture and investigate authentication/authorization failures, so that the cluster is protected from malicious attacks.

With the help of the Hadoop Resource Manager RPC Activity test, administrators can observe RPC activity on each RPC interface. In the process, they can:

Identify the exact RPC interface that is overloaded with connections;
Detect slowness in RPC request processing well before users complain, and precisely pinpoint the slow interface;
Be alerted to repeated authentication/authorization failures on any interface.

Target of the test : A Hadoop cluster

Agent deploying the test : A remote agent

Outputs of the test : One set of results for each RPC interface on the ResourceManager

Configurable parameters for the test
Parameter	Description
Test Period	How often the test should be executed.
Host	The IP address of the NameNode that processes client connections to the cluster. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a highly available server that manages the File System Namespace and controls access to files by clients.
Port	The port at which the NameNode accepts client connections. By default, the NameNode's client connection port is 8020.
Name Node Web Port	The eG agent collects metrics using Hadoop's WebHDFS REST API. Some of these API calls pull metrics from the NameNode, while others pull metrics from the resource manager. To run API commands on the NameNode and pull metrics, the eG agent needs access to the NameNode's web port. To determine the correct web port of the NameNode, do the following: Open the hdfs-default.xml file in the hadoop/conf/app directory. Look for the dfs.namenode.http-address parameter in the file. This parameter is configured with the IP address and base port on which the DFS NameNode web user interface listens. The format of this configuration is: <IP_Address>:<Port_Number>. A sample configuration is: 192.168.10.100:50070. Configure the <Port_Number> in the specification as the Name Node Web Port. For the sample configuration above, this will be 50070.
Name Node User Name	The eG agent collects metrics using Hadoop's WebHDFS REST API. Some of these API calls pull metrics from the NameNode, while others pull metrics from the resource manager. In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the NameNode. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, leave this parameter at its default value, none.
Resource Manager IP and Resource Manager Web Port	The eG agent collects metrics using Hadoop's WebHDFS REST API. Some of these API calls pull metrics from the NameNode, while others pull metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions. To pull metrics from the resource manager, the eG agent first needs to connect to the resource manager. For this, you need to configure this test with the IP address/host name of the resource manager and its web port. Use the Resource Manager IP and Resource Manager Web Port parameters to configure these details. To determine the IP/host name and web port of the resource manager, do the following: Open the yarn-site.xml file in the /opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop directory. Look for the yarn.resourcemanager.webapp.address parameter in the file. This parameter is configured with the IP address/host name and web port of the resource manager. The format of this configuration is: <IP_Address_or_Host_Name>:<Port_Number>. A sample configuration is: 192.168.10.100:8080. Configure the <IP_Address_or_Host_Name> in the specification as the Resource Manager IP, and the <Port_Number> as the Resource Manager Web Port. For the sample configuration above, the Resource Manager IP will be 192.168.10.100 and the Resource Manager Web Port will be 8080.
Resource Manager Username	The eG agent collects metrics using Hadoop's WebHDFS REST API. Some of these API calls pull metrics from the NameNode, while others pull metrics from the resource manager. In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the resource manager. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, leave this parameter at its default value, none.
Detailed Diagnosis	To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, choose the Off option. The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled: the eG manager license should allow the detailed diagnosis capability, and both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
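
The lookup steps described for the Name Node Web Port and Resource Manager Web Port parameters can also be scripted. The following is a minimal sketch, assuming the configuration file follows the standard Hadoop <configuration>/<property>/<name>/<value> layout. The helper names read_property and split_host_port, and the embedded sample text, are illustrative only and are not part of eG Enterprise or Hadoop.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

# Illustrative sample only; a real yarn-site.xml will contain many more properties.
SAMPLE_YARN_SITE = """<configuration>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>192.168.10.100:8080</value>
  </property>
</configuration>"""


def read_property(xml_text: str, name: str) -> Optional[str]:
    """Return the value of the named property, or None if it is absent or empty."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value else None
    return None


def split_host_port(address: str) -> Tuple[str, int]:
    """Split '<IP_Address_or_Host_Name>:<Port_Number>' into (host, port)."""
    host, _, port = address.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected <host>:<port>, got {address!r}")
    return host, int(port)


address = read_property(SAMPLE_YARN_SITE, "yarn.resourcemanager.webapp.address")
rm_ip, rm_web_port = split_host_port(address)
```

With the sample text above, rm_ip and rm_web_port hold the values to enter as the Resource Manager IP and Resource Manager Web Port. The same two helpers could be pointed at the dfs.namenode.http-address property to obtain the Name Node Web Port.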

Measurements made by the test
Measurement	Description	Measurement Unit	Interpretation
RPC call rate	Indicates the rate at which RPC calls were received by the RM via this interface.	Calls/Sec	A high value is indicative of high RPC activity on an interface.
Average queue time	Indicates the average time RPC requests received via this interface spent in the queue.	Milliseconds	If the value of this measure grows continuously, it is indicative of latency in request processing.
Average processing time	Indicates the average processing time of RPC requests received via this interface.	Milliseconds	If the value of the Average queue time measure increases consistently for an interface, then take a look at the value of this measure for the same interface. If the value of this measure is also increasing alongside the value of the Average queue time measure for an interface, it is a clear indication of a processing bottleneck on that interface.
Authentication success rate	Indicates the rate at which RPC interactions via this interface were successfully authenticated.	Successes/Sec	Ideally, the value of this measure should be high.
Authentication failure rate	Indicates the rate at which RPC communications via this interface failed authentication.	Failures/Sec	A low value is desired for this measure. A significant and unexpected spike in this value could indicate attempts to hack the cluster. Such accesses should be pulled up for closer scrutiny.
Authorization success rate	Indicates the rate at which RPC calls made via this interface were successfully authorized.	Successes/Sec	Ideally, the value of this measure should be high.
Authorization failure rate	Indicates the rate at which RPC calls made via this interface failed authorization.	Failures/Sec	A low value is desired for this measure. A significant and unexpected spike in this value could indicate attempts to hack the cluster. Such accesses should be pulled up for closer scrutiny.
Queue length	Indicates the count of RPC calls received via this interface that are in queue currently.	Number	If this value keeps increasing with time for any interface, it indicates that RPC request processing is probably bottlenecked on that interface.
Current open connections	Indicates the count of RPC connections currently open on this interface.	Number	This is a good indicator of the current load on an interface. Compare the value of this measure across interfaces to identify the overloaded interface. You can also use the detailed diagnosis of this measure to see which user has how many connections open via the interface. In the event of an overload, these detailed metrics will point you to the precise user responsible for it.
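
To show how measures like those in the table above could be derived, here is a hedged sketch. It does not describe how the eG agent collects its data. It assumes that two successive snapshots of a Hadoop RPC metrics record are available, for example from a bean named like Hadoop:service=ResourceManager,name=RpcActivityForPort<port> exposed by the daemon's /jmx servlet. The attribute names are those of Hadoop's RPC metrics, but the mapping to this test's measures, and the function name summarize, are assumptions made for illustration.

```python
from typing import Dict

# Assumed mapping: measures computed as a per-second rate from cumulative counters.
RATE_COUNTERS = {
    "RPC call rate": "RpcQueueTimeNumOps",
    "Authentication success rate": "RpcAuthenticationSuccesses",
    "Authentication failure rate": "RpcAuthenticationFailures",
    "Authorization success rate": "RpcAuthorizationSuccesses",
    "Authorization failure rate": "RpcAuthorizationFailures",
}

# Assumed mapping: measures read directly from the latest snapshot.
POINT_IN_TIME = {
    "Average queue time": "RpcQueueTimeAvgTime",
    "Average processing time": "RpcProcessingTimeAvgTime",
    "Queue length": "CallQueueLength",
    "Current open connections": "NumOpenConnections",
}


def summarize(prev: Dict[str, float], curr: Dict[str, float],
              interval_secs: float) -> Dict[str, float]:
    """Derive per-interface measures from two snapshots taken interval_secs apart."""
    if interval_secs <= 0:
        raise ValueError("interval_secs must be positive")
    result: Dict[str, float] = {}
    for measure, attr in RATE_COUNTERS.items():
        delta = curr.get(attr, 0) - prev.get(attr, 0)
        # A negative delta means the counter was reset (e.g. a restart); report 0.
        result[measure] = max(delta, 0) / interval_secs
    for measure, attr in POINT_IN_TIME.items():
        result[measure] = curr.get(attr, 0)
    return result
```

Rates are computed from deltas because the underlying counters only ever increase while the daemon is running. Queue length and open connections are instantaneous values, so only the latest snapshot is used for them.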