Hadoop Data Node Activity Test

The Apache Hadoop HDFS architecture follows a master/slave model, in which a cluster comprises a single NameNode (the master node) and all the other nodes are DataNodes (slave nodes).

DataNodes are the slave nodes in HDFS. The actual data is stored on DataNodes. A functional filesystem has more than one DataNode, with data replicated across them.

On startup, a DataNode connects to the NameNode, spinning until that service comes up. It then responds to requests from the NameNode for filesystem operations. Local and remote client applications can talk directly to a DataNode, once the NameNode has provided the location of the data. Similarly, MapReduce operations farmed out to TaskTracker instances near a DataNode talk directly to the DataNode to access the files. In addition, DataNodes periodically perform block verification to identify corrupt blocks.

The NameNode also initiates replication of blocks on the DataNodes as and when necessary. DataNodes also cache blocks in off-heap caches, based on caching instructions they receive from the NameNode.

Each of these operations imposes load on a DataNode. Since the I/O load should be uniformly distributed across the DataNodes in a Hadoop cluster, an administrator needs to closely observe the I/O activity on every DataNode, so that load-balancing irregularities can be captured promptly. Administrators should also assess how each DataNode is processing I/O requests, so that bottlenecks in request servicing on any DataNode can be proactively detected. They should also check whether block verification has occurred on any DataNode, so that verification failures are quickly detected. Furthermore, since block caching reduces disk I/O, administrators should ensure that adequate blocks are cached on every DataNode. With the help of the Hadoop Data Node Activity test, administrators can monitor all the activities discussed above on every DataNode. In the process, they can rapidly identify overloaded DataNodes, slow DataNodes, DataNodes where block verification has failed, and DataNodes where caching is sub-optimal.

Target of the test : A Hadoop cluster

Agent deploying the test : A remote agent

Outputs of the test : One set of results for each DataNode in the target Hadoop cluster

Configurable parameters for the test
Parameter Description

Test Period

How often should the test be executed?

Host

The IP address of the NameNode that processes client connections to the cluster. The NameNode is the master node in the Apache Hadoop HDFS architecture; it maintains and manages the blocks present on the DataNodes (slave nodes). It is a highly available server that manages the file system namespace and controls access to files by clients.

Port

The port at which the NameNode accepts client connections. The NameNode is the master node in the Apache Hadoop HDFS architecture; it maintains and manages the blocks present on the DataNodes (slave nodes). It is a highly available server that manages the file system namespace and controls access to files by clients. By default, the NameNode's client connection port is 8020.
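
This port is typically the one specified in the fs.defaultFS property (fs.default.name in older Hadoop releases) in the core-site.xml file. A sample entry is shown below; the IP address 192.168.10.100 is only a placeholder:

    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://192.168.10.100:8020</value>
    </property>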

Name Node Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, others pull metrics from the resource manager. The NameNode is the master node in the Apache Hadoop HDFS architecture; it maintains and manages the blocks present on the DataNodes (slave nodes). It is a highly available server that manages the file system namespace and controls access to files by clients. To run API commands on the NameNode and pull metrics, the eG agent needs access to the NameNode's web port.

To determine the correct web port of the NameNode, do the following:

  • Open the hdfs-default.xml file in the hadoop/conf/app directory.
  • Look for the dfs.namenode.http-address parameter in the file.
  • This parameter is configured with the IP address and base port on which the DFS NameNode web user interface listens. The format of this configuration is: <IP_Address>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:50070

Configure the <Port_Number> in the specification as the Name Node Web Port. In the case of the above sample configuration, this will be 50070.
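
Alternatively, if the Hadoop command-line utilities are available on the NameNode host, you can typically read this setting directly using the hdfs getconf command:

    hdfs getconf -confKey dfs.namenode.http-address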

Name Node User Name

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, others pull metrics from the resource manager. The NameNode is the master node in the Apache Hadoop HDFS architecture; it maintains and manages the blocks present on the DataNodes (slave nodes). It is a highly available server that manages the file system namespace and controls access to files by clients.

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the NameNode. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, leave this parameter at its default value, none.
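
For reference, WebHDFS passes the simple authentication user name as the user.name query parameter of each REST request. A typical request is of the form shown below; the host, port, and user name are placeholders:

    http://192.168.10.100:50070/webhdfs/v1/?op=GETFILESTATUS&user.name=hdfs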

Resource Manager IP and Resource Manager Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, others pull metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions.

To pull metrics from the resource manager, the eG agent first needs to connect to it. For this, you need to configure this test with the IP address/host name of the resource manager and its web port, using the Resource Manager IP and Resource Manager Web Port parameters.

To determine the IP/host name and web port of the resource manager, do the following:

  • Open the yarn-site.xml file in the /opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop directory.
  • Look for the yarn.resourcemanager.webapp.address parameter in the file.
  • This parameter is configured with the IP address/host name and web port of the resource manager. The format of this configuration is: <IP_Address_or_Host_Name>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:8080

Configure the <IP_Address_or_Host_Name> in the specification as the Resource Manager IP, and the <Port_Number> as the Resource Manager Web Port. In the case of the above sample configuration, these will be 192.168.10.100 and 8080, respectively.
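
For reference, the corresponding entry in the yarn-site.xml file typically looks like the sample below; the address shown is only a placeholder:

    <property>
      <name>yarn.resourcemanager.webapp.address</name>
      <value>192.168.10.100:8080</value>
    </property>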

Resource Manager Username

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, others pull metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions.

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the resource manager. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, leave this parameter at its default value, none.
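
For reference, the sketch below illustrates the kind of REST call involved, using the standard YARN ResourceManager cluster-metrics resource (/ws/v1/cluster/metrics). This is only an illustration of the API, not necessarily the exact call the eG agent issues; the host, port, and user name are placeholders:

    import json
    import urllib.request

    # Placeholders: substitute your Resource Manager IP, web port, and
    # (if required) the simple authentication user name.
    RM_HOST, RM_PORT, RM_USER = "192.168.10.100", 8080, "yarn"

    # /ws/v1/cluster/metrics is the standard YARN ResourceManager REST
    # resource for cluster-wide metrics; user.name carries simple auth.
    url = ("http://%s:%d/ws/v1/cluster/metrics?user.name=%s"
           % (RM_HOST, RM_PORT, RM_USER))
    with urllib.request.urlopen(url) as resp:
        metrics = json.load(resp)

    # For example, the number of currently active NodeManagers:
    print(metrics["clusterMetrics"]["activeNodes"])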

Measurements made by the test
Measurement Description Measurement Unit Interpretation

Data read rate

Indicates the rate at which data was read from this DataNode.

Reads/Sec

Compare the values of these measures across DataNodes to identify the node that is slowest in processing read requests.

Block read rate

Indicates the rate at which blocks were read from this DataNode.

Blocks/Sec

Read from local clients

Indicates the rate at which local clients performed read operations on this DataNode.

Operations/Sec

Compare the values of these measures across DataNodes to figure out whether the load of read operations is uniformly distributed across all DataNodes. In the process, you can identify those DataNodes that are handling significantly more read requests than the rest.

Read from remote clients

Indicates the rate at which remote clients performed read operations on this DataNode.

Operations/Sec

Data write rate

Indicates the rate at which data was written to this DataNode.

Writes/Sec

Compare the values of these measures across DataNodes to identify the node that is slowest in processing write requests.

Block write rate

Indicates the rate at which blocks were written to this DataNode.

Blocks/Sec

Write from local clients

Indicates the rate at which local clients performed write operations on this DataNode.

Operations/Sec

Compare the values of these measures across DataNodes to figure out whether the load of write operations is uniformly distributed across all DataNodes. In the process, you can identify those DataNodes that are handling significantly more write requests than the rest.

Write from remote clients

Indicates the rate at which remote clients performed write operations on this DataNode.

Operations/Sec

Block replication rate

Indicates the rate at which blocks in this DataNode were replicated.

Blocks/Sec

Block removal rate

Indicates the rate at which blocks were removed from this DataNode.

Blocks/Sec

Blocks verified rate

Indicates the rate at which blocks on this DataNode were verified.

Blocks/Sec

Block verification is used to identify corrupt blocks on a DataNode. During a write operation, when data is written into HDFS, the DataNode computes and stores a checksum for that data; this checksum helps detect corruption that occurs during data transmission. When the same data is later read from HDFS, the client verifies the checksum returned by the DataNode against the checksum it computes over the data it receives, so as to detect any corruption that may have occurred while the data was stored on the DataNode. In addition, every DataNode periodically runs block verification on all the blocks it stores. This helps identify and fix corrupt data before a read operation encounters it; with the block verification service, HDFS can proactively identify and repair corruption.
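
The checksum mechanism can be sketched conceptually as follows. Note that this is only an illustration: HDFS actually computes a CRC32C checksum for every fixed-size chunk of a block (512 bytes by default, per the dfs.bytes-per-checksum setting), whereas this sketch uses Python's zlib.crc32 over a whole block:

    import zlib

    def write_block(data: bytes) -> tuple[bytes, int]:
        # On write, the block is stored together with its checksum.
        return data, zlib.crc32(data)

    def verify_block(data: bytes, stored_checksum: int) -> None:
        # On read, and during periodic block verification, the checksum
        # is recomputed and compared against the stored value.
        if zlib.crc32(data) != stored_checksum:
            raise IOError("corrupt block detected: checksum mismatch")

    block, checksum = write_block(b"hdfs block payload")
    verify_block(block, checksum)                    # passes
    # verify_block(block[:-1] + b"X", checksum)      # would raise IOError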

Block verification failure rate

Indicates the rate at which block verifications failed on this DataNode.

Blocks/Sec

If block verification fails, corrupted blocks may continue to remain on a DataNode. Ideally, therefore, the value of this measure should be 0.

Blocks cached rate

Indicates the rate at which blocks on this node were cached.

Blocks/Sec

Whenever a request is made to a DataNode to read a block, the DataNode reads the block from disk. If you know that a block will be read many times, it is a good idea to cache it in memory. Hadoop therefore allows you to cache blocks: you can specify which file (or directory) to cache and for how long, and the blocks are held in off-heap caches. Hadoop provides centralized cache management for these block caches: all cache requests are made to the NameNode, which then instructs the respective DataNodes to cache the blocks in their off-heap caches.
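
For example, cache directives are typically managed through the hdfs cacheadmin utility. The commands below create a cache pool and then cache a directory in it for seven days; the pool name and path are placeholders:

    hdfs cacheadmin -addPool reports
    hdfs cacheadmin -addDirective -path /user/hadoop/reports -pool reports -ttl 7d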

Caching minimizes I/O processing overheads and improves the overall performance of a DataNode. This is why a high rate of caching on a DataNode is ideal; in other words, a high value is desired for this measure. If the value of this measure is very low for a DataNode, that node may process requests slowly.

Blocks uncached rate

Indicates the rate at which blocks on this DataNode were uncached.

Blocks/Sec