Hadoop Data Blocks Test

Files in Hadoop's file system are broken into block-sized chunks called data blocks. Hadoop stores each file as a sequence of blocks. The blocks of a file are replicated across Data Nodes for fault tolerance.

The NameNode makes all decisions regarding replication of blocks. The NameNode periodically receives a Blockreport from each of the DataNodes in the cluster. A Blockreport contains a list of all blocks on a DataNode. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary.

The timely and successful replication of blocks is important for minimizing data loss during disaster recovery. It is hence the administrator's responsibility to watch over replication operations and proactively detect any abnormalities in them.

Other common causes of data loss are corrupt blocks and missing blocks. An administrator needs to spot such anomalies rapidly, isolate their cause, and resolve them quickly, so that Hadoop can deliver on its promise of 'reliable storage'.

To this end, administrators can use the Hadoop Data Blocks test. This test monitors data blocks in the Hadoop file system and draws administrator attention to corrupt and missing blocks, prompting administrators to investigate the reasons for such problems and resolve them. The test also monitors replication operations and alerts administrators to deviations in the replication process, so they can determine whether the replication policy should be tweaked to remove the deviations and improve storage reliability.

Target of the test : A Hadoop cluster

Agent deploying the test : A remote agent

Outputs of the test : One set of results for the target Hadoop cluster

Configurable parameters for the test
Parameter Description

Test Period

How often should the test be executed.

Host

The IP address of the NameNode that processes client connections to the cluster. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a highly available server that manages the File System Namespace and controls access to files by clients.

Port

The port at which the NameNode accepts client connections. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a highly available server that manages the File System Namespace and controls access to files by clients. By default, the NameNode's client connection port is 8020.

Name Node Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a highly available server that manages the File System Namespace and controls access to files by clients. To run API commands on the NameNode and pull metrics, the eG agent needs access to the NameNode's web port.

To determine the correct web port of the NameNode, do the following:

  • Open the hdfs-default.xml file in the hadoop/conf/app directory.
  • Look for the dfs.namenode.http-address parameter in the file.
  • This parameter is configured with the IP address and base port on which the DFS NameNode web user interface listens. The format of this configuration is: <IP_Address>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:50070

Configure the <Port_Number> in the specification as the Name Node Web Port. In the case of the above sample configuration, this will be 50070.
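
If you prefer to look up this value programmatically, the short Python sketch below reads the dfs.namenode.http-address property from an HDFS configuration file and prints the port portion. The file path used here is only an assumption; point it at wherever your installation keeps its HDFS configuration.

    # Minimal sketch (assumed file path): extract the web port from the
    # dfs.namenode.http-address property of an HDFS configuration file.
    import xml.etree.ElementTree as ET

    CONF_FILE = "/opt/hadoop/conf/hdfs-default.xml"  # assumption: adjust to your install

    def namenode_web_port(conf_file=CONF_FILE):
        for prop in ET.parse(conf_file).getroot().iter("property"):
            if prop.findtext("name") == "dfs.namenode.http-address":
                # Value looks like "192.168.10.100:50070"; return the port part.
                return prop.findtext("value").rsplit(":", 1)[1]
        return None

    print(namenode_web_port())  # e.g. "50070"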

Name Node User Name

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a highly available server that manages the File System Namespace and controls access to files by clients.

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the NameNode. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, then do not disturb the default value of this parameter, which is none.
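
As a quick way to check that the web port and user name you plan to configure actually work, the hedged Python sketch below makes a single WebHDFS call against the NameNode, passing the simple authentication user as the user.name query parameter. The host, port, and user values are placeholders, not recommendations.

    # Minimal sketch (placeholder host/port/user): verify WebHDFS access to the
    # NameNode, optionally passing a simple-authentication user name.
    import json
    import urllib.request

    NAMENODE_HOST = "192.168.10.100"   # placeholder
    NAMENODE_WEB_PORT = 50070          # placeholder: Name Node Web Port
    SIMPLE_AUTH_USER = "hdfs"          # set to None if your cluster needs no user

    url = f"http://{NAMENODE_HOST}:{NAMENODE_WEB_PORT}/webhdfs/v1/?op=LISTSTATUS"
    if SIMPLE_AUTH_USER:
        url += f"&user.name={SIMPLE_AUTH_USER}"

    with urllib.request.urlopen(url, timeout=10) as resp:
        print(json.dumps(json.load(resp), indent=2))  # directory listing of "/"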

Resource Manager IP and Resource Manager Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions.

To pull metrics from the resource manager, the eG agent first needs to connect to the resource manager. For this, you need to configure this test with the IP address/host name of the resource manager and its web port. Use the Resource Manager IP and Resource Manager Web Port parameters to configure these details.

To determine the IP/host name and web port of the resource manager, do the following:

  • Open the yarn-site.xml file in the /opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop directory.
  • Look for the yarn.resourcemanager.webapp.address parameter in the file.
  • This parameter is configured with the IP address/host name and web port of the resource manager. The format of this configuration is: <IP_Address_or_Host_Name>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:8080

Configure the <IP_Address_or_Host_Name> in the specification as the Resource Manager IP, and the <Port_Number> as the Resource Manager Web Port. In the case of the above sample configuration, the Resource Manager IP will be 192.168.10.100 and the Resource Manager Web Port will be 8080.

Resource Manager Username

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions.

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the resource manager. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, then do not disturb the default value of this parameter, which is none.
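
To confirm that the resource manager details you intend to configure are reachable, you can try a single call to the YARN ResourceManager REST API, as in the hedged Python sketch below. The host, port, and user name are placeholders; the user.name parameter is only needed if your cluster enforces simple authentication on its web endpoints.

    # Minimal sketch (placeholder host/port/user): fetch cluster metrics from the
    # YARN ResourceManager REST API to confirm connectivity.
    import json
    import urllib.request

    RM_HOST = "192.168.10.100"   # placeholder: Resource Manager IP
    RM_WEB_PORT = 8080           # placeholder: Resource Manager Web Port
    SIMPLE_AUTH_USER = None      # e.g. "yarn" if simple authentication is enforced

    url = f"http://{RM_HOST}:{RM_WEB_PORT}/ws/v1/cluster/metrics"
    if SIMPLE_AUTH_USER:
        url += f"?user.name={SIMPLE_AUTH_USER}"

    with urllib.request.urlopen(url, timeout=10) as resp:
        print(json.dumps(json.load(resp)["clusterMetrics"], indent=2))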

Measurements made by the test
Measurement Description Measurement Unit Interpretation

Corrupt blocks

Indicates the current number of blocks that HDFS reports as corrupted.

Number

Ideally, the value of these measures should be 0. A non-zero value for the Corrupt blocks measure indicates that one/more blocks have corrupt replicas. A block is “with corrupt replicas” in HDFS if it has at least one corrupt replica along with at least one live replica. As such, blocks with corrupt replicas do not indicate unavailable data, but they do indicate an increased chance that data may become unavailable.

A non-zero value for the Missing blocks measure indicates that one/more blocks are missing. If none of a block’s replicas are live, the block is called a missing block by HDFS.

Listed below are potential causes of missing or corrupt blocks, and actions you may take to handle them:

  • HDFS automatically fixes corrupt blocks in the background. A failure to do so may indicate a problem with the underlying storage or filesystem of a DataNode. Use the HDFS fsck command to identify which files contain corrupt blocks (see the sketch after this list). Delete the corrupt files and recover them from a backup, if one exists.
  • Some DataNodes are down, and the only replicas of the missing blocks are on those DataNodes. Bring up the failed DataNodes with missing or corrupt blocks to resolve this issue.
  • The corrupt/missing blocks are from files with a replication factor of 1. New replicas cannot be created because the only replica of the block is missing. You may want to increase the replication factor of critical data to 3 to address this issue.
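
The hedged Python sketch below shows one way to run the fsck check mentioned above. It assumes the hdfs command-line client is on the PATH of the machine it runs on and that the executing user is allowed to run fsck against the cluster.

    # Minimal sketch (assumes the hdfs CLI is on the PATH): list files that
    # currently have corrupt blocks using "hdfs fsck -list-corruptfileblocks".
    import subprocess

    def corrupt_file_list(path="/"):
        # Note: the exit status is not inspected here; we only read the report text.
        result = subprocess.run(
            ["hdfs", "fsck", path, "-list-corruptfileblocks"],
            capture_output=True, text=True,
        )
        return result.stdout

    print(corrupt_file_list())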

 

Missing blocks

Indicates the current number of missing blocks.

Number

Allocated blocks in the system

Indicates the current number of allocated blocks in the system.

Number

 

Replication scheduled blocks

Indicates the current number of blocks scheduled for replication.

Number

This value varies as DataNodes go online or offline, and as the replication factor (the dfs.replication setting in hdfs-site.xml) is changed.

Under replicated blocks

Indicates the current number of blocks that are under-replicated.

Number

Under-replicated blocks are blocks with insufficient replication. Hadoop's replication factor is configurable on a per-client or per-file basis. The default replication factor is three, meaning that each block will be stored on three DataNodes. If you see a large, sudden spike in the number of under-replicated blocks, it is likely that a DataNode has died.
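
If under-replication traces back to files created with too low a replication factor, the hedged Python sketch below shows one way to raise it for a critical path with hdfs dfs -setrep. The path and target factor are examples only, and the hdfs client is assumed to be on the PATH.

    # Minimal sketch (example path and factor; assumes the hdfs CLI is on the
    # PATH): raise the replication factor of a critical HDFS path.
    import subprocess

    def set_replication(path, factor=3, wait=True):
        cmd = ["hdfs", "dfs", "-setrep"]
        if wait:
            cmd.append("-w")          # block until re-replication completes
        cmd += [str(factor), path]
        subprocess.run(cmd, check=True)

    set_replication("/data/critical", factor=3)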

Deletion pending blocks

Indicates the current number of blocks that are waiting for deletion.

Number

DataNodes coming back online after being down reduce the number of blocks waiting for deletion.

Excess blocks

Indicates the current number of excess blocks in the cluster.

Number

Excess blocks can be caused by the NameNode losing heartbeats from one or more DataNodes, which results in extra replicas being scheduled.