Hadoop Shuffle Errors Test

In Hadoop MapReduce, the process of shuffling is used to transfer data from the mappers to the necessary reducers. It is the process in which the system sorts the unstructured data and transfers the output of the map as an input to the reducer. It is a necessary process for reducers otherwise they would not receive any input. This means that errors in the shuffling process can cause MapReduce jobs to fail or to slow down! It is therefore important that administrators promptly capture these errors, diagnose their reasons, and fix them. This is where the Hadoop Shuffle Errors test helps!

This test monitors the shuffling process and reports the number and nature of errors encountered in the process. Detailed diagnostics reveals the precise jobs that were impacted by the shuffle errors. This way, the test not only alerts administrators to bottlenecks in the shuffling process, but also aids troubleshooting by revealing the jobs that were affected.

Target of the test : A Hadoop cluster

Agent deploying the test : A remote agent

Outputs of the test : One set of the results for the target Hadoop cluster

Configurable parameters for the test
Parameter Description

Test Period

How often should the test be executed.

Host

The IP address of the NameNode that processes client connections to the cluster. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients.

Port

The port at which the NameNode accepts client connections. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients. By default, the NameNode's client connection port is 8020.

Name Node Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients. To run API commands on the NameNode and pull metrics, the eG agent needs access to the NameNode's web port.

To determine the correct web port of the NameNode, do the following:

  • Open the hdfs-default.xml file in the hadoop/conf/app directory.
  • Look for the dfs.namenode.http-address parameter in the file.
  • This parameter is configured with the IP address and base port where the DFS NameNode web user interface listens on. The format of this configuration is: <IP_Address>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:50070

Configure the <Port_Number> in the specification as the Name Node Web Port. In the case of the above sample configuration, this will be 50070.

Name Node User Name

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a very highly available server that manages the File System Namespace and controls access to files by clients.

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the NameNode. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, then do not disturb the default value none of this parameter.

Resource  Manager IP and Resource Manager Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions.

To pull metrics from the resource manager, the eG agents first needs to connect to the resource manager. For this, you need to configure this test with the IP address/host name of the resource manager and its web port. Use the Resource Manager IP and Resource Manager Web Port parameters to configure these details.

To determine the IP/host name and web port of the resource manager, do the following:

  • Open the yarn-site.xml file in the /opt/mapr/hadoop/hadoop-2. x.x/etc/hadoop directory.
  • Look for the yarn.resourcemanager.webapp.address parameter in the file.
  • This parameter is configured with the IP address/host name and web port of the resource manager. The format of this configuration is: <IP_Address_or_Host_Name>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:8080

Configure the <IP_Address_or_Host_Name> in the specification as the Resource Manager IP, and the <Port_Number> as the Resource Manager Web Port. In the case of the above sample configuration, this will be 8080.

Resource Manager Username

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions.

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the resource manager. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available/required, then do not disturb the default value none of this parameter.

Report Manager Time

By default, this flag is set to Yes, indicating that, by default, the detailed diagnosis of this test, if enabled, will report the start time and finish time of jobs in the manager’s time zone. If this flag is set to No, then these times are shown in the time zone of the system where the agent is running(i.e., the system on which the remote agent is running).

Detailed Diagnosis

To make diagnosis more efficient and accurate, the eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option. The option to selectively enabled/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

  • The eG manager license should allow the detailed diagnosis capability
  • Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
Measurements made by the test
Measurement Description Measurement Unit Interpretation

Bad ID type

Indicates the number of BAD_ID errors that occurred in the shuffling process.

Number

BAD_ID errors are those that are related with the interpretation of IDs from shuffle headers.

If this measure reports a value greater than 0, then you can use the detailed diagnosis of the measure to identify the jobs that were affected by these errors.

Connection type

Indicates the number of errors of type CONNECTION.

Number

If this measure reports a value greater than 0, then you can use the detailed diagnosis of the measure to identify the jobs that were affected by these errors.

I/O error type

Indicates the number of errors of type I/O_ERROR.

Number

I/O_ERRORs are those that are related with reading and writing intermediate data.

If this measure reports a value greater than 0, then you can use the detailed diagnosis of the measure to identify the jobs that were affected by these errors.

Wrong length type

Indicates the number of errors of type WRONG_LENGTH.

Number

WRONG_LENGTH errors occur when compression and decompression of intermediate data misbehaves.

If this measure reports a value greater than 0, then you can use the detailed diagnosis of the measure to identify the jobs that were affected by these errors.

Wrong map type

Indicates the number of errors of type WRONG_MAP.

Number

WRONG_MAP errors are related to to duplication of the mapper output data (when framework tries to process already processed mapper output).

If this measure reports a value greater than 0, then you can use the detailed diagnosis of the measure to identify the jobs that were affected by these errors.

Wrong reduce type

Indicates the number of errors of type WRONG_REDUCE.

Number

WRONG_REDUCE errors are related to the attempts of shuffling data for wrong reducer (when shuffle for determined reducer tries to shuffle the data for different reducer).

If this measure reports a value greater than 0, then you can use the detailed diagnosis of the measure to identify the jobs that were affected by these errors.