Hadoop FS-Image Editlogs

FsImage is a file stored on the NameNode's local (OS) filesystem that contains the complete directory structure (namespace) of HDFS, including the mapping of files to data blocks. The DataNodes on which each block resides are not persisted in the FsImage; the NameNode learns them from the block reports sent by the DataNodes. The NameNode reads the FsImage when it starts.

EditLogs is a transaction log that records changes to the HDFS file system and actions performed on the HDFS cluster, such as the addition of a new block, replication, or deletion. In short, it records the changes made since the last FsImage was created.

Every time the NameNode restarts, the EditLogs are applied to the FsImage to produce the latest snapshot of the file system. However, NameNode restarts are rare in production clusters. Because of this, you may encounter the following issues:

  • The EditLog grows unwieldy in size, particularly where the NameNode runs for a long period of time without a restart;
  • A NameNode restart takes longer, as too many changes now have to be merged;
  • If the NameNode crashes and cannot recover its EditLogs, there will be significant data loss, as the FsImage available at the time of the restart is very old.

The Secondary NameNode helps overcome the above issues by taking over from the NameNode the responsibility of merging EditLogs with the FsImage. It works as follows:

  • The Secondary NameNode obtains the FsImage and EditLogs from the NameNode at regular intervals.
  • The Secondary NameNode loads both the FsImage and EditLogs into main memory and applies each operation from the EditLogs to the FsImage.
  • Once a new FsImage is created, Secondary NameNode copies the image back to the NameNode.
  • The NameNode will use the new FsImage for the next restart, thus reducing startup time.
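The merge step described above can be illustrated with a toy model. This is only a sketch: the `FsImage` class, the `checkpoint` function, the two operation names, and the transaction IDs are hypothetical simplifications, and real FsImage and EditLog files use Hadoop's own binary formats.

```python
from dataclasses import dataclass, field


@dataclass
class FsImage:
    """Toy namespace snapshot: the paths known as of a given transaction."""
    last_txid: int = 0
    paths: set = field(default_factory=set)


def checkpoint(image, edits):
    """Replay edits newer than the image and return a new, merged image."""
    paths = set(image.paths)
    txid = image.last_txid
    for edit_txid, op, path in sorted(edits):
        if edit_txid <= image.last_txid:
            continue  # this change is already reflected in the image
        if op == "create":
            paths.add(path)
        elif op == "delete":
            paths.discard(path)
        else:
            raise ValueError(f"unknown operation: {op}")
        txid = edit_txid
    return FsImage(last_txid=txid, paths=paths)


# Hypothetical sample data: an old image plus three later edits.
old_image = FsImage(last_txid=2, paths={"/data", "/data/a.txt"})
edit_log = [
    (3, "create", "/data/b.txt"),
    (4, "delete", "/data/a.txt"),
    (5, "create", "/logs"),
]
new_image = checkpoint(old_image, edit_log)
print(new_image.last_txid, sorted(new_image.paths))
```

The longer the interval between checkpoints, the more entries the loop has to replay, which is why an unmerged EditLog lengthens NameNode startup.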

However, this seemingly fail-proof process is not without issues. Delays in this process can cause a NameNode to start up without the latest FsImage at its disposal. Such delays can occur if:

  • The Secondary NameNode takes too long to download the EditLogs from the NameNode;
  • The NameNode is slow in uploading FsImages to the Secondary NameNode and/or in downloading the updated FsImages from the Secondary NameNode.

To avoid such delays, administrators will have to closely monitor the communication between the NameNode and Secondary NameNode, proactively detect any slowness in the upload and/or download of FsImages / EditLogs, and promptly initiate measures to isolate and remove the source of the slowness. This is where the Hadoop FS Image EditLogs test helps!

This test monitors the following:

  • How quickly the Secondary NameNode downloads EditLogs from the NameNode;
  • How quickly the NameNode uploads FsImages to, and downloads updated FsImages from, the Secondary NameNode.

In the process, the test sheds light on latencies in communication and processing that could be slowing down uploads/downloads between the primary and secondary nodes in the cluster.

Target of the test : A Hadoop cluster

Agent deploying the test : A remote agent

Outputs of the test : One set of the results for the Hadoop cluster being monitored

Configurable parameters for the test
Parameter Description

Test Period

How often the test should be executed.

Host

The IP address of the NameNode that processes client connections to the cluster. NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a highly available server that manages the File System Namespace and controls access to files by clients.

Port

The port at which the NameNode accepts client connections. By default, the NameNode's client connection port is 8020.

Name Node Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. To run API commands on the NameNode and pull metrics, the eG agent needs access to the NameNode's web port.

To determine the correct web port of the NameNode, do the following:

  • Open the hdfs-default.xml file in the hadoop/conf/app directory.
  • Look for the dfs.namenode.http-address parameter in the file.
  • This parameter is configured with the IP address and base port on which the DFS NameNode web user interface listens. The format of this configuration is: <IP_Address>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:50070

Configure the <Port_Number> in the specification as the Name Node Web Port. In the case of the above sample configuration, this will be 50070.
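The lookup described above can also be scripted. The following is a minimal sketch, assuming the file follows the usual Hadoop `<configuration>/<property>/<name>/<value>` layout; `SAMPLE_CONFIG`, `read_property`, and `split_address` are hypothetical names introduced for illustration, and the sample value is the one shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical file content, reduced to the single property of interest.
SAMPLE_CONFIG = """<configuration>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>192.168.10.100:50070</value>
  </property>
</configuration>
"""


def read_property(xml_text, property_name):
    """Return the value of a named Hadoop property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == property_name:
            return prop.findtext("value", default="").strip()
    return None


def split_address(address):
    """Split '<host>:<port>' into a (host, port) pair."""
    host, _, port = address.rpartition(":")
    return host, int(port)


address = read_property(SAMPLE_CONFIG, "dfs.namenode.http-address")
host, web_port = split_address(address)
print(host, web_port)
```

The same two helpers should work for any other `<host>:<port>` property stored in this layout, such as yarn.resourcemanager.webapp.address.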

Name Node User Name

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager.

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the NameNode. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available or required, leave this parameter at its default value, none.

Resource Manager IP and Resource Manager Web Port

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager. The YARN Resource Manager Service (RM) is the central controlling authority for resource management and makes resource allocation decisions.

To pull metrics from the resource manager, the eG agent first needs to connect to the resource manager. For this, you need to configure this test with the IP address/host name of the resource manager and its web port. Use the Resource Manager IP and Resource Manager Web Port parameters to configure these details.

To determine the IP/host name and web port of the resource manager, do the following:

  • Open the yarn-site.xml file in the /opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop directory.
  • Look for the yarn.resourcemanager.webapp.address parameter in the file.
  • This parameter is configured with the IP address/host name and web port of the resource manager. The format of this configuration is: <IP_Address_or_Host_Name>:<Port_Number>. Given below is a sample configuration:

    192.168.10.100:8080

Configure the <IP_Address_or_Host_Name> in the specification as the Resource Manager IP, and the <Port_Number> as the Resource Manager Web Port. In the case of the above sample configuration, these will be 192.168.10.100 and 8080, respectively.

Resource Manager Username

The eG agent collects metrics using Hadoop's WebHDFS REST API. While some of these API calls pull metrics from the NameNode, some others get metrics from the resource manager.

In some Hadoop configurations, a simple authentication user name may be required for running API commands and collecting metrics from the resource manager. When monitoring such Hadoop installations, specify the name of the simple authentication user here. If no such user is available or required, leave this parameter at its default value, none.
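To show how the host, web port, and user name parameters fit together, here is a minimal sketch of building a REST URL for either service. With Hadoop's simple (pseudo) authentication, the user is passed in the `user.name` query parameter. The helper name `rest_url` is hypothetical, and treating the value none as "send no user" is an assumption about how the parameter is interpreted, not documented agent behavior.

```python
from urllib.parse import urlencode


def rest_url(host, web_port, path, params=None, user_name="none"):
    """Build an HTTP URL, adding user.name only when a real user is set."""
    query = dict(params or {})
    if user_name and user_name.strip().lower() != "none":
        query["user.name"] = user_name.strip()
    url = f"http://{host}:{web_port}{path}"
    return f"{url}?{urlencode(query)}" if query else url


# WebHDFS call against the NameNode web port, without and with a user.
print(rest_url("192.168.10.100", 50070, "/webhdfs/v1/", {"op": "LISTSTATUS"}))
print(rest_url("192.168.10.100", 50070, "/webhdfs/v1/", {"op": "LISTSTATUS"}, "hdfs"))

# Cluster information call against the resource manager web port.
print(rest_url("192.168.10.100", 8080, "/ws/v1/cluster/info", user_name="yarn"))
```

The addresses and ports are the sample values used in this section; the user names `hdfs` and `yarn` are placeholders.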

Measurements made by the test
Measurement Description Measurement Unit Interpretation

Edits downloads from secondary node

Indicates the rate at which the Secondary NameNode downloads EditLogs.

Downloads/Sec

A low value for this measure or a steady decrease in the value of this measure could indicate that the Secondary NameNode is slow in downloading edits. One reason for this could be the size of the edits - if too many changes/edits need to be downloaded, then the download process will be slow. Another reason could be the poor quality of the network connection between the NameNode and the Secondary NameNode.

Average edits download time

Indicates the time taken for the EditLogs to be downloaded by the Secondary NameNode.

Seconds

A low value is desired for this measure. An unusually high value is indicative of slowness when downloading edits. One reason for this could be the size of the edits - if too many changes/edits need to be downloaded, then the download process will be slow. Another reason could be the poor quality of the network connection between the NameNode and the Secondary NameNode.

FsImage downloads from secondary node

Indicates the rate at which the updated FsImages are downloaded from the Secondary NameNode.

Downloads/Sec

A high value is desired for this measure. A low value is indicative of latency when downloading the latest snapshot of data from the Secondary NameNode. One reason for this could be the size of the FsImage - if too many changes/edits were applied to the old FsImage, the resultant snapshot will be large. Large files naturally take longer to download. Another reason could be the poor quality of the network connection between the NameNode and the Secondary NameNode.

Average FsImage download time

Indicates the time taken to download the updated FsImages from the Secondary NameNode.

Seconds

A low value is desired for this measure. A high value indicates that the NameNode is slow in downloading FsImages. One reason for this could be the size of the FsImage - if too many changes/edits were applied to the old FsImage, the resultant snapshot will be large. This can delay downloading. Another reason could be the poor quality of the network connection between the NameNode and the Secondary NameNode.

FsImage uploads to secondary namenode

Indicates the rate at which FsImages were uploaded to the Secondary NameNode.

Uploads/Sec

A high value is desired for this measure. A low value is indicative of latency when uploading the FsImage from the NameNode to the Secondary NameNode. One reason for this could be the size of the FsImage - if the FsImage to be updated is large in size, it will take a while for the NameNode to upload it to the Secondary NameNode. Another reason could be the poor quality of the network connection between the NameNode and the Secondary NameNode.

Average FsImage upload time

Indicates the time taken to upload the FsImage to the Secondary NameNode.

Seconds

A low value is desired for this measure. A high value indicates that the NameNode is slow in uploading FsImages to the Secondary NameNode. One reason for this could be the size of the FsImage - if the FsImage to be uploaded is large, it will take a while for the NameNode to upload it to the Secondary NameNode. Another reason could be the poor quality of the network connection between the NameNode and the Secondary NameNode.
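The rate and average-time measures above can be approximated from two successive samples of the NameNode's metrics. The sketch below assumes the figures come from the NameNodeActivity bean exposed at the NameNode's /jmx endpoint (GetEditNumOps, GetEditAvgTime, GetImageNumOps, GetImageAvgTime, PutImageNumOps, PutImageAvgTime) and that the average times are reported in milliseconds. Whether this test reads exactly these metrics is an assumption, and the sample payloads contain made-up numbers for illustration only.

```python
NAMENODE_ACTIVITY_BEAN = "Hadoop:service=NameNode,name=NameNodeActivity"

# Measure label -> (operation counter, average time in milliseconds).
OPERATIONS = {
    "edits_download": ("GetEditNumOps", "GetEditAvgTime"),
    "fsimage_download": ("GetImageNumOps", "GetImageAvgTime"),
    "fsimage_upload": ("PutImageNumOps", "PutImageAvgTime"),
}


def find_bean(jmx_payload, bean_name):
    """Return the named bean from a parsed /jmx JSON payload."""
    for bean in jmx_payload.get("beans", []):
        if bean.get("name") == bean_name:
            return bean
    raise KeyError(bean_name)


def transfer_measures(previous, current, interval_seconds):
    """Derive per-second rates and average times (in seconds) from two samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    prev_bean = find_bean(previous, NAMENODE_ACTIVITY_BEAN)
    curr_bean = find_bean(current, NAMENODE_ACTIVITY_BEAN)
    measures = {}
    for label, (count_key, avg_key) in OPERATIONS.items():
        delta = curr_bean[count_key] - prev_bean[count_key]
        if delta < 0:
            # The counter went backwards, e.g. after a NameNode restart.
            delta = curr_bean[count_key]
        measures[label] = {
            "rate_per_sec": delta / interval_seconds,
            "avg_time_sec": curr_bean[avg_key] / 1000.0,
        }
    return measures


# Made-up samples taken one hour (3600 seconds) apart.
previous = {"beans": [{"name": NAMENODE_ACTIVITY_BEAN,
                       "GetEditNumOps": 10, "GetEditAvgTime": 200.0,
                       "GetImageNumOps": 4, "GetImageAvgTime": 1100.0,
                       "PutImageNumOps": 4, "PutImageAvgTime": 1400.0}]}
current = {"beans": [{"name": NAMENODE_ACTIVITY_BEAN,
                      "GetEditNumOps": 13, "GetEditAvgTime": 250.0,
                      "GetImageNumOps": 5, "GetImageAvgTime": 1200.0,
                      "PutImageNumOps": 5, "PutImageAvgTime": 1500.0}]}

for label, values in transfer_measures(previous, current, 3600).items():
    print(label, values)
```

Because checkpoints typically happen at long intervals, the per-second rates are very small numbers; a trend over several measurement periods is more informative than any single value.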