AWS Elastic MapReduce Test

Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. Additionally, you can use Amazon EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

The central component of Amazon EMR is the cluster. A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances. Each instance in the cluster is called a node. Each node has a role within the cluster, referred to as the node type. Amazon EMR also installs different software components on each node type (master node, core node, task node), giving each node a role in a distributed application like Apache Hadoop.

Amazon EMR service architecture consists of several layers, each of which provides certain capabilities and functionality to the cluster.

  • Storage layer: The storage layer includes the different file systems that are used with your cluster. There are several different types of storage options such as, Hadoop Distributed File System (HDFS), EMR File System (EMRFS), and the Local File System.
  • Cluster resource management layer: The resource management layer is responsible for managing cluster resources and scheduling the jobs for processing data.
  • Data processing frameworks layer: The data processing framework layer is the engine used to process and analyze data. The main processing frameworks available for Amazon EMR are Hadoop MapReduce and Spark.
  • Applications and Programs: Amazon EMR supports many applications, such as Hive, Pig, and the Spark Streaming library to provide various capabilities.

When a cluster is launched, you choose the frameworks and applications to install for your data processing needs. Once the chosen applications are installed, you should define the work to be done by the cluster. This can be done using Map and Reduce functions, using the EMR console / EMR API / AWS CLI, using the Hadoop API, or with the help of the interface provided by Hive or Pig. Then, the cluster processes data. To enable data processing in the cluster, you can either submit jobs or queries directly to the applications that are installed on your cluster or run steps in the cluster. Once data processing is complete, the cluster automatically shuts down (unless, auto-terminate is disabled).

If data processing is slow in a cluster, then administrators should be able to quickly and accurately tell what could be causing the slowness - map tasks? reduce tasks? storage contention at the cluster? under-utilization of task nodes and core nodes? To proactively detect processing bottlenecks in a cluster and isolate the root-cause of the bottleneck, administrators should periodically run the AWS Elastic MapReduce Test!

This test automatically discovers the clusters on Amazon EMR and tracks the status of each cluster and the progress of cluster activities. In the process, the test turns the spot light on clusters that are processing data slowly, and provides useful pointers to what could be slowing down processing.

Optionally, you can configure this test to report metrics for every job that a cluster runs. This will enable administrators to identify those jobs that could be slowing down data processing and what tasks are performed by the slow jobs - map tasks? or reduce tasks?

Target of the test: Amazon EC2 Cloud

Agent deploying the test : A remote agent

Outputs of the test : One set of results for each cluster / job

First-level descriptor: AWS Region

Second-level descriptor: JobFlowID or ClusterID / JobId depending upon the option chosen from the EMR Filter Name parameter of this test

Configurable parameters for the test
Parameter Description

Test Period

How often should the test be executed.

Host

The host for which the test is to be configured.

AWS Access Key, AWS Secret Key, Confirm AWS Access Key, Confirm AWS Secret Key

To monitor an Amazon EC2 instance, the eG agent has to be configured with the access key and secret key of a user with a valid AWS account. For this purpose, we recommend that you create a special user on the AWS cloud, obtain the access and secret keys of this user, and configure this test with these keys. The procedure for this has been detailed in the Obtaining an Access key and Secret key topic. Make sure you reconfirm the access and secret keys you provide here by retyping it in the corresponding Confirm text boxes.

Proxy Host and Proxy Port

In some environments, all communication with the AWS EC2 cloud and its regions could be routed through a proxy server. In such environments, you should make sure that the eG agent connects to the cloud via the proxy server and collects metrics. To enable metrics collection via a proxy, specify the IP address of the proxy server and the port at which the server listens against the Proxy Host and Proxy Port parameters. By default, these parameters are set to none , indicating that the eG agent is not configured to communicate via a proxy, by default.

Proxy User Name, Proxy Password, and Confirm Password

If the proxy server requires authentication, then, specify a valid proxy user name and password in the Proxy User Name and Proxy Password parameters, respectively. Then, confirm the password by retyping it in the Confirm Password text box. By default, these parameters are set to none, indicating that the proxy sever does not require authentication by default.

Proxy Domain and Proxy Workstation

If a Windows NTLM proxy is to be configured for use, then additionally, you will have to configure the Windows domain name and the Windows workstation name required for the same against the Proxy Domain and Proxy Workstation parameters. If the environment does not support a Windows NTLM proxy, set these parameters to none.

Exclude Region

Here, you can provide a comma-separated list of region names or patterns of region names that you do not want to monitor. For instance, to exclude regions with names that contain 'east' and 'west' from monitoring, your specification should be: *east*,*west*

EMR Filter Name

By default, this parameter is set to JobFlowId. This is the same as cluster ID, which is the unique identifier of a cluster in the form j-XXXXXXXXXXXXX. In this case, this test will report metrics for every cluster.

If required, you can override this default setting by setting the EMR Filter Name to JobId. You can use this to filter the metrics returned from a cluster down to those that apply to a single job within the cluster.

Measurements made by the test
Measurement Description Measurement Unit Interpretation

Is cluster idle?

Indicates whether/not this cluster is currently idle.

 

This measure is reported only for a cluster - i.e., only if the EMR Filter Name parameter is set to 'JobFlowId'.

This measure reports the value Yes if the cluster is idle, and No if it is not idle. A cluster is said to be idle if it is no longer performing work, but is still alive and accruing charges.

The numeric values that correspond to these measure values are listed in the table below:

Measure Value Numeric Value

Yes

1

No

0

Note:

By default, this measure reports the above-mentioned Measure Values to indicate whether/not a cluster is idle. In the graph of this measure however, the same is indicated using the numeric equivalents.

Failed jobs

Indicates the number of jobs that have failed in this cluster.

Number

This measure is reported only for a cluster - i.e., only if the EMR Filter Name parameter is set to 'JobFlowId'.

Ideally, the value of this measure should be 0.

Running jobs

Indicates the number of jobs in this cluster that are currently running.

Numbe

This measure is reported only for a cluster - i.e., only if the EMR Filter Name parameter is set to 'JobFlowId'.

Running map tasks

By default, this measure reports the total number of map tasks that are currently running in this cluster.

If the EMR Filter Name parameter is set to JobId, then this measure will report the number of map tasks that are currently running for this job.

Number

Hadoop MapReduce is one of the main data processing frameworks available for Amazon EMR. Hadoop MapReduce is an open-source programming model for distributed computing. It simplifies the process of writing parallel distributed applications by handling all of the logic, while you provide the Map and Reduce functions. The Map function maps data to sets of key-value pairs called intermediate results. To put it simply, the Map procedure performs filtering and sorting (such as sorting students by first name into queues, one queue for each name).

Remaining map tasks

By default, this measure reports the total number of map tasks that are still to be run in this cluster.

If the EMR Filter Name parameter is set to JobId, then this measure will report the number of map tasks that are still to be run for this job.

Number

A remaining map task is one that is not in any of the following states: Running, Killed, or Completed.

A low value is desired for this measure. If the value of this measure is abnormally high for a cluster, you may want to change the EMR Filter Name of this test to JobId and see which job is still to run many map tasks. Such jobs could be slowing down the cluster.

Unused map task capacity

Indicates the unused map task capacity of this cluster.

Number

This measure is reported only for a cluster - i.e., only if the EMR Filter Name parameter is set to 'JobFlowId'.

This is calculated as the maximum number of map tasks for a given cluster, less the total number of map tasks currently running in that cluster.

A high value indicates that the cluster can take more load than it is currently. A very low value indicates that the cluster is already under duress as it is running too too many map tasks. The cluster may not be able to run any more map tasks until the ones running currently are either killed or completed. This can choke the cluster and slow down processing. Under such circumstances, you may want to increase the maximum number of map tasks the cluster can run to increase its capacity.

Remaining map tasks per slot

Indicates the ratio of the total map tasks remaining to the total map slots available in this cluster.

Number

This measure is reported only for a cluster - i.e., only if the EMR Filter Name parameter is set to 'JobFlowId'.

A value close to 1 indicates that the available map slots are about to be exhausted. You may want to open more slots by increasing the maximum number of map tasks that a cluster can run.

Running reduce tasks

By default, this measure reports the total number of reduce tasks that are currently running in this cluster.

If the EMR Filter Name parameter is set to JobId, then this measure will report the number of map tasks that are currently running for this job.

Number

Hadoop MapReduce is one of the main data processing frameworks available for Amazon EMR. Hadoop MapReduce is an open-source programming model for distributed computing. It simplifies the process of writing parallel distributed applications by handling all of the logic, while you provide the Map and Reduce functions. The Reduce function combines the intermediate results (i.e., the key-value pairs) that are provided by the Map function, applies additional algorithms, and produces the final output. To put it simply, while the Map procedure performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), the Reduce procedure performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).

Remaining reduce tasks

By default, this measure reports the total number of reduce tasks that are still to be run in this cluster.

If the EMR Filter Name parameter is set to JobId, then this measure will report the number of reduce tasks that are still to be run for this job.

Number

A remaining reduce task is one that is not in any of the following states: Running, Killed, or Completed.

A low value is desired for this measure. If the value of this measure is abnormally high for a cluster, you may want to change the EMR Filter Name of this test to JobId and see which job is still to run many reduce tasks. Such jobs could be slowing down the cluster.

Unused reduce task capacity

Indicates the unused reduce task capacity of this cluster.

Number

This measure is reported only for a cluster - i.e., only if the EMR Filter Name parameter is set to 'JobFlowId'.

This is calculated as the maximum number of reduce tasks for a given cluster, less the total number of reduce tasks currently running in that cluster.

A high value indicates that the cluster can take more load than it is currently. A very low value indicates that the cluster is already under duress as it is running too too many reduce tasks. The cluster may not be able to run any more reduce tasks until the ones running currently are either killed or completed. This can choke the cluster and slow down processing. Under such circumstances, you may want to increase the maximum number of reduce tasks the cluster can run to increase its capacity.

Pending core nodes

Indicates the number of core nodes in this cluster waiting to be assigned.

Number

This measure is reported only for a cluster - i.e., only if the EMR Filter Name parameter is set to 'JobFlowId'.

A core node is slave node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster.

All the core nodes requested may not be immediately available. Until a core node is assigned, tasks cannot be run on it. Therefore, a high value for this measure is often indicative of many pending tasks/requests. This can also slow down data processing in a cluster.

Running core nodes

Indicates the number of core nodes in this cluster that are working currently.

Number

This measure is reported only for a cluster - i.e., only if the EMR Filter Name parameter is set to 'JobFlowId'.

Data nodes receive work from Hadoop

Indicates the percentage of data nodes that are receiving work from Hadoop.

Percent

This measure is reported only for a cluster - i.e., only if the EMR Filter Name parameter is set to 'JobFlowId'.

Core nodes run the Data Node daemon to coordinate data storage as part of the Hadoop Distributed File System (HDFS). . They also run the Task Tracker daemon and perform other parallel computation tasks on data that installed applications require. For example, a core node runs YARN NodeManager daemons, Hadoop MapReduce tasks, and Spark executors.

From the value of this measure, you can infer how many of the Running core nodes in the cluster are running Hadoop tasks and how many are engaged in other non-Hadoop tasks.

Pending task nodes

Indicates the number task nodes in this cluster that are currently waiting to be assigned.

Number

This measure is reported only for a cluster - i.e., only if the EMR Filter Name parameter is set to 'JobFlowId'.

A task node is a slave node with software components that only run tasks. Task nodes are optional.

All the task nodes requested may not be immediately available. Until a task node is assigned, tasks cannot be run on it. Therefore, a high value for this measure is often indicative of many pending tasks/requests. This can also slow down data processing in a cluster.

Functional task trackers

Indicates the percentage of task trackers that are functional.

Percent

This measure is reported only for a cluster - i.e., only if the EMR Filter Name parameter is set to 'JobFlowId'.

Data read from S3

Indicates the amount of data that this cluster read from Amazon S3.

KB

These measures are reported only for a cluster - i.e., only if the EMR Filter Name parameter is set to 'JobFlowId'.

Amazon S3 can be used as the file system for a cluster. EMRFS (EMR File System) is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3.

These measures indicate the I/O generated by a cluster when reading from and writing into Amazon S3, using EMRFS. Compare the value of this measure across clusters to know which cluster is generating the maximum I/O.

Data written to S3

Indicates the amount of data written to Amazon S3.

KB

Data read from HDFS

Indicates the amount of data read from HDFS(Hadoop Distributed File System) by this cluster.

KB

This measure is reported only for a cluster - i.e., only if the EMR Filter Name parameter is set to 'JobFlowId'.

Data written to HDFS

Indicates the amount of data written into HDFS(Hadoop Distributed File System) by this cluster.

KB

This measure is reported only for a cluster - i.e., only if the EMR Filter Name parameter is set to 'JobFlowId'.

HDFS can be used as a file system for an Amazon EMR cluster. HDFS is a distributed, scalable file system for Hadoop. HDFS distributes the data it stores across instances in the cluster, storing multiple copies of data on different instances to ensure that no data is lost if an individual instance fails. HDFS is ephemeral storage that is reclaimed when you terminate a cluster. HDFS is useful for caching intermediate results during MapReduce processing or for workloads that have significant random I/O.

These measures indicate the I/O generated by a cluster when reading from and writing into HDFS. Compare the value of this measure across clusters to know which cluster is generating the maximum I/O.

HDFS utilization

Indicates the percentage of HDFS storage that this cluster is currently utilizing.

Percent

This measure is reported only for a cluster - i.e., only if the EMR Filter Name parameter is set to 'JobFlowId'.

A value close to 100% is a cause for concern as it indicates that the cluster is about to run out of storage space.

Blocks in HDFS has not replicas

Indicates the number of blocks in which the HDFS storage used by this cluster has no replicas.

Number

This measure is reported only for a cluster - i.e., only if the EMR Filter Name parameter is set to 'JobFlowId'.

Blocks that do not have replicas may be corrupt blocks. The value 0 is therefore desired for this measure.

Concurrent data transfers

Indicates the total number of concurrent data transfers made by this cluster.

Number

This measure is reported only for a cluster - i.e., only if the EMR Filter Name parameter is set to 'JobFlowId'.

Has backup failed in HBase?

Indicates whethe/not the last backup failed for this cluster.

 

This measure is reported only for HBase clusters.

HBase is an open source, non-relational, distributed database developed as part of the Apache Software Foundation's Hadoop project.

With HBase on Amazon EMR, you can also back up your HBase data directly to Amazon Simple Storage Service (Amazon S3). If this backup fails for an HBase cluster, then this measure will report the value Yes. If the backup is successful, this measure will report the value No.

The numeric values that correspond to these measure values are listed in the table below:

Measure Value Numeric Value

Yes

1

No

0

Note:

By default, this measure reports the above-mentioned Measure Values to indicate whether/not an HBase backup failed. In the graph of this measure however, the same is indicated using the numeric equivalents only.

Time taken for previous backup to complete

Indicates the amount of time it took the previous backup of this cluster to complete.

Mins

This measure is reported only for HBase clusters.

This metric is set regardless of whether the last completed backup succeeded or failed. While the backup is ongoing, this metric returns the number of minutes after the backup started. A very high value for this measure can therefore indicate a backup delay or a failure.

Elapsed time after last successful backup started

Indicates the number of elapsed minutes after the last successful HBase backup started on this cluster.

Mins

This measure is reported only for HBase clusters.