AWS EC2 Container - ECS Tests

AWS users can opt to run instances within Elastic Compute Cloud (EC2) or look into using containers. Amazon EC2 Container Service (ECS) manages Docker containers within AWS, allowing users to easily scale up or down and evaluate and monitor CPU usage. These AWS containers run on a managed cluster of EC2 instances, with ECS automating installation and operation of the cluster infrastructure. The first step to get started with ECS therefore is to create a cluster and launch EC2 instances in it. Then, create task definitions. A task is one or more Docker containers running together for one service or a microservice. When configuring a container in your task definition, you need to define the container name and also indicate how much memory and how many CPU units you want to reserve for each container. Finally, you will have to create a service, so that you can run and maintain a specified number of instances of a task definition simultaneously.

Administrators need to check the resource usage of each cluster periodically, so that they can identify the clusters that have been consistently over-utilizing CPU and memory resources. Resource usage at the individual service-level should also be monitored, so that administrators can figure out whether excessive resource consumption by a cluster arises because the cluster itself does not have enough resources at its disposal, or because one or more services running on the cluster are depleting the resources. Using the AWS EC2 Container - ECS test, administrators can monitor resource usage at both the cluster-level and the service-level.

For each AWS region, this test auto-discovers the clusters configured in that region and the services running on each cluster. CPU and memory usage is then reported for each cluster and service, alongside the CPU and memory reservations (of all tasks) per cluster. These insights help administrators understand where resources are under contention (at the cluster-level, at the service-level, or both) and decide accordingly what needs to be done to optimize resource usage:

  • Should more container instances be added to the cluster to increase the amount of resources at its disposal?
  • Should the task definitions of the resource-hungry services be fine-tuned so that those services have more resources to use?

Target of the test: Amazon EC2 Cloud

Agent deploying the test: A remote agent

Output of the test:

One set of results for each cluster:service pair in each region of the AWS EC2 cloud

First-level descriptor: AWS EC2 region name

Second-level descriptor: cluster name and/or clustername:servicename

Configurable parameters for the test
Parameter Description

Test Period

How often the test should be executed.

Host

The host for which the test is to be configured.

AWS Access Key, AWS Secret Key, Confirm AWS Access Key, Confirm AWS Secret Key

To monitor an Amazon EC2 instance, the eG agent has to be configured with the access key and secret key of a user with a valid AWS account. For this purpose, we recommend that you create a special user on the AWS cloud, obtain the access and secret keys of this user, and configure this test with these keys. The procedure for this is detailed in the Obtaining an Access key and Secret key topic. Make sure you confirm the access and secret keys you provide here by retyping them in the corresponding Confirm text boxes.

Proxy Host and Proxy Port

In some environments, all communication with the AWS EC2 cloud and its regions could be routed through a proxy server. In such environments, you should make sure that the eG agent connects to the cloud via the proxy server and collects metrics. To enable metrics collection via a proxy, specify the IP address of the proxy server and the port at which the server listens against the Proxy Host and Proxy Port parameters. By default, these parameters are set to none, indicating that the eG agent is not configured to communicate via a proxy.

Proxy User Name, Proxy Password, and Confirm Password

If the proxy server requires authentication, specify a valid proxy user name and password in the Proxy User Name and Proxy Password parameters, respectively. Then, confirm the password by retyping it in the Confirm Password text box. By default, these parameters are set to none, indicating that the proxy server does not require authentication.

Proxy Domain and Proxy Workstation

If a Windows NTLM proxy is to be configured for use, then you will additionally have to configure the Windows domain name and the Windows workstation name required for it against the Proxy Domain and Proxy Workstation parameters. If the environment does not support a Windows NTLM proxy, set these parameters to none.

Exclude Region

Here, you can provide a comma-separated list of region names or patterns of region names that you do not want to monitor. For instance, to exclude regions with names that contain 'east' or 'west' from monitoring, your specification should be: *east*,*west*
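To illustrate how such a specification might behave, the sketch below applies a comma-separated pattern list to a set of region names. It assumes shell-style wildcard matching (where * matches any run of characters) and case-insensitive comparison; the matching rules the eG agent actually applies are not described here, and the function name filter_regions is hypothetical.

```python
from fnmatch import fnmatchcase


def filter_regions(regions, exclude_spec):
    """Return the regions that do NOT match any pattern in exclude_spec.

    exclude_spec is a comma-separated string such as "*east*,*west*".
    Assumption: shell-style wildcards, compared case-insensitively.
    """
    patterns = [p.strip().lower() for p in exclude_spec.split(",") if p.strip()]
    return [
        region for region in regions
        if not any(fnmatchcase(region.lower(), p) for p in patterns)
    ]


regions = ["us-east-1", "us-west-2", "eu-central-1", "ap-southeast-1"]
print(filter_regions(regions, "*east*,*west*"))
```

Under these assumptions only eu-central-1 remains, because ap-southeast-1 also contains 'east'. Broad patterns can therefore exclude more regions than intended.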

ECS Filter Name

By default, this test reports metrics for each service that is running on a cluster. Accordingly, ServiceName is the default selection from the ECS Filter Name drop-down. If you do not want service-level metrics, you can configure the test to report resource usage at the cluster-level alone. To do so, select ClusterName from the ECS Filter Name drop-down. The test will then report only cluster names as descriptors.

Detailed Diagnosis

To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, choose the Off option.

The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

  • The eG manager license should allow the detailed diagnosis capability
  • Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.

Measures reported by the test:

Measurement Description Measurement Unit Interpretation

CPU reservation:

The percentage of CPU units that are reserved by running tasks in this cluster.

Percent

This measure is reported at the cluster-level only - i.e., for the ClusterName descriptor alone.

This value is computed using the following formula:

Total CPU units reserved by ECS tasks on the cluster / Total CPU units that were registered for all the container instances in the cluster * 100

A value close to 100% indicates that almost all resources available to the cluster are being reserved by running tasks in that cluster. This implies that additional services cannot be configured on that cluster until more resources are made available to the cluster or until the CPU reservation of running tasks is reduced.
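As a worked illustration of the formula above, the sketch below computes CPU reservation for a hypothetical cluster. The instance sizes and task reservations are invented for the example. It relies on the ECS convention that one vCPU corresponds to 1,024 CPU units, and the function name cpu_reservation_pct is not part of the test.

```python
def cpu_reservation_pct(reserved_units, registered_units):
    """Percentage of registered CPU units reserved by running tasks."""
    if registered_units <= 0:
        raise ValueError("cluster has no registered CPU units")
    return reserved_units / registered_units * 100


# Hypothetical cluster: two container instances with 2 vCPUs each.
registered = 2 * 2 * 1024          # 4096 CPU units registered
# Hypothetical running tasks and the CPU units each one reserves.
task_reservations = [1024, 1024, 512, 512]
reserved = sum(task_reservations)  # 3072 CPU units reserved

print(f"CPU reservation: {cpu_reservation_pct(reserved, registered):.1f}%")
```

With these figures the cluster reports a CPU reservation of 75.0%, leaving 1,024 CPU units unreserved for additional tasks.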

CPU utilization:

Indicates the percentage of CPU units used by this cluster or by this service

Percent

For a cluster, this value is computed using the following formula:

Total CPU units currently used by ECS tasks on this cluster / Total CPU units that were registered for all the container instances in this cluster * 100

A value close to 100% for this measure at the cluster-level could either indicate that the cluster is resource-starved or that one/more services running on the cluster are consuming excessive resources.

If the reason for high CPU usage is the poor resource configuration of the cluster, you may want to add more instances to the cluster to increase its resource base. On the other hand, if the cluster is adequately sized with CPU, you may want to check the value of this measure for each of the services running on the cluster.

For a service, this value is computed using the following formula:

Total CPU units currently used by ECS tasks defined for this service / Total CPU units that are reserved for the tasks defined for this service * 100

Compare the value of this measure across services of a cluster to know which services of that cluster are guilty of over-utilization of CPU. Once the services are identified, check the CPU reservation of the task definitions of those services to determine whether sufficient resources have been allocated to those tasks. If not, increase the reservations to allow optimal resource usage.
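The comparison across services described above can be sketched as follows. The service names and CPU figures are hypothetical, and the helper service_cpu_utilization_pct simply applies the service-level formula; it is not how the test itself gathers the data.

```python
def service_cpu_utilization_pct(used_units, reserved_units):
    """CPU units used by a service's tasks, as a percentage of the
    CPU units reserved for those tasks."""
    if reserved_units <= 0:
        raise ValueError("service has no CPU reservation")
    return used_units / reserved_units * 100


# Hypothetical services: (CPU units currently used, CPU units reserved).
services = {
    "web": (900, 1024),
    "worker": (256, 2048),
    "cache": (480, 512),
}

# Rank services from highest to lowest utilization.
ranked = sorted(
    ((name, service_cpu_utilization_pct(used, reserved))
     for name, (used, reserved) in services.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, pct in ranked:
    print(f"{name}: {pct:.2f}%")
```

In this example cache ranks first even though it uses fewer CPU units than web, because utilization is measured relative to each service's own reservation. A small reservation can make a modest consumer look like the heaviest one.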

Memory reservation:

The percentage of memory that is reserved by running tasks in this cluster.

Percent

This measure is reported at the cluster-level only - i.e., for the ClusterName descriptor alone.

This value is computed using the following formula:

Total amount of memory reserved by ECS tasks on the cluster / Total amount of memory that was registered for all the container instances in the cluster * 100

A value close to 100% indicates that almost all resources available to the cluster are being reserved by running tasks in that cluster. This implies that additional services cannot be configured on that cluster until more resources are made available to the cluster or until the memory reservation of running tasks is reduced.
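The sketch below works through the memory reservation formula and the headroom check it implies. All figures are hypothetical and expressed in MiB, and both function names are illustrative. Note that the headroom check is only a necessary condition: a task must also fit within the unreserved memory of a single container instance, which a cluster-wide total cannot show.

```python
def memory_reservation_pct(reserved_mib, registered_mib):
    """Percentage of registered memory reserved by running tasks."""
    if registered_mib <= 0:
        raise ValueError("cluster has no registered memory")
    return reserved_mib / registered_mib * 100


def fits_in_aggregate(task_mib, reserved_mib, registered_mib):
    """True if the cluster-wide unreserved memory covers the new task.

    Necessary but not sufficient: the task must also fit on one
    container instance.
    """
    return task_mib <= registered_mib - reserved_mib


registered = 2 * 8192   # hypothetical: two instances registering 8192 MiB each
reserved = 15360        # hypothetical: memory reserved by running tasks

print(f"Memory reservation: {memory_reservation_pct(reserved, registered):.2f}%")
print("2048 MiB task fits:", fits_in_aggregate(2048, reserved, registered))
print("512 MiB task fits:", fits_in_aggregate(512, reserved, registered))
```

Here the reservation is 93.75%, so only 1,024 MiB remains unreserved: the 2,048 MiB task cannot be accommodated, while the 512 MiB task might be.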

Memory utilization:

Indicates the percentage of memory used by this cluster or by this service

Percent

For a cluster, this value is computed using the following formula:

Total memory currently used by ECS tasks on this cluster / Total memory that is registered for all the container instances in this cluster * 100

A value close to 100% for this measure at the cluster-level could either indicate that the cluster is resource-starved or that one/more services running on the cluster are consuming excessive resources.

If the reason for high memory usage is the poor resource configuration of the cluster, you may want to add more instances to the cluster to increase its resource base. On the other hand, if the cluster is adequately sized with memory, you may want to check the value of this measure for each of the services running on the cluster.

For a service, this value is computed using the following formula:

Total memory currently used by ECS tasks defined for this service / Total memory reserved for the tasks defined for this service * 100

Compare the value of this measure across services of a cluster to know which services of that cluster are guilty of over-utilization of memory. Once the services are identified, check the memory reservation of the task definitions of those services to determine whether sufficient resources have been allocated to those tasks. If not, increase the reservations to allow optimal resource usage.
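The cluster-versus-service reasoning above can be condensed into a small triage routine. This is a simplified sketch rather than the logic the test applies: the 90% threshold for "close to 100%", the function name triage, and the returned action strings are all assumptions made for illustration.

```python
def triage(cluster_pct, service_pcts, threshold=90.0):
    """Suggest where to look when utilization is high.

    cluster_pct:  cluster-level utilization, in percent.
    service_pcts: mapping of service name -> service-level utilization.
    threshold:    hypothetical cut-off for "close to 100%".
    Returns (suggested action, list of services to review).
    """
    hot_services = sorted(
        name for name, pct in service_pcts.items() if pct >= threshold
    )
    if hot_services:
        # One or more services are exhausting their own reservations.
        return "review task-definition reservations", hot_services
    if cluster_pct >= threshold:
        # The cluster is busy but no single service stands out.
        return "consider adding container instances", []
    return "ok", []


print(triage(95.0, {"web": 97.0, "worker": 60.0}))
print(triage(95.0, {"web": 50.0, "worker": 60.0}))
print(triage(40.0, {"web": 10.0}))
```

Checking services first reflects the guidance above: adding instances does not help a service that is constrained by its own task-definition reservations.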

Is active?

Indicates whether this cluster is active or not.

 

This measure is reported for clusters only - i.e., only when the 'ECS Filter Name' parameter is set to ClusterName.

The value that this measure can report and its corresponding numeric value are listed in the table below:

Measure Value    Numeric Value
Yes              1
No               0

Note:

By default, this measure reports one of the Measure Values in the table above to indicate whether or not a cluster is active. In the graph of this measure, however, the state is indicated using the numeric equivalents only.

Active services

Indicates the number of active services in this cluster.

Number

This measure is reported for clusters only - i.e., only when the 'ECS Filter Name' parameter is set to ClusterName.

To know the active services, use the detailed diagnosis of this measure. The details displayed in the detailed diagnosis include:

Service Name: The name of the active service

CPU utilization: Average percentage of CPU units that are used in the service.

Memory utilization: Average percentage of memory that is used in the service

Desired tasks: Number of containers desired per service

Running tasks: Number of containers running per service

Pending tasks: Number of containers pending per service

 

Running task

Indicates the number of tasks that are in the running state in this cluster.

Number

These measures are reported for clusters only - i.e., only when the 'ECS Filter Name' parameter is set to ClusterName.

A task definition is required to run Docker containers in Amazon ECS. You can define multiple containers in a task definition.

Using a Service, Amazon ECS can run and maintain a specified number of instances (the "desired count") of a task definition simultaneously in an Amazon ECS cluster.

Typically, when a task is first pushed into ECS, it is in the PENDING state. Once the task starts running, it switches to the RUNNING state.

At any given point in time, the count of running tasks should equal the number of desired tasks for that cluster. If a task in a service stops, it is terminated and a new task is launched in its place. This process continues until the service reaches its desired number of running tasks.
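A check of running tasks against desired tasks can be sketched as follows. The record shape (name, desired, running, pending) is hypothetical and merely mirrors the per-service columns shown in the detailed diagnosis of the Active services measure; the function name task_shortfall is likewise illustrative.

```python
def task_shortfall(services):
    """Map each under-provisioned service to its missing task count.

    services: iterable of dicts with 'name', 'desired', 'running' and
    'pending' keys (a hypothetical record shape).
    """
    return {
        svc["name"]: svc["desired"] - svc["running"]
        for svc in services
        if svc["running"] < svc["desired"]
    }


services = [
    {"name": "web", "desired": 4, "running": 4, "pending": 0},
    {"name": "worker", "desired": 3, "running": 1, "pending": 2},
    {"name": "cache", "desired": 2, "running": 1, "pending": 0},
]
print(task_shortfall(services))
```

In this example worker is short by two tasks but has two pending, so it is probably still starting up. The cache service is short by one with nothing pending, which is the case more likely to need attention.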

Pending task

Indicates the number of tasks in pending state in this cluster.

Number

Container instances

Indicates the number of container instances that are assigned to this cluster.

Number

This measure is reported for clusters only - i.e., only when the 'ECS Filter Name' parameter is set to ClusterName.

To know which container instances are assigned to a cluster, use the detailed diagnosis of this measure. The details displayed as part of detailed diagnosis include the Instance ID, region to which the instance belongs, and the status of the instance. Additionally, the following are also displayed per instance:

Registered CPU: Number of CPU units registered on the container instance.

Remaining CPU: Number of CPU units remaining on the container instance.

Registered memory: Number of memory units registered on the container instance.

Remaining memory: Number of memory units remaining on the container instance.

Running tasks: Number of running tasks for the container instance

Pending tasks: Number of pending tasks for the container instance

Is container agent connected: Indicates whether the container agent is connected to the instance or not.

Docker version: The version of Docker running on the container instance

From these details, you can quickly isolate those container instances that are running out of CPU and memory resources and those that are disconnected from the container agent.
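The isolation step described above can be sketched as a simple filter over the per-instance details. The field names, the sample records, and the 10% free-resource threshold are all assumptions made for illustration; they do not reflect the format in which the detailed diagnosis is actually exported.

```python
def flag_instances(instances, min_free_fraction=0.10):
    """Return (instance_id, reasons) for instances that need attention.

    An instance is flagged if its container agent is disconnected or if
    its remaining CPU or memory falls below min_free_fraction of what it
    registered (a hypothetical threshold).
    """
    flagged = []
    for inst in instances:
        reasons = []
        if not inst["agent_connected"]:
            reasons.append("agent disconnected")
        if inst["remaining_cpu"] < min_free_fraction * inst["registered_cpu"]:
            reasons.append("low CPU")
        if inst["remaining_memory"] < min_free_fraction * inst["registered_memory"]:
            reasons.append("low memory")
        if reasons:
            flagged.append((inst["instance_id"], reasons))
    return flagged


# Hypothetical detailed-diagnosis rows.
instances = [
    {"instance_id": "i-aaa", "agent_connected": True,
     "registered_cpu": 2048, "remaining_cpu": 1024,
     "registered_memory": 8192, "remaining_memory": 4096},
    {"instance_id": "i-bbb", "agent_connected": True,
     "registered_cpu": 2048, "remaining_cpu": 100,
     "registered_memory": 8192, "remaining_memory": 512},
    {"instance_id": "i-ccc", "agent_connected": False,
     "registered_cpu": 2048, "remaining_cpu": 2048,
     "registered_memory": 8192, "remaining_memory": 8192},
]
for instance_id, reasons in flag_instances(instances):
    print(instance_id, "->", ", ".join(reasons))
```

With the sample rows, i-bbb is flagged for low CPU and low memory and i-ccc for a disconnected agent, while i-aaa passes.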