Elastic Compute Service - ECS Test

Elastic Compute Service (ECS) is a high-performance, stable, reliable, and scalable IaaS-level service provided by Alibaba Cloud. ECS eliminates the need to invest in IT hardware up front and allows you to quickly scale computing resources on demand. This makes ECS more convenient and efficient than physical servers.

An ECS instance is a virtual machine that contains basic computing components such as the vCPU, memory, operating system, network, and disk. You can fully customize and modify all configurations of an ECS instance. After you log on to the Alibaba Cloud Management console, you can manage resources and configure the environment of your ECS instances.

The lifecycle of an ECS instance begins when the instance is created and ends when the instance is released. During this lifecycle, an ECS instance goes through many states. Tracking these states helps administrators quickly resolve user complaints about an instance being unavailable or inaccessible, which in turn improves the user experience with that instance.

ECS instances are categorized into different instance families based on business scenarios. An instance family contains different instance types based on their vCPU and memory specifications. Instance types can have different vCPU and memory specifications, such as the CPU model and clock speed. As business requirements change, organizations may want to switch to an instance type that better suits their requirements. It is the responsibility of an administrator to monitor how an instance uses its vCPU and memory over time, spot potential resource contention, and urge the organization to upgrade or downgrade to an appropriate instance type, so that business operations run smoothly and without interruption.

An ECS instance must contain a system disk to store the operating system and core configurations. An image is used to initialize a system disk and determines the operating system and initial software configurations of an ECS instance. Typically, the capacity of system disks is small. Therefore, it is good practice for administrators to continuously track the usage of and I/O activity on the system disks of every instance, and identify those instances with storage space that is insufficient for their needs. By adding more disks to such instances, administrators can enable the instances to boot up without a glitch, thus allowing end-users on-demand access.

Besides vCPU, memory, and disk usage, administrators should also pay attention to the bandwidth usage of instances, so that bandwidth-hungry instances can be identified.

With the help of the Elastic Compute Service - ECS test, administrators can achieve all of the above. This test auto-discovers the ECS instances deployed in an Alibaba Cloud account. For each instance, this test reports the state of that instance, and alerts administrators if any instance is in an abnormal state (e.g., expired, expiring, locked, etc.). When instance owners complain of being unable to access their instances, administrators can instantly figure out whether the inaccessibility can be attributed to the abnormal state of the instances. In addition, the test keeps a close watch on the resource (vCPU, memory, disk, and network) usage of each ECS instance in a monitored Alibaba Cloud account. In the process, administrators can quickly and accurately identify instances that are over-utilizing resources, and initiate measures to right-size such instances - e.g., by recommending an upgrade to an instance type with a higher vCPU/memory configuration, or by adding more disks to instances that are running out of disk space.

Target of the test : An Alibaba Cloud Account

Agent deploying the test : A remote agent

Outputs of the test : One set of results for each instance in the Alibaba cloud account that is being monitored

Configurable parameters for the test
Parameters	Description
Test period	How often the test should be executed.
Host	The host for which the test is to be configured.
Alibaba Access Key and Alibaba Secret Key	This test makes REST API requests to Alibaba Cloud to pull the metrics. For this purpose, the test needs to be configured with an AccessKey pair. An AccessKey pair is typically used to call an operation of an Alibaba Cloud service. It is also used to initiate an API request or to use a cloud service SDK to manage cloud resources. An AccessKey pair consists of an AccessKey ID and an AccessKey Secret. The AccessKey ID is used to identify a user/cloud account. The AccessKey Secret is used to verify a user/cloud account. The first step to configuring the eG agent with an AccessKey pair is to create an AccessKey pair for the target cloud account. To achieve this, follow the steps below: (1) Log on to the RAM console by using an Alibaba Cloud account. (2) In the left-side navigation pane, click Users under Identities. (3) On the Users page, in the User Logon Name/Display Name column, click the username of the RAM user for which you want to create an AccessKey pair. (4) On the page that appears, click Create AccessKey in the User AccessKeys section. Note: You must enter a verification code if you create an AccessKey pair for the first time. (5) Click Close. Note: The AccessKey Secret is displayed only when you create an AccessKey pair. If the AccessKey pair is leaked or lost, you must create a new one. You can create a maximum of two AccessKey pairs. Make note of the AccessKey ID and AccessKey Secret once they are displayed. Then, configure the Alibaba Access Key parameter of the test with the AccessKey ID, and the Alibaba Secret Key parameter with the AccessKey Secret you made note of. If you failed to make note of the AccessKey ID and AccessKey Secret at the time of creating the AccessKey pair, then you can obtain the same at a later point in time. Similarly, if an AccessKey pair already exists for the target cloud account, then you do not have to create another one. Instead, you can obtain the AccessKey ID and AccessKey Secret of the existing AccessKey pair and configure the eG agent with the same. For this, follow the steps below: (1) Use an Alibaba Cloud account to log on to the Alibaba Cloud Management console. (2) Move the pointer over the profile picture in the upper-right corner, and click AccessKey. (3) In the Security Tips message that appears, click Continue to manage AccessKey. The AccessKey ID and AccessKey Secret are displayed. (4) Make note of the displayed ID and secret. Then, configure the Alibaba Access Key parameter of the test with the AccessKey ID, and the Alibaba Secret Key parameter with the AccessKey Secret you made note of.
Detailed Diagnosis	To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option. The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled: (1) The eG manager license should allow the detailed diagnosis capability. (2) Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.

Measurements made by the test

Measurement

Description

Measurement Unit

Interpretation

Status

Indicates the current state of this instance.

The values that this measure reports and their corresponding numeric values are listed below:

Measure Value	Numeric Value
Running	1
Preparing	2
Starting	3
Expiring	4
Stopping	5
Stopped	6
Expired	7
Expired and being recycled	8
Overdue and being recycled	9
Locked	10
Release pending	11

Some of the Measure Values listed in the table above are described below:

Preparing: After an instance is created, it is in this state before it enters the Running state. If the instance remains in this state for an extended period of time, an exception occurs.
Starting: After you start or restart an instance by using the ECS console or by calling an API operation, the instance enters this state before it enters the Running state. If the instance remains in this state for an extended period of time, an exception occurs.
Running: If an instance runs properly, it is in this state.
Expiring: A subscription instance remains in the Expiring state for 15 days before it expires. If your instance enters the Expiring state, we recommend that you renew the instance in a timely manner.
Stopping: When you stop an instance by using the ECS console or by calling an API operation, the instance enters this state before it enters the Stopped state. If the instance remains in this state for an extended period of time, an exception occurs.
Stopped: After an instance is stopped or after an instance is created but has not started, it is in the Stopped state.
Expired: When a subscription instance expires or when a pay-as-you-go instance is stopped due to overdue payments, the instance enters the Expired state.
Locked: If you have an overdue payment in your account or if your account is insecure, your instance enters the Locked state. You can submit a ticket to unlock the instance.
Release pending: If you apply for a refund for a subscription instance before the instance expires, the instance enters the Release pending (To Be Released) state.

Note:

This measure reports the Measure Values listed in the table above to indicate the current state of an ECS instance. In the graph of this measure however, the same is indicated using the numeric equivalents only.

Use the detailed diagnosis of this measure to know more about the instance. The details displayed include the instance type, when it was created, the operating system of the instance, the region and zone to which the instance belongs, the image from which the instance was created, and the network type, IP addresses, VPC, and security group of the instance.
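
Because graphs of the Status measure show only numeric equivalents, it can help to translate a plotted value back into its state name. The following is a minimal sketch, not part of eG Enterprise: the mapping is copied from the table above, while the function names and the set of states treated as abnormal are assumptions made for illustration (the set is based on the examples of expired, expiring, and locked instances mentioned earlier).

```python
# Numeric values reported by the Status measure, copied from the table above.
STATUS_BY_VALUE = {
    1: "Running",
    2: "Preparing",
    3: "Starting",
    4: "Expiring",
    5: "Stopping",
    6: "Stopped",
    7: "Expired",
    8: "Expired and being recycled",
    9: "Overdue and being recycled",
    10: "Locked",
    11: "Release pending",
}

# Assumption: which states count as "abnormal" is a local policy choice.
ABNORMAL_STATES = {
    "Expiring",
    "Expired",
    "Expired and being recycled",
    "Overdue and being recycled",
    "Locked",
}


def decode_status(value: int) -> str:
    """Return the state name for a numeric Status value (hypothetical helper)."""
    return STATUS_BY_VALUE.get(value, f"Unknown ({value})")


def is_abnormal(value: int) -> bool:
    """Return True if the decoded state is in the assumed abnormal set."""
    return decode_status(value) in ABNORMAL_STATES
```

For example, `decode_status(10)` yields the name Locked, and `is_abnormal(1)` is false because Running is not in the assumed abnormal set.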

Total CPU

Indicates the total number of CPU cores configured for this instance.

Number

Total memory

Indicates the memory configuration of this instance.

Network throughput

Indicates the total inbound and outbound bandwidth usage of this instance.

Kbps

Compare the value of this measure across instances to know which instance is making the most use of the bandwidth resources.

Network inbound bandwidth

Indicates the maximum bandwidth used by traffic flowing into this instance from the public network.

Kbps

These metrics will give administrators an idea as to where public bandwidth resources are spent.

Network outbound bandwidth

Indicates the maximum bandwidth used by traffic flowing out of this instance to the public network.

Kbps

CPU utilization

Indicates the percentage of allocated CPU units that is currently used by this instance.

Percent

A value close to 100% for an instance indicates that such an instance is over-utilizing the CPU resources allocated to it.

Intranet traffic received

Indicates the bandwidth consumed by traffic flowing into this instance from the intranet.

Kbps

By comparing the value of this measure across instances, you can accurately identify the instance that is receiving bandwidth-intensive intranet traffic.

Intranet traffic sent

Indicates the bandwidth consumed by traffic flowing out of this instance to the intranet.

Kbps

By comparing the value of this measure across instances, you can accurately identify the instance that is sending bandwidth-intensive intranet traffic.

Intranet bandwidth

Indicates the total bandwidth consumed by intranet traffic flowing into and out of this instance.

Kbps

Compare the value of this measure across instances to identify the instance handling bandwidth-intensive intranet traffic. You can then compare the value of the Intranet traffic received and Intranet traffic sent measures of that instance to determine whether incoming or outgoing intranet traffic is consuming the bandwidth.

Internet bandwidth

Indicates the total bandwidth consumed by internet traffic flowing into and out of this instance.

Kbps

Compare the value of this measure across instances to identify the instance handling bandwidth-intensive internet traffic. You can then compare the value of the Internet traffic received and Internet traffic sent measures of that instance to determine whether incoming or outgoing internet traffic is consuming the bandwidth.

Internet traffic received

Indicates the bandwidth consumed by traffic flowing into this instance from the internet.

Kbps

By comparing the value of this measure across instances, you can accurately identify the instance that is receiving bandwidth-intensive internet traffic.

Internet traffic sent

Indicates the bandwidth consumed by traffic flowing out of this instance to the internet.

Kbps

By comparing the value of this measure across instances, you can accurately identify the instance that is sending bandwidth-intensive internet traffic.

Disk IOPS

Indicates the rate at which I/O operations were performed on the disks of this instance.

Operations/Sec

Compare the value of this measure across instances to know which instance is experiencing unusually high levels of I/O activity. In such a situation, you can compare the value of the Disk read operations and Disk write operations measures for that instance to determine whether a high rate of read operations or of write operations caused the I/O overload.

Disk read operations

Indicates the rate at which disk read operations were performed by this instance.

Operations/Sec

By comparing the value of this measure across instances, you can accurately identify the instance that is experiencing a high level of disk read operations.

Disk write operations

Indicates the rate at which disk write operations were performed by this instance.

Operations/Sec

By comparing the value of this measure across instances, you can accurately identify the instance that is experiencing a high level of disk write operations.
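
The two-step comparison described for the Disk IOPS, Disk read operations, and Disk write operations measures can be sketched as follows. This is an illustrative sketch only: the function name, the dictionary keys (`iops`, `read_ops`, `write_ops`), and the instance names are hypothetical, and the values are assumed to have been exported from the monitoring console by some other means.

```python
from typing import Dict, Tuple


def busiest_disk_instance(samples: Dict[str, Dict[str, float]]) -> Tuple[str, str]:
    """Find the instance with the highest Disk IOPS and its dominant I/O type.

    `samples` maps an instance name to its latest measure values, using the
    hypothetical keys "iops", "read_ops", and "write_ops" (all in operations/sec).
    """
    # Step 1: compare Disk IOPS across instances.
    name = max(samples, key=lambda n: samples[n]["iops"])

    # Step 2: compare read and write operation rates for that instance.
    reads = samples[name]["read_ops"]
    writes = samples[name]["write_ops"]
    if reads > writes:
        dominant = "read"
    elif writes > reads:
        dominant = "write"
    else:
        dominant = "balanced"
    return name, dominant


# Example with made-up values for two hypothetical instances.
example = {
    "ecs-a": {"iops": 120.0, "read_ops": 100.0, "write_ops": 20.0},
    "ecs-b": {"iops": 900.0, "read_ops": 150.0, "write_ops": 750.0},
}
result = busiest_disk_instance(example)
```

With the made-up values above, `ecs-b` has the highest IOPS and its write rate exceeds its read rate.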

Disk throughput

Indicates the bandwidth consumed by disk read/write operations on this instance.

KB/Sec

If this measure is very high for an instance, it means that the I/O activity on the disks of that instance is consuming bandwidth excessively. In such a situation, you can compare the value of the Disk read bandwidth and Disk write bandwidth measures of that instance to determine whether read activity or write activity is contributing to the unusual bandwidth consumption.

Disk read bandwidth

Indicates the bandwidth consumed by disk read operations on this instance.

KB/Sec

Compare the value of this measure across instances to know which instance is engaged in bandwidth-intensive disk reads.

Disk write bandwidth

Indicates the bandwidth consumed by disk write operations on this instance.

KB/Sec

Compare the value of this measure across instances to know which instance is engaged in bandwidth-intensive disk writes.

CPU credit usage

Indicates the number of CPU credits consumed by this instance.

Number

This measure is reported only for burstable instances.

Burstable instances are an economical instance type that is intended to cope with burstable performance requirements in entry-level computing scenarios. These instances use CPU credits to ensure computing performance, and are suited for scenarios where CPU usage is typically low but bursts in CPU usage occur on occasion. You can accumulate CPU credits that can be used to increase the computing performance of burstable instances when required by your workloads. The CPU credit mechanism allows you to minimize the consumption of resources during off-peak hours, and scale resources out during peak hours at no extra cost.

When you create a burstable instance, 30 CPU credits are provisioned for each vCPU of the instance. These are the initial CPU credits, and they enable you to complete deployment tasks after you start the instance. When a burstable instance is started, it starts to consume CPU credits to maintain its computing performance. The value of this measure denotes the number of CPU credits so spent.

By comparing the value of this measure across burstable instances, you can quickly identify the instance that is consuming too many CPU credits.

CPU credit balance

Indicates the CPU credits that are still to be used by this instance.

Number

As mentioned earlier, once a burstable instance is started, it begins consuming the initial CPU credits (30 per vCPU) that were provisioned to it. Meanwhile, the burstable instance also earns CPU credits at a fixed rate that is determined by the instance type. The number of CPU credits that a vCPU can earn per hour is based on its baseline performance - i.e., the amount of vCPU capacity that is continuously provisioned to a burstable instance. For example, a 25% baseline performance for instance A indicates that the CPU credits that a vCPU of the instance earns per hour can keep the vCPU running at 25% utilization for an hour or at 100% utilization for 15 minutes (60 × 25%). At this baseline performance, each vCPU earns 15 CPU credits per hour. Therefore, if instance A has two vCPUs, it earns 30 CPU credits per hour.

If the CPU credits so earned exceed the credits consumed, the net credits are accrued as CPU credit balance. This is the value that is reported by the CPU credit balance measure. A high value is desired for this measure, as a high CPU credit balance for a burstable instance means that CPU resources are guaranteed to that instance for a maximum of 24 hours.
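
The credit arithmetic in the example above can be reproduced with a short calculation. This sketch assumes, as the 25% example implies, that one CPU credit keeps one vCPU at 100% utilization for one minute. The function names are hypothetical, and the burst-time helper deliberately ignores credits earned while the instance is bursting.

```python
def credits_earned_per_hour(baseline_percent: float, vcpus: int) -> float:
    """CPU credits earned per hour: 60 minutes x baseline fraction x vCPU count."""
    return 60 * (baseline_percent / 100) * vcpus


def initial_credits(vcpus: int) -> int:
    """Initial CPU credits: 30 per vCPU, as stated for burstable instances."""
    return 30 * vcpus


def minutes_at_full_load(credits: float, vcpus: int) -> float:
    """Minutes that all vCPUs can run at 100% on the given credits.

    Simplification: credits earned during the burst are not counted.
    """
    return credits / vcpus


# Instance A from the example: 25% baseline performance.
per_vcpu = credits_earned_per_hour(25, 1)    # one vCPU
two_vcpus = credits_earned_per_hour(25, 2)   # two vCPUs
burst = minutes_at_full_load(per_vcpu, 1)    # minutes at 100% on one hour's credits
```

Here `per_vcpu` is 15 credits, `two_vcpus` is 30 credits, and `burst` is 15 minutes, matching the figures quoted in the example.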

Total disk

Indicates the total number of disks currently used by this instance.

Number

Use the detailed diagnosis of this measure to know which disks are used by the instance, the type of each disk, when every disk was created, the image that stores a copy of that disk's data, and when the disk was attached to the instance.

Disk size

Indicates the total capacity of disks used by this instance.

CPU pending I/O operations

Indicates the percentage of the CPU processes waiting for I/O operations to complete.

Percent

A high value indicates frequent I/O operations on an instance.

Free memory

Indicates the percentage of memory allocated to this instance that is still unused.

Percent

A high value is desired for this measure. A value close to 0% indicates that the instance is running out of memory.

Memory usage

Indicates the percentage of allocated memory that is used by this instance.

Percent

A low value is desired for this measure. A value close to 100% is a cause for concern, as it indicates that the instance is rapidly running out of memory. If the instance appears to be consistently over-utilizing its memory, you may want to consider upgrading to a different instance type to meet with its memory demand.
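
One way to decide whether an instance is consistently over-utilizing its memory is to check that every recent Memory usage reading stays above a chosen threshold. The sketch below is illustrative only: the function name and the 90% default threshold are assumptions, not eG Enterprise settings, and the readings are assumed to have been collected elsewhere.

```python
from typing import Sequence


def consistently_high(usage_percent: Sequence[float], threshold: float = 90.0) -> bool:
    """Return True if every reading in the window is at or above the threshold.

    An empty window returns False, since there is no evidence of sustained usage.
    The 90% default is an assumed value; tune it to your own policy.
    """
    return len(usage_percent) > 0 and all(v >= threshold for v in usage_percent)


# Made-up Memory usage readings (percent) for two hypothetical instances.
sustained = consistently_high([92.0, 95.5, 97.0])
spiky = consistently_high([92.0, 40.0, 97.0])
```

Requiring all readings to exceed the threshold, rather than only the latest one, avoids recommending an instance type upgrade because of a single short spike.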

Average system load

Indicates the average load on this instance during the last 5 minutes.

Percent

A high value indicates that the instance is busy.

Total snapshots

Indicates the total number of snapshots created for disks used by this instance.

Number

The Alibaba Cloud snapshot service allows you to create crash-consistent snapshots for all disk categories. You can use snapshots for the following scenarios:

Disaster recovery and backup: You can create a snapshot for a disk and then use the snapshot to create another disk to implement zone- or geo-disaster recovery.
Environment clone: You can use a system disk snapshot to create a custom image and then use the custom image to create ECS instances with identical environments.
Data development: Snapshots can provide near-real-time production data for applications such as data mining, report queries, and development and testing.
Enhancement of fault tolerance: You can roll a disk back to a previous point in time by using a snapshot to reduce the risk of data loss caused by incorrect operations.

Total snapshot size

Indicates the total size of the snapshots created for the disks used by this instance.

Number