AWS MSK CPU Test

Excess traffic to AWS MSK clusters can impose a prohibitive load on the AWS MSK server, choking the CPU. To avoid such bottlenecks proactively, you need to monitor the CPU utilization of the clusters in the target AWS MSK server continuously. The AWS MSK CPU test helps here. It tracks the CPU usage of each cluster in the target server over time and alerts you to potential CPU contention, so that sudden spikes in CPU usage are captured promptly.

Target of the test : AWS Managed Service Kafka

Agent deploying the test : A remote agent

Outputs of the test : One set of results for each cluster executing in the target AWS Managed Service Kafka server.

Configurable parameters for the test
Parameter Description

Test Period

How often the test should be executed.

Host

The IP address of the AWS Managed Service Kafka Broker that is being monitored.

Port

Specify the port number at which the specified HOST listens. By default, this is NULL.

AWS Default Region

This test uses the AWS CLI to interact with AWS Managed Service Kafka and pull relevant metrics. To enable the test to connect to AWS, configure it with the name of the region to which all requests for metrics should be routed by default. Specify the name of this AWS Default Region here.

AWS Access Key ID, AWS Secret Access Key and Confirm Password

To monitor AWS Managed Service Kafka, the eG agent has to be configured with the access key and secret key of a user with a valid AWS account. For this purpose, we recommend that you create a special user on the AWS cloud, obtain the access and secret keys of this user, and configure this test with these keys. The procedure is detailed in the Obtaining an Access key and Secret key topic. Make sure you confirm the access and secret keys you provide here by retyping them in the corresponding Confirm Password text box.

Timeout Seconds

Specify the maximum duration (in seconds) for which the test will wait for a response from the server. The default is 10 seconds.
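
The parameters above can be related to an AWS CLI call with a short sketch. The following Python example only builds a command line and an environment; it does not contact AWS. It assumes, without confirmation from this topic, that the broker CPU metrics are read from CloudWatch under the AWS/Kafka namespace with the Cluster Name and Broker ID dimensions. The helper names build_metric_command and build_environment, and the cluster name used in the usage note, are hypothetical.

```python
from datetime import datetime, timedelta, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # ISO 8601 timestamp in UTC


def build_metric_command(cluster_name, broker_id, metric_name, end_time, minutes=5):
    """Build (but do not run) an AWS CLI call for one broker-level metric.

    Assumption: the metric lives in the CloudWatch namespace AWS/Kafka and is
    keyed by the "Cluster Name" and "Broker ID" dimensions.
    """
    end_utc = end_time.astimezone(timezone.utc)
    start_utc = end_utc - timedelta(minutes=minutes)
    return [
        "aws", "cloudwatch", "get-metric-statistics",
        "--namespace", "AWS/Kafka",
        "--metric-name", metric_name,
        "--dimensions",
        f"Name=Cluster Name,Value={cluster_name}",
        f"Name=Broker ID,Value={broker_id}",
        "--start-time", start_utc.strftime(TIME_FORMAT),
        "--end-time", end_utc.strftime(TIME_FORMAT),
        "--period", "60",
        "--statistics", "Average",
        "--output", "json",
    ]


def build_environment(region, access_key_id, secret_access_key):
    """Map the test's AWS parameters to the environment variables the AWS CLI reads."""
    return {
        "AWS_DEFAULT_REGION": region,
        "AWS_ACCESS_KEY_ID": access_key_id,
        "AWS_SECRET_ACCESS_KEY": secret_access_key,
    }
```

To run the command, you could pass the list to subprocess.run together with the environment, for example with timeout=10 to mirror the default Timeout Seconds value. A call such as build_metric_command("demo-cluster", "1", "CpuUser", datetime.now(timezone.utc)) returns the argument list for the last five minutes of the CpuUser metric of broker 1.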

Measurements made by the test
Measurement Description Measurement Unit Interpretation

CPU-Credit balance

Indicates the CPU credit balance on the brokers.

Number

Once a burstable instance is started, it begins consuming the 30 initial CPU credits provisioned to it. Meanwhile, the instance also earns CPU credits at a fixed rate that is determined by the instance type. One CPU credit corresponds to one CPU running at 100% utilization for one minute. The number of CPU credits that a CPU can earn per hour is based on its baseline performance, i.e., the amount of CPU capacity that is continuously provisioned to a burstable instance. For example, a 25% baseline performance for instance A means that the credits a CPU of the instance earns per hour can keep that CPU running at 25% utilization for an hour or at 100% utilization for 15 minutes (60 × 25%). At this baseline, each CPU earns 15 CPU credits per hour. Therefore, if instance A has two CPUs, it earns 30 CPU credits per hour.

If the CPU credits so earned exceed the credits consumed, the net credits accrue as the CPU credit balance. This is the value reported by the CPU-Credit balance measure. A high value is desired for this measure, as a high CPU credit balance for a burstable instance means that CPU resources are guaranteed to that instance for a maximum of 24 hours.
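
The credit arithmetic described for this measure can be sketched in a few lines of Python. This is an illustration only, not the way the test computes the measure. It assumes that one CPU credit equals one CPU at 100% utilization for one minute, and it reads the 24-hour limit as a cap on the balance equal to 24 hours of earned credits. The function names are hypothetical.

```python
MINUTES_PER_HOUR = 60


def credits_earned_per_hour(baseline_percent, cpus):
    """Credits earned per hour: the baseline share of each CPU's 60 minutes."""
    return baseline_percent * MINUTES_PER_HOUR * cpus / 100


def credits_spent_per_hour(utilization_percent, cpus):
    """Credits consumed per hour at a given average CPU utilization."""
    return utilization_percent * MINUTES_PER_HOUR * cpus / 100


def next_balance(balance, baseline_percent, utilization_percent, cpus, hours=1.0):
    """Credit balance after `hours` at a steady utilization.

    Assumption: the balance cannot drop below zero and cannot exceed the
    credits earned in 24 hours.
    """
    earned = credits_earned_per_hour(baseline_percent, cpus) * hours
    spent = credits_spent_per_hour(utilization_percent, cpus) * hours
    cap = credits_earned_per_hour(baseline_percent, cpus) * 24
    return max(0.0, min(cap, balance + earned - spent))
```

With the figures from the example above, credits_earned_per_hour(25, 1) gives 15 credits and credits_earned_per_hour(25, 2) gives 30 credits. An instance that starts with 30 credits and runs its two CPUs at 10% for an hour ends with next_balance(30, 25, 10, 2), i.e., 30 + 30 - 12 = 48 credits.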

CPU-Idle

Indicates the percentage of time that the CPU spent in an idle state.

Percent

If the CPU wait time measure is abnormally high, compare the value of this measure with that of the Swap wait measure to determine where the CPU spent the most time: waiting for swapping, in the idle state, or waiting for an I/O operation.

CPU-System

Indicates the percentage of CPU time spent in kernel space.

Percent

 

CPU-User

Indicates the percentage of CPU time spent in user space.

Percent

 

CPU-Credit usage

Indicates the CPU credit usage on the instances.

Number

If CPU usage is sustained above the baseline level of 20%, the CPU credit balance can run out, which can degrade cluster performance. Monitor the value of this measure and take corrective action when alerted.
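
To gauge how much headroom remains when usage stays above the baseline, the following Python sketch estimates the time until the credit balance is exhausted. It is a rough estimate under stated assumptions: the 20% baseline applies to each CPU, utilization is steady, and one CPU credit equals one CPU at 100% utilization for one minute. The function name is hypothetical.

```python
def hours_until_credits_exhausted(balance, utilization_percent, cpus, baseline_percent=20):
    """Estimate the hours until the credit balance reaches zero.

    Returns None when utilization is at or below the baseline, because the
    balance does not shrink in that case.
    """
    # Net credits drained per hour: spending above what the baseline earns back.
    net_drain_per_hour = (utilization_percent - baseline_percent) * 60 * cpus / 100
    if net_drain_per_hour <= 0:
        return None
    return balance / net_drain_per_hour
```

For example, a broker with two CPUs and a balance of 144 credits that runs at a steady 50% drains 36 credits per hour net of earnings, so hours_until_credits_exhausted(144, 50, 2) returns 4.0.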