K8s Node Groups Test

Kubernetes node groups are collections of worker nodes with identical configurations, such as instance type, OS, and labels. They enable unified management, scaling, and updates of nodes. Multiple node groups let clusters run diverse workloads efficiently, balance cost, and improve reliability across different environments or zones.

Monitoring Kubernetes node groups ensures cluster health, performance, and cost efficiency. It helps detect resource bottlenecks, node failures, or imbalance early. Continuous observation supports autoscaling decisions, workload distribution, and prevents downtime by maintaining optimal node utilization and system stability across the cluster.

This test monitors the node groups and collects key metrics such as the total number of nodes, the number of running nodes, and the number of nodes that are not running. These metrics provide administrators with crucial insights into group health and node usage, enabling early detection of issues, faster troubleshooting, and improved reliability. By tracking the status of nodes, administrators can optimize node allocation within the group, plan capacity, and ensure consistent, cost-effective operation of the Kubernetes cluster.

Target of the test : Kubernetes Service Cluster

Agent deploying the test : A remote agent

Outputs of the test : One set of results for each node group being monitored.

Configurable parameters for the test
Parameter Description

Test Period

How often should the test be executed.

Host

The IP address of the host for which this test is to be configured.

Port

Specify the port at which the specified Host listens. By default, this is 6443.

Load Balancer / Master Node IP

To run this test and report metrics, the eG agent needs to connect to the Kubernetes API on the master node and run API commands. To enable this connection, the eG agent has to be configured with either of the following:

  • If only a single master node exists in the cluster, then configure the eG agent with the IP address of the master node.
  • If the target cluster consists of more than one master node, then you need to configure the eG agent with the IP address of the load balancer that is managing the cluster. In this case, the load balancer will route the eG agent's connection request to any available master node in the cluster, thus enabling the agent to connect with the API server on that node, run API commands on it, and pull metrics.
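The connection described above can be illustrated with a short sketch. This is not eG's actual implementation — just a minimal, hypothetical example of how a client builds an HTTPS request to the Kubernetes API on the master node (or load balancer), using the Host, Port, SSL, and Authentication Token parameters described on this page:

```python
import urllib.request

def build_nodes_request(host, port=6443, token="", use_ssl=True):
    """Build an HTTP(S) request for the Kubernetes nodes endpoint.

    host/port correspond to the Load Balancer / Master Node IP and Port
    parameters; token is the bearer token described below. This helper
    is hypothetical and for illustration only.
    """
    scheme = "https" if use_ssl else "http"
    url = f"{scheme}://{host}:{port}/api/v1/nodes"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )

# Example (no network call is made here):
req = build_nodes_request("10.0.0.5", token="abc123")
print(req.full_url)                     # https://10.0.0.5:6443/api/v1/nodes
print(req.get_header("Authorization"))  # Bearer abc123
```

If the SSL flag is set to No, the same request would be built with an http:// URL instead.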

By default, this parameter will display the Load Balancer / Master Node IP that you configured when manually adding the Kubernetes/OpenShift cluster for monitoring, using the Kubernetes Cluster Preferences page in the eG admin interface (see Figure 3). The steps for managing the cluster using the eG admin interface are discussed elaborately in How to Monitor the Kubernetes/OpenShift Cluster Using eG Enterprise?

Whenever the eG agent runs this test, it uses the IP address that is displayed (by default) against this parameter to connect to the Kubernetes API. If there is any change in this IP address at a later point in time, then make sure that you update this parameter with it, by overriding its default setting.

SSL

By default, the Kubernetes/OpenShift cluster is SSL-enabled. This is why the eG agent, by default, connects to the Kubernetes API via an HTTPS connection. Accordingly, this flag is set to Yes by default.

If the cluster is not SSL-enabled in your environment, then set this flag to No.

Authentication Token

The eG agent requires an authentication bearer token to access the Kubernetes API, run API commands on the cluster, and pull metrics of interest. The steps for generating this token have been detailed in How Does eG Enterprise Monitor a Kubernetes/OpenShift Cluster?

Typically, once you generate the token, you can associate that token with the target Kubernetes/OpenShift cluster, when manually adding that cluster for monitoring using the eG admin interface. The steps for managing the cluster using the eG admin interface are discussed elaborately in How to Monitor the Kubernetes/OpenShift Cluster Using eG Enterprise?

By default, this parameter will display the Authentication Token that you provided in the Kubernetes Cluster Preferences page of the eG admin interface, when manually adding the cluster for monitoring (see Figure 3).

Whenever the eG agent runs this test, it uses the token that is displayed (by default) against this parameter for accessing the API and pulling metrics. If for any reason, you generate a new authentication token for the target cluster at a later point in time, then make sure you update this parameter with the change. For that, copy the new token and paste it against this parameter.

Confirm Authentication Token

Confirm the authentication token by retyping it here.

Node groups based on labels

Node groups are not a native capability of Kubernetes; they are created by labeling the nodes. Specify the node labels in this text box. If there is more than one label, separate them with commas. You can also use a wildcard to match all labels with a given prefix.
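For instance, a node might carry labels such as nodegroup=frontend or zone=us-east-1a. A minimal sketch of how a comma-separated label specification with a trailing-* wildcard prefix could be matched against a node's labels follows; the exact matching semantics used by the test are an assumption here:

```python
def node_matches(node_labels, label_spec):
    """Check whether a node's labels match a comma-separated label spec.

    label_spec entries are either exact key=value pairs or a prefix
    ending in '*' that matches any label key with that prefix.
    Illustrative only; the test's actual matching rules may differ.
    """
    for pattern in (p.strip() for p in label_spec.split(",")):
        if pattern.endswith("*"):
            prefix = pattern[:-1]
            if any(key.startswith(prefix) for key in node_labels):
                return True
        elif "=" in pattern:
            key, _, value = pattern.partition("=")
            if node_labels.get(key) == value:
                return True
    return False

labels = {"nodegroup": "frontend", "zone": "us-east-1a"}
print(node_matches(labels, "nodegroup=frontend"))  # True
print(node_matches(labels, "zone*"))               # True (wildcard prefix)
print(node_matches(labels, "nodegroup=backend"))   # False
```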

Namespace to Monitor

To enable the eG agent to monitor a specific Namespace on the Kubernetes/OpenShift cluster, specify the name of that Namespace against this parameter. For instance, eshop. Doing so will enable the eG agent to monitor and report metrics specific to this Namespace.

Proxy Host

If the eG agent connects to the Kubernetes API on the master node via a proxy server, then provide the IP address of the proxy server here. If no proxy is used, then the default setting - none - of this parameter need not be changed.

Proxy Port

If the eG agent connects to the Kubernetes API on the master node via a proxy server, then provide the port number at which that proxy server listens here. If no proxy is used, then the default setting - none - of this parameter need not be changed.

Proxy Username, Proxy Password, Confirm Password

These parameters are applicable only if the eG agent uses a proxy server to connect to the Kubernetes/OpenShift cluster, and that proxy server requires authentication. In this case, provide a valid user name and password against the Proxy Username and Proxy Password parameters, respectively. Then, confirm the password by retyping it in the Confirm Password text box.

If no proxy server is used, or if the proxy server used does not require authentication, then the default setting - none - of these parameters need not be changed.

Kubernetes version

The Version text box indicates the version of the Kubernetes/OpenShift cluster to be managed. The default value is none. If the value of this parameter is not "none", the test uses the value provided (e.g., 1.28) as the Kubernetes version.

Timeout

Specify the duration (in seconds) for which this test should wait for a response from the Kubernetes/OpenShift cluster. If there is no response from the cluster beyond the configured duration, the test will timeout. By default, this is set to 5 seconds.

DD Frequency

Refers to the frequency with which detailed diagnosis measures are to be generated for this test. The default is 1:1. This indicates that, by default, detailed measures will be generated every time this test runs, and also every time the test detects a problem. You can modify this frequency, if you so desire. Also, if you intend to disable the detailed diagnosis capability for this test, you can do so by specifying none against DD frequency.

Detailed Diagnosis

To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option.

The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

  • The eG manager license should allow the detailed diagnosis capability
  • Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
Measurements made by the test

Measurement

Description

Measurement Unit

Interpretation

Total nodes

Indicates the total number of nodes in this node group.

Number

Administrators can define the total number of nodes in a group based on their workload needs and cluster configuration. They should ensure that nodes are evenly distributed across groups, so that some groups are not too dense while others are sparse.

Detailed diagnosis of this measure will provide detailed information about each node in the group. Measures reported by detailed diagnosis are: Node name, Status, Age in days, Labels, OS Image name, Operating System, Kernel Version, System UUID Address, Hostname, Kubelet Version and Container Runtime.

Recently added nodes

Indicates the number of nodes added to this node group during the last measurement period.

Number

Nodes are added to the group if the workload is increasing or expected to increase. Once nodes are added, administrators need to observe the workload and node usage to ensure that the newly added nodes are utilized.

Detailed diagnosis of this measure will provide detailed information about each recently added node in the group. Measures reported by detailed diagnosis are: Node name, Status, Age in days, Labels, OS Image name, Operating System, Kernel Version, System UUID Address, Hostname, Kubelet Version and Container Runtime.

Recently removed nodes

Indicates the number of nodes recently removed from this node group during the last measurement period.

Number

Nodes are removed from the group if they are no longer needed. This can be because of a reduced workload, or because the nodes are not running or are malfunctioning. Administrators may need to review and analyze the reason for removal and take the necessary action if the workload still requires those nodes.

Detailed diagnosis of this measure will provide detailed information about each node recently removed from the group. Measures reported by detailed diagnosis are: Node name, Status, Age in days, Labels, OS Image name, Operating System, Kernel Version, System UUID Address, Hostname, Kubelet Version and Container Runtime.

Running nodes

Indicates the total number of nodes currently running in this node group.

Number

In a group, all nodes are expected to be running unless some of them were added for future use, in anticipation of additional workload. Administrators need to verify this.

Detailed diagnosis of this measure will provide detailed information about each running node in the group. Measures reported by detailed diagnosis are: Node name, Status, Age in days, Labels, OS Image name, Operating System, Kernel Version, System UUID Address, Hostname, Kubelet Version and Container Runtime.

Not running nodes

Indicates the total number of nodes in this node group that are not running.

Number

Administrators would be most concerned about nodes that are expected to be running but are not. If this measure reports a higher value than expected, administrators should immediately look into which nodes are not running and take corrective action.

Detailed diagnosis of this measure will provide detailed information about each node in the group that is not running. Measures reported by detailed diagnosis are: Node name, Status, Age in days, Labels, OS Image name, Operating System, Kernel Version, System UUID Address, Hostname, Kubelet Version and Container Runtime.
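The running / not-running split reported by the measures above is typically derived from each node's Ready condition in the Kubernetes API response. A minimal sketch under that assumption (the test's actual criteria may differ):

```python
def count_node_states(nodes):
    """Split nodes into running / not running based on the Ready condition.

    Each node dict mirrors the shape of a Kubernetes /api/v1/nodes item:
    status.conditions is a list of {"type": ..., "status": ...} entries.
    A node is considered running when its Ready condition is "True".
    """
    running = not_running = 0
    for node in nodes:
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if ready:
            running += 1
        else:
            not_running += 1
    return {"total": running + not_running,
            "running": running,
            "not_running": not_running}

sample = [
    {"status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    {"status": {"conditions": [{"type": "Ready", "status": "True"}]}},
]
print(count_node_states(sample))
# → {'total': 3, 'running': 2, 'not_running': 1}
```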