K8s Cluster Overview Test

A Kubernetes/OpenShift cluster is a set of machines, called nodes, that run containerized applications managed by Kubernetes/OpenShift. A cluster has at least one worker node and at least one master node.

The worker node(s) host the pods that are the components of the application. The master node(s) manage the worker nodes and the pods in the cluster. Multiple master nodes are used to provide a cluster with failover and high availability.

The kube-scheduler schedules Pods to a node, based on the resource capacity of the node and the resource requirements of the containers in the Pods. To ensure that no Pod hogs the node's resources, resource requests and limits can be set per container.

At any given point in time, an administrator needs to have a macro view of the composition of their Kubernetes/OpenShift cluster - i.e., the number of nodes and Pods in the cluster - and the operational state of the nodes and Pods. This will help them quickly spot nodes and Pods that have failed - i.e., it will help them quickly detect a mismatch between the actual state of the cluster and its desired state. By taking appropriate action on such mismatches, administrators can prevent any adverse impact on the availability and performance of containerized applications. Additionally, administrators also need to track how the Pods are utilizing the cluster's compute resources. This way, they can proactively detect probable resource contentions / over-subscriptions, and rapidly initiate measures to right-size the cluster components (i.e., Pods and containers), so that application performance is not affected by resource crunches. Administrators also require an overview of Deployments across the cluster, so that they can easily locate problem areas. The Kube Cluster Overview test provides administrators with all these useful high-level insights!

This test monitors a Kubernetes/OpenShift cluster, reports the total count of nodes in the cluster, and also precisely pinpoints the master and worker nodes of the cluster. The test also tracks the Pod capacity of the cluster alongside Pod allocations, and additionally highlights Pods and nodes in an abnormal state. This enables administrators to rapidly detect any glaring mismatch between the desired state and actual state of the cluster and initiate appropriate remedial measures. Furthermore, the test reveals how the Pods in the cluster are utilizing the cluster's compute resource capacity. In the process, the test brings to light irregularities such as resource over-subscription and current/potential resource contention. Detailed diagnostics provided by the test lead administrators to the exact Pods that are hogging cluster resources, or have been poorly sized. This way, the test points administrators to those Pods for which resource allocations need to be fine-tuned to ensure optimal cluster performance. In addition, the test helps administrators easily compare the desired state of Deployments with the actual state, so that they can instantly capture and resolve discrepancies (if any).

Target of the test : A Kubernetes/OpenShift Cluster

Agent deploying the test : A remote agent

Outputs of the test : One set of results for the Kubernetes/OpenShift cluster being monitored

Configurable parameters for the test
Parameter Description

Test Period

How often should the test be executed.

Host

The IP address of the host for which this test is to be configured.

Port

Specify the port at which the specified Host listens. By default, this is 6443.

Load Balancer / Master Node IP

To run this test and report metrics, the eG agent needs to connect to the Kubernetes API on the master node and run API commands. To enable this connection, the eG agent has to be configured with either of the following:

  • If only a single master node exists in the cluster, then configure the eG agent with the IP address of the master node.
  • If the target cluster consists of more than one master node, then you need to configure the eG agent with the IP address of the load balancer that is managing the cluster. In this case, the load balancer will route the eG agent's connection request to any available master node in the cluster, thus enabling the agent to connect with the API server on that node, run API commands on it, and pull metrics.

By default, this parameter will display the Load Balancer / Master Node IP that you configured when manually adding the Kubernetes/OpenShift cluster for monitoring, using the Kubernetes Cluster Preferences page in the eG admin interface (see Figure 3). The steps for managing the cluster using the eG admin interface are discussed elaborately in How to Monitor the Kubernetes/OpenShift Cluster Using eG Enterprise?

Whenever the eG agent runs this test, it uses the IP address that is displayed (by default) against this parameter to connect to the Kubernetes API. If there is any change in this IP address at a later point in time, then make sure that you update this parameter with it, by overriding its default setting.

SSL

By default, the Kubernetes/OpenShift cluster is SSL-enabled. This is why the eG agent, by default, connects to the Kubernetes API via an HTTPS connection. Accordingly, this flag is set to Yes by default.

If the cluster is not SSL-enabled in your environment, then set this flag to No.

Authentication Token

The eG agent requires an authentication bearer token to access the Kubernetes API, run API commands on the cluster, and pull metrics of interest. The steps for generating this token have been detailed in How Does eG Enterprise Monitor a Kubernetes/OpenShift Cluster?

Typically, once you generate the token, you can associate that token with the target Kubernetes cluster, when manually adding that cluster for monitoring using the eG admin interface. The steps for managing the cluster using the eG admin interface are discussed elaborately in How to Monitor the Kubernetes/OpenShift Cluster Using eG Enterprise?

By default, this parameter will display the Authentication Token that you provided in the Kubernetes Cluster Preferences page of the eG admin interface, when manually adding the cluster for monitoring (see Figure 3).

Whenever the eG agent runs this test, it uses the token that is displayed (by default) against this parameter for accessing the API and pulling metrics. If for any reason, you generate a new authentication token for the target cluster at a later point in time, then make sure you update this parameter with the change. For that, copy the new token and paste it against this parameter.
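The connection settings described above (Load Balancer / Master Node IP, Port, SSL, and Authentication Token) can be pictured as a single authenticated HTTP request. The following is a minimal sketch only, and not the eG agent's actual implementation. The function name build_api_request, the sample address, and the sample token are hypothetical; /api/v1/nodes is the standard Kubernetes endpoint for listing nodes.

```python
import urllib.request


def build_api_request(host, port=6443, ssl=True, token="", path="/api/v1/nodes"):
    """Build (but do not send) a Kubernetes API request that carries a bearer token."""
    # The SSL flag decides between HTTPS and plain HTTP.
    scheme = "https" if ssl else "http"
    url = f"{scheme}://{host}:{port}{path}"
    request = urllib.request.Request(url)
    # The authentication token travels in the Authorization header.
    request.add_header("Authorization", f"Bearer {token}")
    request.add_header("Accept", "application/json")
    return request


# Hypothetical values, for illustration only.
req = build_api_request("192.0.2.10", token="example-token")
print(req.full_url)
```

If the master node IP or the token changes, only the corresponding argument changes; the rest of the request stays the same, which mirrors why those two parameters must be kept up to date in the test configuration.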

Proxy Host

If the eG agent connects to the Kubernetes API on the master node via a proxy server, then provide the IP address of the proxy server here. If no proxy is used, then the default setting of this parameter (none) need not be changed.

Proxy Port

If the eG agent connects to the Kubernetes API on the master node via a proxy server, then provide the port number at which that proxy server listens here. If no proxy is used, then the default setting of this parameter (none) need not be changed.

Proxy Username, Proxy Password, Confirm Password

These parameters are applicable only if the eG agent uses a proxy server to connect to the Kubernetes/OpenShift cluster, and that proxy server requires authentication. In this case, provide a valid user name and password against the Proxy Username and Proxy Password parameters, respectively. Then, confirm the password by retyping it in the Confirm Password text box.

If no proxy server is used, or if the proxy server used does not require authentication, then the default setting of these parameters (none) need not be changed.

DD Frequency

Refers to the frequency with which detailed diagnosis measures are to be generated for this test. The default is 1:1. This indicates that, by default, detailed measures will be generated every time this test runs, and also every time the test detects a problem. You can modify this frequency, if you so desire. Also, if you intend to disable the detailed diagnosis capability for this test, you can do so by specifying none against DD frequency.

Detailed Diagnosis

To make diagnosis more efficient and accurate, the eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option.

The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

  • The eG manager license should allow the detailed diagnosis capability.
  • Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
Measurements made by the test
Measurement Description Measurement Unit Interpretation

Total nodes

Indicates the total number of nodes in the cluster.

Number

 

Master nodes

Indicates the count of master nodes in the cluster.

Number

Use the detailed diagnosis of this measure to know which are the master nodes in the cluster.

Worker nodes

Indicates the number of worker nodes in the cluster.

Number

Use the detailed diagnosis of this measure to know which are the worker nodes in the cluster.
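How the test tells master nodes from worker nodes is not stated here. One common approach, shown below purely as an illustrative sketch, is to look for the well-known node-role labels on each node object. The function name split_by_role and the sample node names are hypothetical; the input is assumed to be a list of dictionaries shaped like the Node objects returned by the Kubernetes API.

```python
# Labels conventionally carried by master (control plane) nodes.
CONTROL_PLANE_LABELS = (
    "node-role.kubernetes.io/master",
    "node-role.kubernetes.io/control-plane",
)


def split_by_role(nodes):
    """Return (master_names, worker_names) based on node-role labels."""
    masters, workers = [], []
    for node in nodes:
        metadata = node.get("metadata", {})
        labels = metadata.get("labels") or {}
        if any(label in labels for label in CONTROL_PLANE_LABELS):
            masters.append(metadata["name"])
        else:
            workers.append(metadata["name"])
    return masters, workers


# Hypothetical sample data.
sample_nodes = [
    {"metadata": {"name": "cp-1", "labels": {"node-role.kubernetes.io/control-plane": ""}}},
    {"metadata": {"name": "worker-1", "labels": {}}},
    {"metadata": {"name": "worker-2"}},
]
print(split_by_role(sample_nodes))
```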

Nodes added to cluster

Indicates the number of nodes that were added to the cluster since the last measurement period.

Number

Use the detailed diagnosis of this measure to know which nodes were recently added to the cluster.

Nodes removed from cluster

Indicates the number of nodes that were removed from the cluster since the last measurement period.

Number

Use the detailed diagnosis of this measure to know which nodes were recently removed from the cluster.
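The Nodes added to cluster and Nodes removed from cluster measures both compare the current node list with the list seen in the previous measurement period. A minimal sketch of that comparison, assuming node names are used as the identity (the function name diff_nodes and the sample names are hypothetical), could look like this:

```python
def diff_nodes(previous_names, current_names):
    """Return (added, removed) node names between two measurement periods."""
    previous, current = set(previous_names), set(current_names)
    added = sorted(current - previous)      # present now, absent before
    removed = sorted(previous - current)    # present before, absent now
    return added, removed


# Hypothetical node lists from two consecutive measurement periods.
added, removed = diff_nodes(["node-a", "node-b"], ["node-b", "node-c"])
print(len(added), added)
print(len(removed), removed)
```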

Running nodes

Indicates the number of nodes in the cluster that are currently running.

Number

 

Not running nodes

Indicates the number of nodes in the cluster that are not running presently.

Number

Use the detailed diagnosis of this measure to know which nodes are not running and why.

Unknown nodes

Indicates the number of nodes in the cluster that are presently in the Unknown state.

Number

Use the detailed diagnosis of this measure to know which nodes are in an Unknown state and why.
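In Kubernetes, each node reports a Ready condition whose status is True, False, or Unknown. The sketch below assumes that the Running nodes, Not running nodes, and Unknown nodes measures map onto those three values; that mapping is an assumption, not something this page confirms. The function name node_state and the sample data are hypothetical.

```python
def node_state(node):
    """Classify a Node object (dict) by the status of its Ready condition."""
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == "Ready":
            status = condition.get("status")
            if status == "True":
                return "Running"
            if status == "False":
                return "Not running"
            return "Unknown"
    # No Ready condition reported at all: treat as Unknown.
    return "Unknown"


# Hypothetical sample data.
sample_nodes = [
    {"status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    {"status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    {"status": {}},
]
print([node_state(n) for n in sample_nodes])
```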

Pods capacity

Indicates the maximum number of Pods that can be created on the nodes in the cluster.

Number

 

Allocated pods

Indicates the number of Pods that have been scheduled to nodes in the cluster.

Number

If the value of this measure is equal to or close to the value of the Pods capacity measure, it indicates that the cluster has exhausted, or is about to exhaust, its Pod capacity. In such a situation, you may want to add more nodes to your cluster or increase the Pod capacity of your cluster.

Running pods

Indicates the number of Pods in the cluster that are in the Running state currently.

Number

If a Pod is in the Running state, it means that the Pod has been bound to a node, and all of the Containers have been created. At least one Container is still running, or is in the process of starting or restarting.

Use the detailed diagnosis of this measure to know which Pods are in the Running state.

Pending pods

Indicates the number of Pods in the cluster that are in the Pending state currently.

Number

If a Pod is in the Pending state, it means that the Pod has been accepted by the Kubernetes system, but one or more of the Container images has not been created. This includes time before being scheduled as well as time spent downloading images over the network, which could take a while.

If a pod is stuck in Pending, it means that it cannot be scheduled onto a node. Generally, this is because there are insufficient resources of one type or another that prevent scheduling. If this is the case, do one or more of the following:

  • Add more nodes to the cluster.
  • Terminate unneeded pods to make room for pending pods.
  • Check that the pod is not larger than your nodes. For example, if all nodes have a capacity of cpu: 1, then a pod with a request of cpu: 1.1 will never be scheduled.

Use the detailed diagnosis of this measure to know which Pods are in the Pending state.
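The last check for Pods stuck in Pending, namely that a Pod must not be larger than the nodes, can be illustrated with a small sketch. It expresses CPU in millicpu (cpu: 1 is 1000 millicpu, cpu: 1.1 is 1100 millicpu) and considers CPU only; a real scheduler weighs many more factors. The function name nodes_that_fit and the node names are hypothetical.

```python
def nodes_that_fit(pod_request_millicpu, node_capacity_millicpu):
    """Return the names of nodes whose CPU capacity can hold the Pod's request."""
    return sorted(
        name
        for name, capacity in node_capacity_millicpu.items()
        if capacity >= pod_request_millicpu
    )


# Every node offers cpu: 1 (1000 millicpu).
capacity = {"node-a": 1000, "node-b": 1000}

print(nodes_that_fit(1100, capacity))  # a request of cpu: 1.1 fits nowhere
print(nodes_that_fit(500, capacity))   # a request of cpu: 0.5 fits on both nodes
```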

Succeeded pods

Indicates the number of Pods in the cluster that are in the Succeeded state currently.

Number

If a Pod is in the Succeeded state, it means that all Containers in the Pod have terminated in success, and will not be restarted.

Failed pods

Indicates the number of Pods in the cluster that are in the Failed state currently.

Number

If a Pod is in the Failed state, it means that all Containers in the Pod have terminated, and at least one Container has terminated in failure. That is, the Container either exited with non-zero status or was terminated by the system.

Use the detailed diagnosis of this measure to know which Pods are in the Failed state.

Ideally, the value of this measure should be 0.

Unknown pods

Indicates the number of Pods in the cluster that are in the Unknown state currently.

Number

If a Pod is in the Unknown state, it means that the state of the Pod could not be obtained, probably due to an error in communicating with the host of the Pod.

Ideally, the value of this measure should be 0.
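The five Pod measures above correspond to the five values of the Pod phase (Pending, Running, Succeeded, Failed, Unknown). As an illustrative sketch, assuming a list of dictionaries shaped like the Pod objects returned by the Kubernetes API, the counts could be tallied as follows. The function name count_phases and the sample data are hypothetical.

```python
from collections import Counter

PHASES = ("Pending", "Running", "Succeeded", "Failed", "Unknown")


def count_phases(pods):
    """Count Pods per phase; Pods without a reported phase count as Unknown."""
    counts = Counter(pod.get("status", {}).get("phase", "Unknown") for pod in pods)
    return {phase: counts.get(phase, 0) for phase in PHASES}


# Hypothetical sample data.
sample_pods = [
    {"status": {"phase": "Running"}},
    {"status": {"phase": "Running"}},
    {"status": {"phase": "Pending"}},
    {"status": {"phase": "Failed"}},
    {"status": {}},
]
print(count_phases(sample_pods))
```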

Running pods utilization

Indicates the percentage of the cluster's Pod capacity that is currently taken up by Pods in the Running state.

Percent

The formula used for computing this measure is as follows:

(Running pods/Pods capacity)*100

Ideally, the value of this measure should be high.
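A worked instance of the formula above, with made-up numbers, is sketched here. The function name running_pods_utilization is hypothetical.

```python
def running_pods_utilization(running_pods, pods_capacity):
    """Running pods as a percentage of the cluster's Pod capacity."""
    if pods_capacity <= 0:
        raise ValueError("Pods capacity must be positive")
    return running_pods / pods_capacity * 100


# Hypothetical cluster: 55 Running Pods against a Pod capacity of 110.
print(running_pods_utilization(55, 110))
```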

Total CPUs

Indicates the total number of CPU cores supported by the cluster.

Number

 

CPU capacity

Indicates the total CPU capacity of the cluster.

Millicpu

 

CPU requests

Indicates the minimum CPU resources guaranteed to the Pods in the cluster.

Millicpu

This is the sum of CPU requests configured for all containers in all Pods across nodes in the cluster.

A request is the amount of that resource that the system will guarantee to a Pod.
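Kubernetes accepts CPU quantities either in whole or fractional cores (for example, 1 or 0.5) or in millicpu with an m suffix (for example, 250m). The sketch below normalizes both forms to millicpu and sums the CPU requests of all containers in all Pods, which is how this measure is described. It is illustrative only; the function names cpu_to_millicpu and total_cpu_requests and the sample Pods are hypothetical.

```python
from decimal import Decimal


def cpu_to_millicpu(quantity):
    """Convert a Kubernetes CPU quantity such as '250m', '1' or '0.5' to millicpu."""
    text = str(quantity).strip()
    if text.endswith("m"):
        return int(Decimal(text[:-1]))
    return int(Decimal(text) * 1000)


def total_cpu_requests(pods):
    """Sum the CPU requests of every container in every Pod, in millicpu."""
    total = 0
    for pod in pods:
        for container in pod.get("spec", {}).get("containers", []):
            request = container.get("resources", {}).get("requests", {}).get("cpu")
            if request is not None:   # containers without a CPU request add nothing
                total += cpu_to_millicpu(request)
    return total


# Hypothetical sample data.
sample_pods = [
    {"spec": {"containers": [
        {"resources": {"requests": {"cpu": "250m"}}},
        {"resources": {"requests": {"cpu": "1"}}},
    ]}},
    {"spec": {"containers": [
        {"resources": {"requests": {"cpu": "0.5"}}},
        {"resources": {}},
    ]}},
]
print(total_cpu_requests(sample_pods))
```

The same loop applies to CPU limits if the limits key is read instead of requests.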

CPU limits

Indicates the maximum amount of CPU resources that the Pods in the cluster can use.

Millicpu

This is the sum of CPU limits set for all containers in all Pods across nodes in the cluster.

A limit is the maximum amount that the system will allow the Pod to use.

CPU limits allocation

Indicates what percentage of the CPU capacity of the cluster is allocated as CPU limits to containers. In other words, this is the percentage of a cluster's CPU capacity that the containers are allowed to use.

Percent

The formula used for computing this measure is as follows:

(CPU limits/CPU capacity)*100

If the value of this measure exceeds 100%, it means that one/more Pods are probably over-subscribing to the capacity of one/more nodes.
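To see how a value above 100% arises, consider a worked instance of the formula with made-up numbers. The function name limits_allocation_percent is hypothetical.

```python
def limits_allocation_percent(cpu_limits_millicpu, cpu_capacity_millicpu):
    """CPU limits as a percentage of the cluster's CPU capacity."""
    if cpu_capacity_millicpu <= 0:
        raise ValueError("CPU capacity must be positive")
    return cpu_limits_millicpu / cpu_capacity_millicpu * 100


# Hypothetical cluster: 8 cores (8000 millicpu) of capacity, 12000 millicpu of limits.
percent = limits_allocation_percent(12000, 8000)
print(percent, "over-subscribed" if percent > 100 else "within capacity")
```

Because limits are ceilings rather than guarantees, their sum can legitimately exceed capacity; the risk is contention if many containers approach their limits at the same time.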

CPU requests allocation

Indicates what percentage of the total CPU capacity of the cluster is set as CPU requests for the containers in the cluster. In other words, this is the percentage of a cluster's CPU capacity that the containers on the cluster are guaranteed to receive.

Percent

The formula used for computing this measure is as follows:

(CPU requests/CPU capacity)*100

If the value of this measure is unusually high, then you can use the detailed diagnosis of this measure to review the CPU requests configured for each Pod in the cluster. In the process, you can accurately identify the Pod for which the maximum amount of CPU resources in the cluster is guaranteed - i.e., the Pod that is hogging the CPU capacity of the cluster.
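The kind of per-Pod review described above amounts to ranking Pods by their CPU requests. A minimal sketch, assuming the per-Pod request totals are already known in millicpu (the function name rank_pods_by_request and the Pod names are hypothetical):

```python
def rank_pods_by_request(requests_millicpu, capacity_millicpu):
    """Return (pod, request, percent of capacity) tuples, largest request first."""
    ranked = sorted(requests_millicpu.items(), key=lambda item: item[1], reverse=True)
    return [
        (pod, request, request / capacity_millicpu * 100)
        for pod, request in ranked
    ]


# Hypothetical per-Pod CPU requests on a cluster with 4000 millicpu of capacity.
requests = {"web": 500, "db": 2000, "cache": 250}
for pod, request, percent in rank_pods_by_request(requests, 4000):
    print(pod, request, percent)
```

The first entry in the ranking is the Pod for which the largest share of the cluster's CPU capacity is guaranteed.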

Memory capacity

Indicates the total memory capacity of the cluster.

GB

 

Memory requests

Indicates the minimum memory resources guaranteed to the Pods in the cluster.

GB

This is the sum of memory requests configured for all containers in all Pods across nodes in the cluster.

A request is the amount of that resource that the system will guarantee to the Pod.

Memory limits

Indicates the maximum amount of memory resources that the Pods in the cluster can use.

GB

This is the sum of memory limits set for all containers in all Pods across nodes in the cluster.

A limit is the maximum amount that the system will allow the Pod to use.

Memory limits allocation

Indicates what percentage of the memory capacity of the cluster is allocated as memory limits to containers in the cluster. In other words, this is the percentage of a cluster's memory capacity that the containers on the cluster are allowed to use.

Percent

The formula used for computing this measure is as follows:

(Memory limits/Memory capacity)*100

If the value of this measure exceeds 100%, it means that one/more Pods are probably over-subscribing to the capacity of one/more nodes in the cluster.
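Kubernetes expresses memory quantities with binary suffixes (Ki, Mi, Gi, Ti) or decimal suffixes (k, M, G, T), or as a plain number of bytes. The sketch below converts such quantities to bytes and then applies the Memory limits allocation formula. It handles only the suffixes listed in the code, and the function names memory_to_bytes and memory_limits_allocation, as well as the sample values, are hypothetical.

```python
from decimal import Decimal

# Multipliers for the suffixes this sketch understands.
SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def memory_to_bytes(quantity):
    """Convert a Kubernetes memory quantity such as '512Mi' or '1G' to bytes."""
    text = str(quantity).strip()
    # Check two-letter suffixes before one-letter ones.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return int(Decimal(text[: -len(suffix)]) * SUFFIXES[suffix])
    return int(Decimal(text))  # plain bytes


def memory_limits_allocation(limits, capacity):
    """Sum of memory limits as a percentage of memory capacity."""
    total_limits = sum(memory_to_bytes(limit) for limit in limits)
    return total_limits / memory_to_bytes(capacity) * 100


# Hypothetical cluster: 8Gi of capacity, container limits of 4Gi and 6Gi.
print(memory_limits_allocation(["4Gi", "6Gi"], "8Gi"))
```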

Memory requests allocation

Indicates what percentage of the total memory capacity of the cluster is set as memory requests for the containers in the cluster. In other words, this is the percentage of a cluster's memory capacity that the containers in the cluster are guaranteed to receive.

Percent

The formula used for computing this measure is as follows:

(Memory requests/Memory capacity)*100

If the value of this measure is unusually high, then you can use the detailed diagnosis of this measure to review the memory requests configured for each Pod in the cluster. In the process, you can accurately identify the Pod for which the maximum amount of memory resources in the cluster is guaranteed - i.e., the Pod that is hogging the memory capacity of the cluster.

Total pods with updated deployment

Indicates the total number of non-terminated Pod replicas in the cluster that have been updated with changes (if any) made to Pod template specifications.

Number

Typically, whenever changes are made to a Deployment's Pod template - say, labels or container images of the template are changed - a Deployment rollout is triggered. A new ReplicaSet is created and the Deployment manages moving the Pods from the old ReplicaSet to the new one at a controlled rate.

Ideally, the value of this measure should be the same as the value of the Total pods with deployment measure. If not, then it means that the desired number of Pod replicas are not yet fully updated with the changes to the Pod template.

Ready pods with deployment

Indicates the number of ready Pods created in the cluster across Deployments.

Number

 

Total available pods with deployment

Indicates the number of available Pods created in the cluster across Deployments.

Number

A Pod is said to be Available if it has been ready, without any of its containers crashing, for at least the duration configured against minReadySeconds in the Deployment specification.

Ideally, the value of this measure should be the same as the value of the Total pods with deployment measure. If not, then it means that the desired state of the Deployments is not the same as their actual state.

Total unavailable pods with deployment

Indicates the total number of unavailable Pods created in the cluster across Deployments.

Number

Any Pod that is not ready, or is ready but has containers crashing for a period of time beyond the minReadySeconds duration, is automatically considered Unavailable.

Ideally, the value of this measure should be 0. If this measure reports a non-zero value or a value equal to or close to the value of the Total pods with deployment measure, it means that the desired state of the Deployments is not the same as their actual state.
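The Deployment measures above resemble the counters that Kubernetes publishes in each Deployment's status: replicas, updatedReplicas, readyReplicas, availableReplicas, and unavailableReplicas. The sketch below assumes that the test's measures are cluster-wide sums of those fields; that mapping is an assumption and is not confirmed by this page. The function name summarize_deployments and the sample data are hypothetical.

```python
FIELDS = (
    "replicas",
    "updatedReplicas",
    "readyReplicas",
    "availableReplicas",
    "unavailableReplicas",
)


def summarize_deployments(deployments):
    """Sum the status counters of all Deployments; absent counters count as 0."""
    totals = dict.fromkeys(FIELDS, 0)
    for deployment in deployments:
        status = deployment.get("status", {})
        for field in FIELDS:
            totals[field] += status.get(field, 0)
    return totals


# Hypothetical sample data: one healthy Deployment and one in mid-rollout.
sample_deployments = [
    {"status": {"replicas": 3, "updatedReplicas": 3, "readyReplicas": 3,
                "availableReplicas": 3}},
    {"status": {"replicas": 4, "updatedReplicas": 2, "readyReplicas": 3,
                "availableReplicas": 2, "unavailableReplicas": 2}},
]
totals = summarize_deployments(sample_deployments)
print(totals)
print("mismatch" if totals["availableReplicas"] != totals["replicas"] else "in sync")
```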

Total pods with deployment

Indicates the total number of Pods created in the cluster across Deployments.

Number

 

Total CPU usage

Indicates the total CPU utilization of the cluster.

Millicpu

Use the detailed diagnosis of this measure to figure out the CPU utilization of each node in the cluster.

 

Average CPU utilization

Indicates the CPU utilized by the cluster, expressed in percent.

Percent

Total memory usage

Indicates the total memory utilization of the cluster.

GB

Use the detailed diagnosis of this measure to figure out the memory utilized by each node in the cluster.

 

Average Memory utilization

Indicates the memory utilized by the cluster, expressed in percent.

Percent

Total images

Indicates the total number of images on the cluster.

Number

 

Total used images

Indicates the total number of images currently used by the containers on the cluster.

Number

The detailed diagnosis of this measure lists the name of the images that are used, the Image ID of each image, the size of each image and the node on which each image resides.

Not used images

Indicates the number of images on the cluster that are not currently used by any container.

Number

The detailed diagnosis of this measure lists the name of the images that are unused, the Image ID of each image and the size of each image.
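The relationship between the Total images, Total used images, and Not used images measures can be sketched as a set comparison between the images present on the nodes and the images referenced by containers. This is an illustrative sketch under that assumption; the function name split_images and the image names are hypothetical.

```python
def split_images(images_on_nodes, images_in_use):
    """Return (used, unused) image names among the images present on the nodes."""
    present = set(images_on_nodes)
    used = present & set(images_in_use)
    unused = present - used
    return sorted(used), sorted(unused)


# Hypothetical image names.
on_nodes = ["nginx:1.25", "redis:7", "busybox:1.36"]
in_use = ["nginx:1.25", "redis:7"]

used, unused = split_images(on_nodes, in_use)
print(len(on_nodes), len(used), len(unused))
print(unused)
```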

Total images size

Indicates the total size of images on the cluster.

GB

 

Nodes with disk pressure condition

Indicates the number of nodes (on the cluster) that are low on disk capacity.

Number

The detailed diagnosis of this measure indicates the name of the nodes that are low on disk capacity, the reason and the message.

Nodes with memory pressure condition

Indicates the number of nodes (on the cluster) that are running low on memory.

Number

The detailed diagnosis of this measure indicates the name of the nodes that are running low on memory, the reason and the message.

Nodes with out of disk condition

Indicates the number of nodes (on the cluster) that do not have sufficient free disk space to add new Pods.

Number

The detailed diagnosis of this measure indicates the name of the nodes that do not have sufficient free disk space to add new Pods, the reason and the message.

Nodes with PID pressure condition

Indicates the number of nodes (on the cluster) on which too many processes are running.

Number

The detailed diagnosis of this measure indicates the name of the nodes on which too many processes are running, the reason and the message.

Nodes with network unavailable condition

Indicates the number of nodes (on the cluster) on which network is not correctly configured.

Number

The detailed diagnosis of this measure indicates the name of the nodes on which network is not correctly configured, the reason and the message.
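The five node condition measures above match condition types that a Kubernetes node can report in its status, each with a status, a reason, and a message. The following sketch groups nodes by the conditions whose status is True. It assumes dictionaries shaped like Node objects from the Kubernetes API; the function name nodes_by_condition and the sample data are hypothetical.

```python
CONDITION_TYPES = (
    "DiskPressure",
    "MemoryPressure",
    "OutOfDisk",
    "PIDPressure",
    "NetworkUnavailable",
)


def nodes_by_condition(nodes):
    """Map each condition type to (node name, reason, message) for affected nodes."""
    result = {condition_type: [] for condition_type in CONDITION_TYPES}
    for node in nodes:
        name = node["metadata"]["name"]
        for condition in node.get("status", {}).get("conditions", []):
            # A condition applies to the node only when its status is "True".
            if condition.get("type") in result and condition.get("status") == "True":
                result[condition["type"]].append(
                    (name, condition.get("reason", ""), condition.get("message", ""))
                )
    return result


# Hypothetical sample data.
sample_nodes = [
    {"metadata": {"name": "node-a"}, "status": {"conditions": [
        {"type": "DiskPressure", "status": "True",
         "reason": "KubeletHasDiskPressure", "message": "kubelet has disk pressure"},
        {"type": "MemoryPressure", "status": "False"},
    ]}},
    {"metadata": {"name": "node-b"}, "status": {"conditions": [
        {"type": "PIDPressure", "status": "False"},
    ]}},
]
report = nodes_by_condition(sample_nodes)
print({condition_type: len(entries) for condition_type, entries in report.items()})
```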

Total deployments

Indicates the total number of Deployments in the cluster.

Number

 

Total services

Indicates the total number of services in the cluster.

Number

 

Total daemonsets

Indicates the total number of daemonsets in the cluster.

Number

 

Total namespaces

Indicates the total number of namespaces in the cluster.

Number

 

Zombie pods

Indicates the number of Pods in the cluster that are in the Zombie state currently.

Number

A Zombie Pod is a Pod that is still running, but is not doing any “work”. Pods in the Zombie state exhibit three symptoms: the Pod has a status of “Running”, its applications have stopped writing any log output, and no application metrics are being sent to Prometheus.

Use the detailed diagnosis of this measure to identify the names of the Zombie Pods and the nodes to which the Zombie Pods belong.
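The three symptoms of a Zombie Pod can be combined into a simple check. The sketch below is a heuristic for illustration only: the function name looks_like_zombie, its parameters, and the 600-second quiet threshold are all hypothetical, and this page does not state how long a Pod must stay silent before the test counts it as a Zombie.

```python
def looks_like_zombie(phase, seconds_since_last_log, seconds_since_last_metric,
                      quiet_threshold=600):
    """True if a Pod is Running but has produced no logs or metrics recently."""
    return (
        phase == "Running"
        and seconds_since_last_log >= quiet_threshold
        and seconds_since_last_metric >= quiet_threshold
    )


print(looks_like_zombie("Running", 900, 1200))  # silent on both fronts
print(looks_like_zombie("Running", 30, 1200))   # still writing logs
```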

Total statefulsets

Indicates the total number of StatefulSets in the cluster.

Number

StatefulSet is the workload API object used to manage stateful applications. StatefulSet manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods. Like a Deployment, a StatefulSet manages Pods that are based on an identical container spec. Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods.

StatefulSets are valuable for applications that require one or more of the following.

  • Stable, unique network identifiers.

  • Stable, persistent storage.

  • Ordered, graceful deployment and scaling.

  • Ordered, automated rolling updates.

Use the detailed diagnosis of the Master nodes measure to know which are the master nodes in the cluster.

Figure 1 : The detailed diagnosis of the Master nodes measure

 

Use the detailed diagnosis of the Worker nodes measure to know which are the worker nodes in the cluster.

Figure 2 : The detailed diagnosis of the Worker nodes measure

 

Use the detailed diagnosis of the Nodes added to cluster measure to know which nodes were recently added to the cluster.

Figure 3 : The detailed diagnosis of the Nodes added to cluster measure

 

Use the detailed diagnosis of the Nodes removed from cluster measure to know which nodes were recently removed from the cluster.

Figure 4 : The detailed diagnosis of the Nodes removed from cluster measure

 

Use the detailed diagnosis of the Not running nodes measure to know which nodes are not running and why.

Figure 5 : The detailed diagnosis of the Not running nodes measure

 

Use the detailed diagnosis of the Unknown nodes measure to know which nodes are in an Unknown state and why.

Figure 6 : The detailed diagnosis of the Unknown nodes measure

 

Use the detailed diagnosis of the Running pods measure to know which Pods are in the Running state and which node each running Pod is scheduled to.

Figure 7 : The detailed diagnosis of the Running pods measure reported by the Kube Cluster Overview test

 

Use the detailed diagnosis of the Pending pods measure to know which Pods are in the Pending state and which node each pending Pod is scheduled to.

Figure 8 : The detailed diagnosis of the Pending pods measure reported by the Kube Cluster Overview test

 

If the value of the CPU requests allocation measure is unusually high, then you can use the detailed diagnosis of this measure to review the CPU requests configured for each Pod in the cluster. In the process, you can accurately identify the Pod that is guaranteed to receive the maximum amount of CPU resources in the cluster - i.e., the Pod that is hogging the CPU capacity of the cluster.

Figure 9 : The detailed diagnosis of the CPU requests allocation measure reported by the Kube Cluster Overview test

 

If the value of the Memory requests allocation measure is unusually high, then you can use the detailed diagnosis of this measure to review the memory requests configured for each Pod in the cluster. In the process, you can accurately identify the Pod that is guaranteed to receive the maximum amount of memory resources in the cluster - i.e., the Pod that is hogging the memory capacity of the cluster.

Figure 10 : The detailed diagnosis of the Memory requests allocation measure reported by the Kube Cluster Overview test