K8s Etcd Test

Etcd is a consistent and highly available key-value store used as Kubernetes’ backing store for all cluster data. In short, Kubernetes uses etcd as its database.

The API server maintains a write-through cache of objects from etcd. A well-tuned cache is essential for reducing direct database accesses and improving query performance. If the cache is not sized right, or if it does not contain an adequate number of objects, it may be unable to serve many read requests. Such requests are then routed to the etcd database, resulting in expensive database operations and a significant increase in processing overheads. To avoid this, administrators must continuously track the usage of the cache and determine whether or not the cache needs tuning. This is what the K8s Etcd test does.

This test tracks cache hits and misses, and alerts administrators if the misses far exceed the hits, i.e., if the cache is unable to service a majority of the requests. The test also reports the time it takes to insert objects into the cache and to read objects from it, thus drawing administrator attention to potential processing bottlenecks. This way, the test helps administrators measure the usage and processing power of the cache and rapidly identify issues in cache health, so that they can promptly investigate the reasons behind poor cache usage or slow cache processing.

Target of the test : A Kubernetes Cluster

Agent deploying the test : A remote agent

Outputs of the test : One set of results for the Kubernetes cluster being monitored

Configurable parameters for the test
Parameter Description

Test Period

Specify how often the test should be executed.

Host

The IP address of the host for which this test is to be configured.

Port

Specify the port on which the specified Host listens. By default, this is 6443.

Load Balancer / Master Node IP

To run this test and report metrics, the eG agent needs to connect to the Kubernetes API on the master node and run API commands. To enable this connection, the eG agent has to be configured with one of the following:

  • If only a single master node exists in the cluster, then configure the eG agent with the IP address of the master node.
  • If the target cluster consists of more than one master node, then you need to configure the eG agent with the IP address of the load balancer that is managing the cluster. In this case, the load balancer will route the eG agent's connection request to any available master node in the cluster, thus enabling the agent to connect with the API server on that node, run API commands on it, and pull metrics.

By default, this parameter will display the Load Balancer / Master Node IP that you configured when manually adding the Kubernetes cluster for monitoring, using the Kubernetes Cluster Preferences page in the eG admin interface (see Figure 3). The steps for managing the cluster using the eG admin interface are discussed elaborately in How to Monitor the Kubernetes/OpenShift Cluster Using eG Enterprise?

Whenever the eG agent runs this test, it uses the IP address that is displayed (by default) against this parameter to connect to the Kubernetes API. If there is any change in this IP address at a later point in time, then make sure that you update this parameter with it, by overriding its default setting.

K8s Cluster API Prefix

By default, this parameter is set to none. Do not change this setting if you are monitoring a Kubernetes/OpenShift Cluster.

To run this test and report metrics for Rancher clusters, the eG agent needs to connect to the Kubernetes API on the master node of the Rancher cluster and run API commands. By default, the Kubernetes API of a Rancher cluster is of the format: http(s)://{IP Address of kubernetes}/{api endpoints}. The Server section of the kubeconfig.yaml file downloaded from the Rancher console helps in identifying the Kubernetes API of the cluster. For example, https://{IP address of Kubernetes}/k8s/clusters/c-m-bznxvg4w/ is usually the URL of the Kubernetes API of a Rancher cluster.

For the eG agent to connect to the master node of a Rancher cluster and pull out metrics, the eG agent should be made aware of the API endpoints in the Kubernetes API of the Rancher cluster. To aid this, you can specify the API endpoints available in the Kubernetes API of the Rancher cluster against this parameter. In our example, this parameter can be specified as: /k8s/clusters/c-m-bznxvg4w/.
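To illustrate how the Load Balancer / Master Node IP, Port, SSL, and K8s Cluster API Prefix parameters could fit together, here is a minimal Python sketch. It is not the eG agent's actual logic: the function name build_api_base_url is hypothetical, and the sketch assumes that the port is always written into the URL and that a prefix of none means "no prefix".

```python
def build_api_base_url(host, port=6443, ssl=True, api_prefix="none"):
    """Compose a Kubernetes API base URL from the test parameters.

    Hypothetical helper for illustration only; the real agent may
    assemble its URLs differently.
    """
    # The SSL flag selects the scheme: Yes -> https, No -> http.
    scheme = "https" if ssl else "http"

    # The documented default "none" is treated as an empty prefix.
    prefix = api_prefix.strip()
    if prefix.lower() == "none":
        prefix = ""
    # Drop leading/trailing slashes so the pieces join cleanly.
    prefix = prefix.strip("/")

    base = f"{scheme}://{host}:{port}/"
    return f"{base}{prefix}/" if prefix else base


# Plain Kubernetes/OpenShift cluster: prefix left at its default.
print(build_api_base_url("192.0.2.10"))
# Rancher cluster: prefix taken from the Server entry of kubeconfig.yaml.
print(build_api_base_url("192.0.2.10", 443, True, "/k8s/clusters/c-m-bznxvg4w/"))
```

The address 192.0.2.10 and port 443 are placeholders; substitute the values configured for your own cluster.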

SSL

By default, the Kubernetes cluster is SSL-enabled. This is why the eG agent, by default, connects to the Kubernetes API via an HTTPS connection. Accordingly, this flag is set to Yes by default.

If the cluster is not SSL-enabled in your environment, then set this flag to No.

Authentication Token

The eG agent requires an authentication bearer token to access the Kubernetes API, run API commands on the cluster, and pull metrics of interest. The steps for generating this token have been detailed in How Does eG Enterprise Monitor a Kubernetes/OpenShift Cluster?

The steps for generating this token for a Rancher cluster have been detailed in How Does eG Enterprise Monitor a Rancher Cluster?

Typically, once you generate the token, you can associate that token with the target Kubernetes cluster, when manually adding that cluster for monitoring using the eG admin interface. The steps for managing the cluster using the eG admin interface are discussed elaborately in How to Monitor the Kubernetes/OpenShift Cluster Using eG Enterprise?

By default, this parameter will display the Authentication Token that you provided in the Kubernetes Cluster Preferences page of the eG admin interface, when manually adding the cluster for monitoring (see Figure 3).

Whenever the eG agent runs this test, it uses the token that is displayed (by default) against this parameter for accessing the API and pulling metrics. If, for any reason, you generate a new authentication token for the target cluster at a later point in time, make sure you update this parameter with the change. To do so, copy the new token and paste it against this parameter.
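As a rough illustration of what bearer-token access to an HTTP API looks like, the sketch below builds (but does not send) a request that carries the token in an Authorization header, using only Python's standard library. The function name build_api_request, the placeholder URL, and the placeholder token are assumptions made for this example; the sketch does not describe how the eG agent itself issues its calls.

```python
import urllib.request


def build_api_request(url, token):
    """Return a GET request carrying a bearer token.

    Hypothetical helper for illustration; nothing is sent over the
    network until the request is passed to urllib.request.urlopen().
    """
    request = urllib.request.Request(url, method="GET")
    # Bearer tokens travel in the Authorization header.
    request.add_header("Authorization", f"Bearer {token.strip()}")
    request.add_header("Accept", "application/json")
    return request


# Placeholder values; use your own API URL and generated token.
req = build_api_request("https://192.0.2.10:6443/version", "example-token")
print(req.get_header("Authorization"))
```

If the token is regenerated later, only the token argument changes, which mirrors updating the Authentication Token parameter of the test.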

Proxy Host

If the eG agent connects to the Kubernetes API on the master node via a proxy server, then provide the IP address of the proxy server here. If no proxy is used, then the default setting of this parameter, none, need not be changed.

Proxy Port

If the eG agent connects to the Kubernetes API on the master node via a proxy server, then provide the port number on which that proxy server listens here. If no proxy is used, then the default setting of this parameter, none, need not be changed.

Proxy Username, Proxy Password, Confirm Password

These parameters are applicable only if the eG agent uses a proxy server to connect to the Kubernetes cluster, and that proxy server requires authentication. In this case, provide a valid user name and password against the Proxy Username and Proxy Password parameters, respectively. Then, confirm the password by retyping it in the Confirm Password text box.

If no proxy server is used, or if the proxy server used does not require authentication, then the default setting of these parameters, none, need not be changed.
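The way the four proxy parameters combine can be sketched in Python as follows. This is an illustrative assumption, not the eG agent's implementation: the function name build_proxy_map is hypothetical, the proxy is assumed to accept plain HTTP connections, and the value none is assumed to mean "not configured".

```python
import urllib.request
from urllib.parse import quote


def build_proxy_map(proxy_host="none", proxy_port="none",
                    username="none", password="none"):
    """Translate the proxy parameters into a urllib proxy mapping.

    Hypothetical helper for illustration. Returns an empty dict when
    no proxy is configured, which tells ProxyHandler to use no proxy.
    """
    if proxy_host.strip().lower() == "none":
        return {}

    credentials = ""
    if username.strip().lower() != "none":
        # Percent-encode credentials so characters such as '@' or ':'
        # cannot break the proxy URL.
        credentials = f"{quote(username, safe='')}:{quote(password, safe='')}@"

    proxy_url = f"http://{credentials}{proxy_host}:{proxy_port}"
    return {"http": proxy_url, "https": proxy_url}


# Placeholder proxy address and credentials.
proxies = build_proxy_map("192.0.2.20", "3128", "monitor", "p@ss")
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
```

Leaving all four arguments at none yields an empty mapping, matching the case where the default settings are left unchanged.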

Measurements made by the test

Measurement

Description

Measurement Unit

Interpretation

Cache hits

Indicates the ratio of cache hits to misses.

Percent

Ideally, the value of this measure should be over 80%, as such a value is indicative of a healthy cache. On the other hand, if this measure reports a value less than 50%, it is indicative of poor cache usage. Common causes for a large number of cache misses are:

  • The cache is not sized commensurate with the load on etcd;
  • The cache does not hold many of the objects that are requested.
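The thresholds above can be expressed as a small Python sketch. It assumes that the percentage is computed as hits divided by total lookups (hits plus misses); the test's exact formula is not stated here, so treat that as an assumption. The function names and the "watch" label for values between 50% and 80% are likewise hypothetical.

```python
def cache_hit_percentage(hits, misses):
    """Percentage of lookups served from the cache.

    Assumes hits / (hits + misses); returns None when there were no
    lookups at all, so callers can tell "no data" from "0%".
    """
    total = hits + misses
    if total == 0:
        return None
    return 100.0 * hits / total


def classify_cache_health(hit_pct):
    """Apply the thresholds from the Interpretation column."""
    if hit_pct is None:
        return "no data"
    if hit_pct > 80:
        return "healthy"   # over 80%: healthy cache
    if hit_pct < 50:
        return "poor"      # under 50%: poor cache usage
    return "watch"         # in between: hypothetical label


pct = cache_hit_percentage(900, 100)
print(pct, classify_cache_health(pct))  # 90.0 healthy
```

A result of "poor" would point toward the two causes listed above: an under-sized cache, or a cache that lacks many of the requested objects.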

Rate of objects added to etcd cache

Indicates the time taken to add objects to the etcd cache.

Milliseconds

A very high value for this measure indicates a bottleneck when writing to the cache. Slowness in adding objects to a cache can increase the count of cache misses, as it can cause an object to be unavailable in the cache when requested. Typically, if a cache is under-sized, writes to the cache may slow down.

Rate of objects retrieved from etcd cache

Indicates the time taken to retrieve objects from the etcd cache.

Milliseconds

A very high value for this measure indicates a bottleneck when reading from the cache. Slowness in reading objects from a cache is indicative of a request processing bottleneck on the cache. Typically, severe space contention on the cache can choke reads from the cache, thus slowing down processing.