Etcd Network Traffic Test

Data is exchanged between etcd nodes (peers) in a cluster during operations such as data replication, leader election, and client communication. This includes the traffic required for Raft consensus (synchronizing logs between the leader and followers), write operations, and watch notifications. Monitoring etcd's network traffic is essential for ensuring healthy cluster communication, detecting potential network bottlenecks, and avoiding latency issues that can impact the overall performance and consistency of the etcd key-value store.

High latency or network bottlenecks can degrade performance, increase read/write latencies, and undermine the reliability and consistency of etcd in a distributed system.

This test monitors network traffic and reports key metrics that are crucial to understanding network performance, such as the number of bytes sent and received and the number of active and disconnected peer nodes. By analyzing these metrics over time, administrators can gain valuable insights into the current state of the cluster and spot issues that could cause connectivity to fail.
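etcd publishes its network counters in Prometheus text format on its /metrics endpoint. The sketch below parses that exposition format and pulls out a few etcd_network_* series; the sample text is illustrative, not real etcd output, and the peer IDs in the labels are made up.

```python
# Sketch: extracting etcd's network counters from Prometheus-format
# /metrics text. In practice this text would be fetched over HTTP from
# the etcd server; here a hard-coded sample stands in for it.

def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_with_labels: value}."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        values[name] = float(value)
    return values

# Illustrative sample of an etcd /metrics response (peer IDs invented)
sample = """\
# HELP etcd_network_peer_received_bytes_total The total number of bytes received from peers.
etcd_network_peer_received_bytes_total{From="a1b2c3"} 123456
etcd_network_peer_sent_bytes_total{To="a1b2c3"} 654321
etcd_network_active_peers{Local="d4e5f6",Remote="a1b2c3"} 1
"""

metrics = parse_metrics(sample)
received = metrics['etcd_network_peer_received_bytes_total{From="a1b2c3"}']
print(received)
```

A real collector would poll this endpoint at each test period and compute deltas between samples, since these series are cumulative counters.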

Target of the test : A Kubernetes Master Node

Agent deploying the test : A remote agent

Outputs of the test : One set of results for the target Kubernetes Master node being monitored.

Configurable parameters for the test

Parameter

Description

Test Period

How often should the test be executed.

Host

The IP address of the host for which this test is to be configured.

Port

Specify the port at which the specified Host listens. By default, this is 6443.

Timeout

Specify the duration (in seconds) beyond which the test will timeout in the Timeout text box. The default value is 10 seconds.

Is Full duplex

This flag indicates whether the network interface being monitored operates in full-duplex mode, i.e., whether it can send and receive data simultaneously. Set this flag according to the mode of the interface, as it influences how the test computes bandwidth utilization.

Report by connection Id

By default, this flag is set to No, implying that the test identifies network interfaces by their names. If you want the test to identify network interfaces by their connection IDs instead, set this flag to Yes; the test will then report metrics for every connection ID.

Show Top

By default, this parameter is set to 10, indicating that the test will report detailed diagnosis only for the top 10 applications that used the maximum bandwidth while transferring data over each network interface. Using the information displayed by the detailed diagnosis, you can easily identify non-critical applications (if any) that are consuming more bandwidth than business-critical applications and take the necessary steps to alleviate the issue. You can increase or decrease the value of the Show Top parameter depending on the level of visibility you require.

Event capture interval in secs

This parameter is applicable only when the Trace flag is set to Yes. By default, the value of this parameter is set to 10 seconds. This setting ensures that the test will only capture the incoming/outgoing traffic during the last 10 seconds of the specified measurement period. Administrators can override the default value if they wish to capture the incoming/outgoing traffic for a longer duration.

Trace

By default, this flag is set to No, indicating that detailed diagnosis is not reported by default for the Incoming Traffic and Outgoing Traffic measures of this test. However, administrators can set this flag to Yes, if detailed diagnosis should be reported for this test.

Detailed Diagnosis

To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option.

The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

  • The eG manager license should allow the detailed diagnosis capability
  • Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
Measurements made by the test

Measurement

Description

Measurement Unit

Interpretation

Total Grpc received bytes

Indicates the total number of bytes received by the etcd server via gRPC communication.

Bytes

This includes all incoming data from clients or peer nodes, such as requests for key-value operations, metadata, and watch notifications.

TotalGrpcSentBytes

Indicates the total number of bytes sent by the etcd server via gRPC communication.

Bytes

This includes all outgoing data to clients or peer nodes, such as responses to key-value operations, metadata, and watch notifications.

TotalRcvBytes

Indicates the total number of bytes received by the etcd server.

Bytes

 

SumRoundTTSec

Indicates the total round-trip time (RTT), in seconds, for a certain set of operations, such as Raft consensus messages or client-server communication.

Seconds

This is a cumulative sum of the round-trip times for a series of transactions or network requests between etcd nodes or between clients and the etcd server.

CntRoundTTSec

Indicates the total number of round trips for a certain set of operations, such as Raft consensus messages or client-server communication.

Number

 
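Because SumRoundTTSec is a cumulative sum of round-trip times and CntRoundTTSec the cumulative number of round trips, sampling both counters at the start and end of a measurement interval and dividing the deltas yields the mean RTT for that interval. A minimal sketch with illustrative figures:

```python
# Mean round-trip time over one measurement interval, derived from two
# samples of the cumulative SumRoundTTSec / CntRoundTTSec counters.

def mean_rtt(sum_prev, cnt_prev, sum_now, cnt_now):
    """Mean round-trip time (seconds) over the interval, or None if no trips."""
    trips = cnt_now - cnt_prev
    if trips == 0:
        return None  # no round trips completed in the interval
    return (sum_now - sum_prev) / trips

# e.g. 40 round trips accumulated 0.5 s of RTT during the interval
print(mean_rtt(12.0, 1000, 12.5, 1040))  # 0.0125 s, i.e. 12.5 ms
```

A rising mean RTT between etcd peers is an early indicator of the network latency issues that degrade Raft consensus.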

TotalSentBytes

Indicates the total number of bytes sent by the etcd server.

Bytes

 

TotalSentFailures

Indicates the total number of failed send requests.

Number

 

ActivePeers

Indicates the number of etcd nodes that are currently participating in the cluster and actively taking part in the Raft consensus protocol.

Number

The health of the cluster depends on having enough active peers to maintain a quorum. A loss of active peers can lead to split-brain scenarios or make the cluster unavailable for writes.
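The quorum requirement mentioned above can be stated precisely: a cluster of n members needs floor(n/2) + 1 reachable members to commit writes. A small sketch (cluster sizes are illustrative; note that ActivePeers counts the node's peers, so reachable members = ActivePeers + 1 for the local node):

```python
# Quorum check sketch: an etcd cluster of size n can commit Raft writes
# only while floor(n/2) + 1 of its members are reachable.

def has_quorum(cluster_size, reachable_members):
    """True if enough members are reachable to commit Raft writes."""
    quorum = cluster_size // 2 + 1
    return reachable_members >= quorum

print(has_quorum(5, 3))  # True: 3 of 5 members meet the quorum of 3
print(has_quorum(5, 2))  # False: below quorum, writes stall
```

This is why a 5-node cluster tolerates two lost peers but becomes unavailable for writes on the third loss.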

DisconnectedPeers

Indicates the number of nodes in the etcd cluster that are temporarily unable to communicate with their peers due to network issues, failures, or configuration problems.

Number

Monitoring the number of disconnected peers helps detect when nodes are becoming isolated and may not be able to participate in data replication or consensus.