Ignite TCP Discovery SPI Test

Apache Ignite is a distributed data system that relies on data distributed across nodes for storage, resilience, and scalability. Given this design, one of its key capabilities is discovering nodes and adding them to the cluster. The TCP Discovery SPI defines the network parameters of the default discovery mechanism, which uses the TCP/IP protocol to exchange discovery messages and is implemented in the TcpDiscoverySpi class. The properties of the discovery mechanism can be changed by defining the configuration accordingly.
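As a minimal sketch of what defining the configuration can look like in Java, the snippet below sets a static IP finder and two network properties on TcpDiscoverySpi before starting a node. It assumes the ignite-core library is on the classpath. The class name DiscoveryConfigSketch, the address range, the local port, and the timeout value are illustrative assumptions, not recommendations.

```java
import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class DiscoveryConfigSketch {
    public static void main(String[] args) {
        // Static list of addresses where other nodes may be found (assumed values).
        TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
        ipFinder.setAddresses(Collections.singletonList("127.0.0.1:47500..47509"));

        // Network parameters of the TCP discovery mechanism (assumed values).
        TcpDiscoverySpi discoverySpi = new TcpDiscoverySpi();
        discoverySpi.setIpFinder(ipFinder);
        discoverySpi.setLocalPort(47500);
        discoverySpi.setNetworkTimeout(5000); // milliseconds

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discoverySpi);

        // Start a node with this discovery configuration, then stop it on exit.
        try (Ignite ignite = Ignition.start(cfg)) {
            System.out.println("Nodes in topology: " + ignite.cluster().nodes().size());
        }
    }
}
```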

Because discovery is central to how the cluster operates, it is important to monitor this SPI and ensure that new nodes are added to the system quickly when they are introduced.

This test monitors the TCP Discovery SPI and collects key statistics that help administrators understand the state of the SPI and of inter-node communication, and take action if there is an issue.

Target of the test : Apache Ignite Server

Agent deploying the test : An internal or external agent

Outputs of the test : One set of results for each Apache Ignite Server

Configurable parameters for the test

| Parameter | Description |
|---|---|
| Test period | How often should the test be executed. |
| Host | Enter the IP address of the Apache Ignite cluster. |
| Port | Enter the port number on which the JMX connector listens for incoming connection requests. |
| JMX Remote Port | Enter the port number on which the JMX connector listens for remote requests. The JMX connector listens on 8686 by default. If it listens on a different port in your environment, specify that port here. |
| JMX User | Specify the name of the user who is authorized to use JMX. |
| JMX Password | Specify the password of the authorized user. |
| Confirm Password | Confirm the password by retyping it here. |
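The Host, JMX Remote Port, JMX User, and JMX Password parameters map naturally onto a standard JMX-over-RMI connection. The sketch below shows how such a connection could be described with the JDK's javax.management.remote API. It is an illustration only, not a description of how the test is implemented; the class name JmxEndpoint, its method names, and the /jmxrmi path are assumptions.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.util.HashMap;
import java.util.Map;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper: turns the test parameters into a JMX connection.
class JmxEndpoint {

    // Builds the conventional JMX-over-RMI service URL for a host and JMX remote port.
    static JMXServiceURL serviceUrl(String host, int jmxRemotePort) {
        try {
            return new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + jmxRemotePort + "/jmxrmi");
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Invalid JMX host or port", e);
        }
    }

    // Packs the JMX user and password into the environment map expected by the connector.
    static Map<String, Object> credentials(String user, String password) {
        Map<String, Object> env = new HashMap<>();
        env.put(JMXConnector.CREDENTIALS, new String[] {user, password});
        return env;
    }

    // Opens the connection; requires a reachable JMX connector on the target host.
    static JMXConnector connect(String host, int jmxRemotePort, String user, String password)
            throws IOException {
        return JMXConnectorFactory.connect(
            serviceUrl(host, jmxRemotePort), credentials(user, password));
    }
}
```

For example, JmxEndpoint.connect("192.0.2.10", 8686, "monitor", "secret") would attempt to reach a connector on the default port mentioned above; the host and credentials here are placeholders.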

Measurements made by the test

| Measurement | Description | Measurement Unit | Interpretation |
|---|---|---|---|
| SPI state | Indicates the current state of the Service Provider Interface (SPI). | Boolean | If the SPI is not healthy, discovery and communication between the nodes will stop. Administrators need to ensure that the SPI is working optimally before starting the cluster. |
| Nodes failed | Indicates the number of nodes that are in a failed state. | Number | If the cluster has too many failed nodes that are still listed in the cluster configuration, cluster startup may slow down. |
| Nodes joined | Indicates the number of nodes that have joined since the cluster was started. | Number | If nodes are able to join and integrate seamlessly into the cluster, it is a good sign that the SPI is working properly. |
| Nodes left | Indicates the number of nodes that have left since the cluster was started. | Number | If too many nodes have left the cluster recently, they may need to be removed from the cluster configuration; otherwise, they will slow down inter-node communication. |
| Pending messages discarded | Indicates the number of messages discarded because the target node could not be discovered. | Number | If too many pending messages are discarded, it means that some nodes have left the cluster but the cluster configuration has not been updated. |
| Pending messages registered | Indicates the number of messages that are yet to be delivered to the target node. | Number | If too many pending messages have not yet been discarded, it means that some nodes have left the cluster but the cluster configuration has not been updated. |
| Total message processed rate | Indicates the total number of messages processed per second through the discovery SPI. | Messages/Sec | If this rate keeps falling over a range of measurements, investigate the cause. |
| Total message received rate | Indicates the total number of messages received per second through the discovery SPI. | Messages/Sec | A low value is desired for this measure. |
| Message worker queue size | Indicates the size of the queue of discovery messages that are waiting to be sent to other nodes. | MB | The worker queue size should be maintained at an optimal value. A queue that keeps growing means that discovery messages are accumulating faster than they are being sent. |
| Average message processing time | Indicates the average time taken to process each message through the system. | Seconds | Watch the trend of this measure. If the processing time keeps rising over a range of measurements, it is a cause for concern. |
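Several interpretations above depend on how a measure moves across a range of measurements rather than on a single value. The sketch below illustrates two simple calculations that could support that kind of analysis: deriving a per-second rate from two readings of a cumulative message counter, and estimating a trend as the least-squares slope of evenly spaced samples. The class name DiscoveryTrend, its method names, and the idea that rates are derived from cumulative counters are assumptions made for illustration; they do not describe how the test computes its measures.

```java
// Hypothetical helper for interpreting discovery measures over time.
class DiscoveryTrend {

    // Per-second rate from two readings of a cumulative counter.
    // Returns NaN when the interval is invalid or the counter went backwards
    // (for example, after a node restart).
    static double ratePerSecond(long previousCount, long currentCount, long elapsedMillis) {
        if (elapsedMillis <= 0 || currentCount < previousCount) {
            return Double.NaN;
        }
        return (currentCount - previousCount) * 1000.0 / elapsedMillis;
    }

    // Least-squares slope of evenly spaced samples, in units per measurement period.
    // A positive slope means the measure is trending upward.
    static double slope(double[] samples) {
        int n = samples.length;
        if (n < 2) {
            return 0.0;
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0.0;
        for (double s : samples) {
            meanY += s;
        }
        meanY /= n;
        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < n; i++) {
            numerator += (i - meanX) * (samples[i] - meanY);
            denominator += (i - meanX) * (i - meanX);
        }
        return numerator / denominator;
    }

    // True when the samples rise faster than the given tolerance per period.
    static boolean isRising(double[] samples, double tolerance) {
        return slope(samples) > tolerance;
    }
}
```

For instance, isRising applied to a window of Average message processing time readings flags the rising trend that the table describes as a cause for concern, while a negative slope over Total message processed rate readings corresponds to the falling rate that should be investigated. The tolerance is a tuning choice for each environment.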