Cassandra Read Repairs Test

Because Cassandra is a distributed database, the data on one replica can, over time, become inconsistent with the data on other replicas. Node repair corrects these inconsistencies so that eventually all nodes hold the same, most up-to-date data, and it is an important part of regular maintenance for every Cassandra cluster. Read repair complements it by improving consistency in a Cassandra cluster during read requests.

During a read, the coordinator node sends a full data request to one replica node. For consistency levels (CL) greater than ONE, it also sends digest requests to as many additional replicas as the CL requires. If all of these replicas return consistent data, the coordinator returns the data to the client.
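The digest comparison can be illustrated with a small Python sketch. It is a simplified model, not Cassandra's implementation: the coordinator hashes the full data returned by one replica and checks it against the digests returned by the other replicas. SHA-256 and the function names `digest` and `coordinate_read` are illustrative assumptions.

```python
import hashlib
from typing import List, Tuple


def digest(value: bytes) -> str:
    # Illustrative digest: a hash of the replica's response.
    # SHA-256 is an assumption for this sketch, not necessarily Cassandra's choice.
    return hashlib.sha256(value).hexdigest()


def coordinate_read(data_response: bytes, digest_responses: List[str]) -> Tuple[bytes, bool]:
    """Return the data from the data replica and whether every digest matched it."""
    expected = digest(data_response)
    consistent = all(d == expected for d in digest_responses)
    return data_response, consistent


# Example: one data replica plus one digest replica (e.g. QUORUM with three replicas).
row = b"user42:alice@new.example"
print(coordinate_read(row, [digest(row)])[1])                          # True
print(coordinate_read(row, [digest(b"user42:alice@old.example")])[1])  # False
```

When every digest matches, the coordinator can answer immediately; a mismatch is the condition that leads to read repair. With CL ONE there are no digest responses, so there is nothing to compare.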

In read repair, Cassandra sends a digest request to each replica not directly involved in the read. Cassandra compares the responses from the replicas and writes the most recent version to any replica node that does not have it. If the query's consistency level is above ONE, Cassandra performs this comparison and repair in the foreground, before the data is returned to the client. Read repair only repairs nodes queried by the read. This means that for a consistency level of ONE, no data is repaired because no comparison takes place, and for QUORUM, only the nodes that the query touches are repaired, not all nodes.
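The comparison step can be sketched as last-write-wins reconciliation. This Python sketch assumes each replica returns a single cell with a write timestamp; real Cassandra reconciles cell by cell and also handles deletions (tombstones), which are omitted here. The names `Cell` and `reconcile` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Cell:
    value: str
    timestamp: int  # write timestamp, e.g. microseconds since the epoch


def reconcile(responses: Dict[str, Cell]) -> Tuple[Cell, List[str]]:
    """Pick the most recent cell and list the replicas that need a repair write.

    Only replicas present in `responses` (the ones the read queried) are
    considered, so replicas the query did not touch are never repaired here.
    Ties on timestamp are broken by value only to keep the sketch deterministic.
    """
    newest = max(responses.values(), key=lambda c: (c.timestamp, c.value))
    stale = sorted(node for node, cell in responses.items() if cell != newest)
    return newest, stale


# QUORUM read that touched two of three replicas: one of them is stale.
newest, to_repair = reconcile({
    "10.0.0.1": Cell("alice@new.example", 200),
    "10.0.0.2": Cell("alice@old.example", 100),
})
print(newest.value, to_repair)  # alice@new.example ['10.0.0.2']

# Consistency level ONE: a single response, nothing to compare or repair.
print(reconcile({"10.0.0.1": Cell("alice@old.example", 100)})[1])  # []
```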

There are three types of read requests that a coordinator can send to a replica:

A direct read request
A digest request
A background read repair request

In a direct read request, the coordinator node contacts one replica node. The coordinator then sends digest requests to a number of replicas determined by the consistency level specified by the client; these digest requests check whether the data on those replicas is up to date. Next, the coordinator sends digest requests to all remaining replicas. If any of those replicas have out-of-date data, a background read repair request is sent to them. Read repair requests ensure that the requested row is made consistent on all replicas involved in a read query.
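The three request types can be modeled as a plan that assigns one type to each replica. In this Python sketch, the replica list is assumed to be ordered by preference (for example, by the snitch's latency ranking), `required` is the number of replicas the consistency level needs, and `check_remaining` stands in for whatever decides that the remaining replicas are also checked (in Cassandra versions before 4.0, the table-level `read_repair_chance` options played this role). All names are hypothetical.

```python
from typing import Dict, List


def plan_read(replicas: List[str], required: int, check_remaining: bool) -> Dict[str, str]:
    """Assign a read request type to each replica, in preference order.

    required        -- replicas needed by the consistency level (e.g. 2 for QUORUM, RF=3)
    check_remaining -- whether the other replicas also get digest requests whose
                       mismatches are fixed by background read repair
    """
    if not 1 <= required <= len(replicas):
        raise ValueError("consistency level cannot be met by these replicas")
    plan = {replicas[0]: "DIRECT_READ"}           # full data from one replica
    for node in replicas[1:required]:
        plan[node] = "DIGEST"                     # digests needed to meet the CL
    if check_remaining:
        for node in replicas[required:]:
            plan[node] = "BACKGROUND_DIGEST"      # a mismatch triggers background repair
    return plan


print(plan_read(["n1", "n2", "n3"], required=2, check_remaining=True))
# {'n1': 'DIRECT_READ', 'n2': 'DIGEST', 'n3': 'BACKGROUND_DIGEST'}
print(plan_read(["n1", "n2", "n3"], required=1, check_remaining=False))
# {'n1': 'DIRECT_READ'}
```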

In some environments, network issues or the failure of multiple nodes can occasionally prevent data from being replicated to all nodes. When such a node becomes available again after a short outage, it may receive a sudden influx of read repair requests that replace its outdated data with up-to-date data. Such an influx is a cause for concern when nodes are not brought up to date at regular intervals, so read repairs should be monitored frequently. To identify erratic behavior in read repair requests and in the background repairs performed on the nodes, administrators can use the Cassandra Read Repairs test.

Using this test, administrators can determine the number of background read repairs coordinated by the node and the number of read repairs attempted by the node. The test also reports the number of full data digests coordinated by the node. By analyzing these counts at regular intervals, administrators can spot erratic patterns in the read repairs performed on the node and investigate why the node is not being kept up to date with the latest data.
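Analyzing the counts at regular intervals usually means turning cumulative counters into per-interval values. The Python sketch below assumes the counters only grow until the node restarts (a drop is treated as a restart) and flags an interval as erratic when its count exceeds a chosen multiple of the average of the earlier intervals. The factor of 3 and the function names are arbitrary, hypothetical choices, not the test's actual logic.

```python
from typing import List


def interval_deltas(samples: List[int]) -> List[int]:
    """Convert cumulative counter samples into per-interval counts.

    A drop in the counter is treated as a node restart, so the new sample
    itself is taken as the count for that interval.
    """
    return [cur - prev if cur >= prev else cur for prev, cur in zip(samples, samples[1:])]


def flag_spikes(deltas: List[int], factor: float = 3.0) -> List[int]:
    """Return indexes of intervals whose count exceeds `factor` x the mean of earlier ones."""
    flagged = []
    for i in range(1, len(deltas)):
        baseline = sum(deltas[:i]) / i
        # Floor the baseline at 1 so a history of zero repairs does not flag a handful of new ones.
        if deltas[i] > factor * max(baseline, 1.0):
            flagged.append(i)
    return flagged


samples = [100, 104, 109, 113, 160, 165]  # e.g. read repair counts polled each test period
deltas = interval_deltas(samples)
print(deltas)               # [4, 5, 4, 47, 5]
print(flag_spikes(deltas))  # [3]
```

A simple running mean is used here for readability; a median or a fixed threshold would be equally reasonable baselines.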

Target of the test: A Cassandra Database

Agent deploying the test: An external/remote agent

Outputs of the test: One set of results for the target Cassandra Database node being monitored.

Configurable parameters for the test
Parameters	Description
Test Period	How often the test should be executed.
Host	The host for which the test is to be configured.
Port	The port on which the specified host listens. By default, this is 9042.
JMX Remote Port	Specify the port at which JMX listens for requests from remote hosts (Cassandra's default JMX port is 7199). This must be the same port that you configured in the cassandra-env.sh file (if the target Cassandra Database node is installed on a Unix host) or the cassandra-env.ps1 file (if the target Cassandra Database node is installed on a Windows host) in the <CASSANDRA_HOME> directory used by the target Cassandra Database node. To learn how to specify the remote port, refer to Enabling JMX Support for JRE.
JMX User and JMX Password	If JMX requires authentication only (but no security), ensure that the JMX User and JMX Password parameters are configured with the credentials of a user with read-write access to JMX. To learn how to create this user, refer to Configuring the eG Agent to Support JMX Authentication.
Confirm Password	Confirm the Password by retyping it in this text box.
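The Port parameter (9042 by default, Cassandra's CQL native transport port) and the JMX Remote Port are different ports. As a hedged illustration of how Host and JMX Remote Port combine, the Python sketch below builds the standard RMI-based JMX service URL that generic JMX clients such as jconsole accept. 7199 is Cassandra's default JMX port; whether the eG agent builds exactly this URL internally is an assumption, and `jmx_service_url` is a hypothetical helper.

```python
def jmx_service_url(host: str, jmx_port: int = 7199) -> str:
    """Build the RMI-based JMX service URL for a remote JVM such as a Cassandra node.

    7199 is Cassandra's default JMX port; pass the value configured in
    cassandra-env.sh / cassandra-env.ps1 if it was changed.
    """
    if not 0 < jmx_port < 65536:
        raise ValueError(f"invalid JMX port: {jmx_port}")
    return f"service:jmx:rmi:///jndi/rmi://{host}:{jmx_port}/jmxrmi"


print(jmx_service_url("cassandra-01.example.com"))
# service:jmx:rmi:///jndi/rmi://cassandra-01.example.com:7199/jmxrmi
```

Connecting a JMX client to this URL from the agent host is a quick way to confirm that the configured JMX Remote Port is reachable before the test runs.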

Measurements made by the test
Measurement	Description	Measurement Unit	Interpretation
Background repaired	Indicates the number of background read repairs coordinated by the node.	Number	A high value for this measure indicates that the requested data is out of date and is being updated. This may point to a delay in the update process or to a node that was briefly unavailable, for example because of network issues.
Read repair attempts	Indicates the number of read repairs attempted by the node.	Number	A high value indicates that read repairs are being carried out to update out-of-date replica nodes. This may also point to a delay in the update process or to a node that was briefly unavailable, for example because of network issues.
Repaired blocking	Indicates the number of full data digests coordinated by the node.	Number	At any consistency level that involves more than one replica (that is, every read consistency level except ONE and LOCAL_ONE; ANY applies only to writes), if the digests returned by the replicas do not match, read repair is performed in a blocking fashion before results are returned. This means that if the repair does not complete in time, the read request fails with a timeout.
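Cassandra itself exposes read repair meters over JMX as MBeans named `org.apache.cassandra.metrics:type=ReadRepair,name=<MetricName>`, including `RepairedBackground`, `Attempted`, and `RepairedBlocking`. The Python sketch below maps each measure in the table to one of these MBeans. The mapping is an assumption based on the measure names, not a documented description of how the test collects its values, and the set of available MBeans can vary between Cassandra versions. Each meter's cumulative value is exposed through its `Count` attribute.

```python
READ_REPAIR_METRICS = {
    # measure in the table -> Cassandra ReadRepair metric name (assumed mapping)
    "Background repaired": "RepairedBackground",
    "Read repair attempts": "Attempted",
    "Repaired blocking": "RepairedBlocking",
}


def mbean_name(measure: str) -> str:
    """Return the JMX ObjectName string of the meter assumed to back a measure."""
    try:
        metric = READ_REPAIR_METRICS[measure]
    except KeyError:
        raise ValueError(f"unknown measure: {measure!r}") from None
    return f"org.apache.cassandra.metrics:type=ReadRepair,name={metric}"


for measure in READ_REPAIR_METRICS:
    print(f"{measure}: {mbean_name(measure)} (attribute: Count)")
```

Reading the `Count` attribute of these MBeans once per test period and differencing successive readings yields per-period counts comparable to the measures above.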