Redis Cluster Details Test

Redis Cluster provides a way to run a Redis installation where data is automatically sharded across multiple Redis nodes.

Redis Cluster also provides some degree of availability during partitions. In practical terms, this is the ability to continue operating when some nodes fail or are not able to communicate. However, the cluster stops operating in the event of larger failures (for example, when the majority of masters are unavailable).

Redis Cluster does not use consistent hashing, but a different form of sharding where every key is conceptually part of what we call a hash slot.

Every node in a Redis Cluster is responsible for a subset of the hash slots, so for example you may have a cluster with 3 nodes, where:

  • Node A contains hash slots from 0 to 5500.

  • Node B contains hash slots from 5501 to 11000.

  • Node C contains hash slots from 11001 to 16383.

This makes it easy to add and remove nodes in the cluster.
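As a rough illustration of how a key ends up on a node, the sketch below computes a hash slot and looks it up in the example layout above. It relies on a detail not stated on this page: Redis Cluster derives the slot as CRC16(key) modulo 16384, using the XMODEM variant of CRC16. The function names and the SLOT_RANGES table are hypothetical, and hash tags (the {...} syntax in key names) are deliberately ignored.

```python
# Example layout from the text above: node name -> inclusive hash slot range.
SLOT_RANGES = {"A": (0, 5500), "B": (5501, 11000), "C": (11001, 16383)}
TOTAL_SLOTS = 16384


def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def slot_for_key(key: bytes) -> int:
    """Map a key to a hash slot (simplified: hash tags are not handled)."""
    return crc16_xmodem(key) % TOTAL_SLOTS


def node_for_slot(slot: int) -> str:
    """Return the example node whose range contains the given slot."""
    for node, (first, last) in SLOT_RANGES.items():
        if first <= slot <= last:
            return node
    raise ValueError(f"slot {slot} is not assigned to any node")
```

Calling node_for_slot(slot_for_key(b"some-key")) returns the node that would serve that key in the example layout. Moving a slot to another node only changes the range table; the key-to-slot mapping stays the same.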

In order to remain available when a subset of master nodes fail or are not able to communicate with the majority of nodes, Redis Cluster uses a master-slave model where every hash slot has from 1 (the master itself) to N replicas (N-1 additional slave nodes).

In our example cluster with nodes A, B, and C, if node B fails, the cluster is not able to continue, since there is no longer a way to serve hash slots in the range 5501-11000.

However, if a slave node is added to every master when the cluster is created (or at a later time), the final cluster is composed of A, B, and C as master nodes, and A1, B1, and C1 as slave nodes. This way, the system is able to continue if node B fails.

Because node B1 replicates B, if B fails, the cluster will promote node B1 as the new master and will continue to operate correctly.

However, note that if nodes B and B1 fail at the same time, Redis Cluster is not able to continue to operate.

To avoid this, administrators must monitor the Redis cluster, understand how many master nodes it is composed of, track the status of the hash slots assigned to each node, and be promptly alerted if any hash slot fails. To achieve this, administrators can use the Redis Cluster Details test.

For a cluster-enabled Redis instance, this test reports the composition of the cluster in terms of the number of master nodes and hash slots assigned to the cluster. In addition, the test tracks the status of the hash slots, and notifies administrators if any hash slot fails. The test also alerts administrators if any node is added to or removed from the cluster.

Target of the test : A Redis server

Agent deploying the test : An internal agent (recommended)

Outputs of the test : One set of results for the cluster-enabled instance

Configurable parameters for the test
Parameters Description

Test period

How often the test should be executed.

Host

The host for which the test is to be configured.

Port

The port at which the specified HOST listens.

Redis Password and Confirm Password

In some high security environments, a password may have been set for the Redis server, so as to protect it from unauthorized accesses/abuse. If such a password has been set for the monitored Redis server, then specify that password against REDIS PASSWORD. Then, confirm the password by retyping it against CONFIRM PASSWORD.

If the Redis server is not password protected, then do not disturb the default setting of this parameter.

To determine whether/not the target Redis server is password-protected, do the following:

  • Log in to the system hosting the Redis server.

  • Open the redis.conf file in the <REDIS_INSTALL_DIR>.

  • Look for the requirepass parameter in the file.

  • If this parameter exists, and is not preceded by a # (hash) symbol, it means that password protection is enabled for the Redis server. In this case, the string that follows the requirepass parameter is the password of the Redis server. For instance, say that the requirepass specification reads as follows:

    requirepass red1spr0

    According to this specification, the Redis server is protected using the password red1spr0. In this case therefore, you need to specify red1spr0 against REDIS PASSWORD.

  • On the other hand, if the requirepass parameter is prefixed by the # (hash) symbol as shown below, it means password protection is disabled.

    # requirepass red1spr0

    In this case, leave the REDIS PASSWORD parameter with its default setting.
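The manual check above can also be scripted. The following is a minimal sketch, assuming the contents of redis.conf have already been read into a string; the function name find_requirepass is hypothetical, and the sketch assumes that when several active requirepass lines exist, the last one applies.

```python
from typing import Optional


def find_requirepass(conf_text: str) -> Optional[str]:
    """Return the active requirepass value, or None if absent or commented out."""
    password = None
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        # Blank lines and lines starting with '#' are not active settings.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].lower() == "requirepass":
            # Assumption: a later active line overrides an earlier one.
            password = parts[1].strip().strip('"')
    return password
```

With the two specifications shown above, the function returns red1spr0 for the uncommented line and None for the line prefixed with #.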

Detailed Diagnosis

To make diagnosis more efficient and accurate, the eG Enterprise suite embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option.

The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

  • The eG manager license should allow the detailed diagnosis capability.
  • Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
Measurements made by the test
Measurement Description Measurement Unit Interpretation

Cluster enabled for this instance?

Indicates whether/not the cluster feature is enabled for the target Redis instance.

 

If the instance is cluster-enabled, then this measure will report the value Yes. For a cluster-disabled instance, this measure will report the value No.

The numeric values that correspond to these measure values are discussed in the table below:

Measure Value    Numeric Value
Yes              1
No               0

Note:

This measure reports the Measure Values listed in the table above to indicate whether/not the target instance is cluster-enabled. The graph of this measure however, indicates the same using the numeric equivalents only.

Current configuration version assigned to this node

Indicates the Config Epoch of this node.

Number

In essence, the epoch is a logical clock for the cluster: information tagged with a given epoch wins over information tagged with a smaller epoch.

An Epoch is used in order to give incremental versioning to events. When multiple nodes provide conflicting information, it becomes possible for another node to understand which state is the most up to date.

Every master always advertises its configEpoch in ping and pong packets along with a bitmap advertising the set of slots it serves. Slave nodes also advertise the configEpoch field in ping and pong packets, but in the case of slaves the field represents the configEpoch of its master as of the last time they exchanged packets.

A new configEpoch is created during slave election. Slaves trying to replace failing masters increment their epoch and try to get authorization from a majority of masters. When a slave is authorized, a new unique configEpoch is created and the slave turns into a master using the new configEpoch.

Number of hash slots in OK state

Indicates the number of hash slots in the cluster that are in the OK state.

Number

If the value of this measure is the same as the value of the Number of hash slots assigned to cluster measure, it means that all hash slots mapped to all nodes in the cluster are working correctly.

On the other hand, if the value of this measure is much lower than the value of the Number of hash slots assigned to cluster measure, it means that the hash slots mapped to some nodes are in the FAIL or PFAIL state. You may want to look up the values of the Number of hash slots in PFAIL state and Number of hash slots in FAIL state measures to confirm this.
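This page does not say how the test collects these values. One plausible source is the Redis CLUSTER INFO command, whose output includes the fields cluster_slots_assigned, cluster_slots_ok, cluster_slots_pfail, and cluster_slots_fail. The sketch below parses such output and applies the comparison described above; the sample text is illustrative, not captured from a real cluster, and the function names are hypothetical.

```python
# Illustrative CLUSTER INFO-style output (sample values only).
SAMPLE_CLUSTER_INFO = """\
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
"""


def parse_cluster_info(text: str) -> dict:
    """Turn 'key:value' lines into a dict, converting numeric values to int."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            info[key] = int(value) if value.isdigit() else value
    return info


def unhealthy_slots(info: dict) -> int:
    """Number of assigned hash slots that are not in the OK state."""
    return info["cluster_slots_assigned"] - info["cluster_slots_ok"]
```

A result of 0 from unhealthy_slots corresponds to the case where both measures report the same value. Any other result suggests checking the PFAIL and FAIL slot counts.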

Number of hash slots in PFAIL state

Indicates the number of hash slots that are mapped to a node in PFAIL state.

Number

Ideally, the value of this measure should be very low or 0.

A node flags another node with the PFAIL flag when the node is not reachable for more than NODE_TIMEOUT time. Both master and slave nodes can flag another node as PFAIL, regardless of its type.

Note that those hash slots still work correctly, as long as the PFAIL state is not promoted to FAIL by the failure detection algorithm. PFAIL only means that the node cannot currently be reached, which may be just a transient error.

Number of hash slots in FAIL state

Indicates the number of hash slots that are mapped to a node in FAIL state.

Number

Every node sends gossip messages to every other node, including the state of a few random known nodes. Every node eventually receives a set of node flags for every other node. This way, every node has a mechanism to signal other nodes about failure conditions it has detected.

A PFAIL condition is escalated to a FAIL condition when the following set of conditions is met:

  • Some node, say node A, has another node B flagged as PFAIL.

  • Node A collected, via gossip sections, information about the state of B from the point of view of the majority of masters in the cluster.

  • The majority of masters signaled the PFAIL or FAIL condition within NODE_TIMEOUT * FAIL_REPORT_VALIDITY_MULT time. (The validity factor is set to 2 in the current implementation, so this is just two times the NODE_TIMEOUT time).

If all the above conditions are true, Node A will:

  • Mark the node as FAIL.

  • Send a FAIL message to all the reachable nodes.

Ideally therefore, the value of this measure should be 0.
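The escalation rule described above can be sketched as a small decision function. This is a simplified model, not the Redis implementation: the function and parameter names are hypothetical, a majority is assumed to mean more than half of the masters, and the question of whether node A counts its own view is left out.

```python
FAIL_REPORT_VALIDITY_MULT = 2  # validity factor quoted in the text above


def should_mark_fail(flagged_pfail_locally: bool,
                     report_times: dict,
                     master_count: int,
                     node_timeout: float,
                     now: float) -> bool:
    """Decide whether node A should escalate node B from PFAIL to FAIL.

    report_times maps a master's ID to the time (in seconds) of its most
    recent PFAIL or FAIL report about node B, as learned via gossip.
    """
    # Condition 1: node A itself must already have B flagged as PFAIL.
    if not flagged_pfail_locally:
        return False
    # Condition 3: only reports inside the validity window are counted.
    window = node_timeout * FAIL_REPORT_VALIDITY_MULT
    fresh_reports = sum(1 for t in report_times.values() if now - t <= window)
    # Condition 2: a majority of masters must agree (assumed: more than half).
    return fresh_reports >= master_count // 2 + 1
```

For example, with 3 masters and a NODE_TIMEOUT of 15 seconds, two reports received within the last 30 seconds are enough to escalate, while a single fresh report is not.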

Number of messages sent via the cluster node-to-node

Indicates the number of messages sent via the cluster node-to-node binary bus.

Number

All the cluster nodes are connected using a TCP bus and a binary protocol, called the Redis Cluster Bus. Every node is connected to every other node in the cluster using the cluster bus. Nodes use a gossip protocol to propagate information about the cluster in order to discover new nodes, to send ping packets to make sure all the other nodes are working properly, and to send cluster messages needed to signal specific conditions. The cluster bus is also used in order to propagate Pub/Sub messages across the cluster and to orchestrate manual failovers when requested by users (manual failovers are failovers which are not initiated by the Redis Cluster failure detector, but by the system administrator directly).

Number of messages received via the cluster node-to-node

Indicates the number of messages received via the cluster node-to-node binary bus.

Number

Cluster state

Indicates the current state of the cluster.

 

This measure can report any of the following values:

  • OK: If the node is able to receive queries, then this measure will report the value OK.

  • Fail: If there is at least one hash slot that is unbound (no node associated) or in an error state (the node serving it is flagged with the FAIL flag), or if the majority of masters can't be reached by this node, then this measure will report the value Fail.

The numeric values that correspond to the measure values discussed above are as follows:

Measure Value    Numeric Value
Fail             0
OK               1

Note:

This measure reports the Measure Values listed in the table above to indicate the cluster state. The graph of this measure however, indicates the same using the numeric equivalents only.

Number of nodes in the cluster

Indicates the number of nodes in the cluster.

Number

To know the details of the nodes in the cluster, use the detailed diagnosis of this measure.

Fail over configuration version assigned to this node

Indicates the local current Epoch variable.

Number

In essence, the epoch is a logical clock for the cluster: information tagged with a given epoch wins over information tagged with a smaller epoch.

An Epoch is used in order to give incremental versioning to events. When multiple nodes provide conflicting information, it becomes possible for another node to understand which state is the most up to date.

The currentEpoch is a 64 bit unsigned number.

At node creation, every Redis Cluster node, whether slave or master, sets the currentEpoch to 0.

Every time a packet is received from another node, if the epoch of the sender (part of the cluster bus messages header) is greater than the local node epoch, the currentEpoch is updated to the sender epoch.

Because of these semantics, eventually all the nodes will agree to the greatest currentEpoch in the cluster.

This information is used when the state of the cluster is changed and a node seeks agreement in order to perform some action.

Currently this happens only during slave promotion.
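The currentEpoch update rule described above can be modelled in a few lines. The sketch below is a toy model with hypothetical class and method names; it covers only the rule for adopting a greater epoch from a received packet, not the election process.

```python
class ClusterNode:
    """Toy model of a node that tracks only its currentEpoch."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.current_epoch = 0  # every node starts at 0 on creation

    def on_packet(self, sender_epoch: int) -> None:
        """Adopt the sender's epoch if it is greater than the local one."""
        if sender_epoch > self.current_epoch:
            self.current_epoch = sender_epoch


def exchange(nodes: list) -> None:
    """Let every node send its current epoch to every other node once."""
    for sender in nodes:
        for receiver in nodes:
            if receiver is not sender:
                receiver.on_packet(sender.current_epoch)
```

If one node in a group holds a greater epoch than the rest, repeated calls to exchange leave every node holding that greatest value, which mirrors the eventual agreement described above.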

Number of hash slots assigned to cluster

Indicates the total number of hash slots assigned to the cluster.

Number

 

Number of master nodes in cluster

Indicates the number of master nodes in the cluster.

Number

 

Number of nodes added to the cluster

Indicates the number of nodes added to the cluster.

Number

Use the detailed diagnosis of this measure to know which nodes were recently added to the cluster.

Number of nodes deleted from the cluster

Indicates the number of nodes deleted from the cluster.

Number

Use the detailed diagnosis of this measure to know which nodes were recently deleted from the cluster.
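This page does not describe how the test detects membership changes. One simple approach, shown here purely as a sketch, is to compare the set of node IDs seen in two consecutive polls (for instance, IDs taken from the output of the Redis CLUSTER NODES command). The function name diff_nodes is hypothetical.

```python
def diff_nodes(previous: set, current: set) -> tuple:
    """Return (added, removed) node IDs between two polls, each sorted."""
    added = sorted(current - previous)
    removed = sorted(previous - current)
    return added, removed
```

For example, if the previous poll saw nodes A, B, and C and the current poll sees A, C, and D, the function reports D as added and B as removed.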