Redis Cluster Details Test
Redis Cluster provides a way to run a Redis installation where data is automatically sharded across multiple Redis nodes.
Redis Cluster also provides some degree of availability during partitions, that is, in practical terms, the ability to continue operating when some nodes fail or are unable to communicate. However, the cluster stops operating in the event of larger failures (for example, when the majority of masters are unavailable).
Redis Cluster does not use consistent hashing, but a different form of sharding where every key is conceptually part of what we call a hash slot.
Every node in a Redis Cluster is responsible for a subset of the hash slots, so for example you may have a cluster with 3 nodes, where:
- Node A contains hash slots from 0 to 5500.
- Node B contains hash slots from 5501 to 11000.
- Node C contains hash slots from 11001 to 16383.
This makes it easy to add and remove nodes in the cluster, because hash slots can be moved from one node to another.
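To make the slot mapping concrete, the following minimal sketch computes a key's hash slot and looks up the owning node in the three-node example. It assumes the slot is the CRC16 (XMODEM variant) of the key modulo 16384, which is how Redis Cluster is documented to assign slots. Hash tags are ignored here, and the names `key_slot`, `node_for_slot`, and `SLOT_RANGES` are illustrative only.

```python
import binascii

TOTAL_SLOTS = 16384  # slots are numbered 0..16383, as in the example

# Inclusive slot ranges copied from the three-node example (illustrative).
SLOT_RANGES = {
    "A": (0, 5500),
    "B": (5501, 11000),
    "C": (11001, 16383),
}


def key_slot(key: bytes) -> int:
    """Return CRC16 (XMODEM variant) of the key, modulo the slot count.

    Assumption: hash tags ({...}) are not handled in this sketch.
    """
    return binascii.crc_hqx(key, 0) % TOTAL_SLOTS


def node_for_slot(slot: int) -> str:
    """Return the name of the example node whose range contains the slot."""
    for node, (first, last) in SLOT_RANGES.items():
        if first <= slot <= last:
            return node
    raise ValueError(f"slot {slot} is not assigned to any node")


if __name__ == "__main__":
    for key in (b"user:1000", b"123456789"):
        slot = key_slot(key)
        print(key.decode(), "-> slot", slot, "-> node", node_for_slot(slot))
```

Because a key maps to a slot rather than directly to a node, moving a slot range to a different node only changes the lookup table, not the way keys are hashed.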
In order to remain available when a subset of master nodes is failing or is not able to communicate with the majority of nodes, Redis Cluster uses a master-slave model where every hash slot has from 1 (the master itself) to N replicas (N-1 additional slave nodes).
In our example cluster with nodes A, B, C, if node B fails the cluster is not able to continue, since we no longer have a way to serve hash slots in the range 5501-11000.
However when the cluster is created (or at a later time) we add a slave node to every master, so that the final cluster is composed of A, B, C that are master nodes, and A1, B1, C1 that are slave nodes. This way, the system is able to continue if node B fails.
Because node B1 replicates B, if B fails the cluster will promote node B1 to be the new master and will continue to operate correctly.
However, note that if nodes B and B1 fail at the same time, Redis Cluster is not able to continue to operate.
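The failure scenarios above can be sketched as a small check. This is a simplified model, not the actual failover algorithm: it assumes the example topology (masters A, B, C with one slave each), and it treats the cluster as able to continue only if a majority of masters is still up and every master's hash slots have at least one surviving node. The names `REPLICAS` and `cluster_can_continue` are hypothetical.

```python
# Example topology from the text: each master has exactly one slave.
REPLICAS = {"A": ["A1"], "B": ["B1"], "C": ["C1"]}


def cluster_can_continue(failed) -> bool:
    """Simplified availability check for the example cluster.

    Assumptions: a majority of masters must be up (so that a slave can
    be promoted), and each master's slots need one live node to serve them.
    """
    failed = set(failed)
    masters = list(REPLICAS)
    live_masters = [m for m in masters if m not in failed]

    # Larger failure: the majority of masters is unavailable.
    if len(live_masters) * 2 <= len(masters):
        return False

    # Every slot range needs its master or one of its slaves alive.
    for master, slaves in REPLICAS.items():
        if all(node in failed for node in [master, *slaves]):
            return False
    return True


if __name__ == "__main__":
    for scenario in ([], ["B"], ["B", "B1"], ["A", "B"]):
        print(scenario, "->", cluster_can_continue(scenario))
```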
To avoid this, administrators must monitor the Redis cluster, understand how many master nodes it is composed of, track the status of the hash slots assigned to each node, and be promptly alerted if any hash slot fails. To achieve this, administrators can use the Redis Cluster Details test.
For a cluster-enabled Redis instance, this test reports the composition of the cluster in terms of the number of master nodes and hash slots assigned to the cluster. In addition, the test tracks the status of the hash slots, and notifies administrators if any hash slot fails. Moreover, the test also alerts administrators if any node is added to or removed from the cluster.
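One way to picture the added/removed node alerts is as a comparison of the node list between two consecutive measurement periods. The sketch below is only an illustration of that idea; how the test actually detects changes is not described here, and the function name `diff_nodes` and the node names are hypothetical.

```python
def diff_nodes(previous, current):
    """Return (added, removed) node names between two polls, sorted."""
    previous, current = set(previous), set(current)
    added = sorted(current - previous)
    removed = sorted(previous - current)
    return added, removed


if __name__ == "__main__":
    # Hypothetical node lists seen in two consecutive measurement periods.
    before = ["A", "A1", "B", "B1", "C", "C1"]
    after = ["A", "A1", "B1", "C", "C1", "D"]
    added, removed = diff_nodes(before, after)
    print("added:", added, "removed:", removed)
```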
Target of the test : A Redis server
Agent deploying the test : An internal agent (recommended)
Outputs of the test : One set of results for the cluster-enabled instance
Parameters | Description |
---|---|
Test period |
How often should the test be executed |
Host |
The host for which the test is to be configured. |
Port |
The port at which the specified HOST listens. |
Redis Password and Confirm Password |
In some high security environments, a password may have been set for the Redis server, so as to protect it from unauthorized accesses/abuse. If such a password has been set for the monitored Redis server, then specify that password against REDIS PASSWORD. Then, confirm the password by retyping it against CONFIRM PASSWORD. If the Redis server is not password protected, then do not disturb the default setting of this parameter. To determine whether/not the target Redis server is password-protected, do the following:
|
Detailed Diagnosis |
To make diagnosis more efficient and accurate, the eG Enterprise suite embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option. The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:
|
Measurement | Description | Measurement Unit | Interpretation |
---|---|---|---|
Cluster enabled for this instance? |
Indicates whether/not the cluster feature is enabled for the target Redis instance. |
|
If the instance is cluster-enabled, then this measure will report the value Yes. For a cluster-disabled instance, this measure will report the value No. The numeric values that correspond to these measure values are discussed in the table below:
Note: This measure reports the Measure Values listed in the table above to indicate whether/not the target instance is cluster-enabled. The graph of this measure however, indicates the same using the numeric equivalents only. |
||||||
Current configuration version assigned to this node |
Indicates the Config Epoch of this node. |
Number |
The epoch is a logical clock for the cluster: information with a higher epoch wins over information with a lower epoch. Epochs give incremental versioning to events, so that when multiple nodes provide conflicting information, another node can tell which state is the most up to date. Every master always advertises its configEpoch in ping and pong packets, along with a bitmap advertising the set of slots it serves. Slave nodes also advertise the configEpoch field in ping and pong packets, but in the case of slaves the field represents the configEpoch of their master as of the last time they exchanged packets. A new configEpoch is created during slave election. Slaves trying to replace failing masters increment their epoch and try to get authorization from a majority of masters. When a slave is authorized, a new unique configEpoch is created and the slave turns into a master using the new configEpoch. |
||||||
Number of hash slots in OK state |
Indicates the number of hash slots in the cluster that are in the OK state. |
Number |
If the value of this measure is the same as the value of the Number of hash slots assigned to cluster measure, it means that all hash slots mapped to all nodes in the cluster are working correctly. On the other hand, if the value of this measure is much lower than the value of the Number of hash slots assigned to cluster measure, it means that the hash slots mapped to some nodes are in the FAIL or PFAIL state. You may want to look up the values of the Number of hash slots in PFAIL state and Number of hash slots in FAIL state measures to confirm this. |
||||||
Number of hash slots in PFAIL state |
Indicates the number of hash slots that are mapped to a node in PFAIL state. |
Number |
Ideally, the value of this measure should be very low or 0. A node flags another node with the PFAIL flag when that node has not been reachable for more than NODE_TIMEOUT time. Both master and slave nodes can flag another node as PFAIL, regardless of its type. Note that those hash slots still work correctly, as long as the PFAIL state is not promoted to FAIL by the failure detection algorithm. PFAIL only means that the node cannot currently be reached, which may be just a transient error. |
||||||
Number of hash slots in FAIL state |
Indicates the number of hash slots that are mapped to a node in FAIL state. |
Number |
Every node sends gossip messages to every other node including the state of a few random known nodes. Every node eventually receives a set of node flags for every other node. This way every node has a mechanism to signal other nodes about failure conditions they have detected. A PFAIL condition is escalated to a FAIL condition when the following set of conditions are met:
If all the above conditions are true, Node A will:
Ideally therefore, the value of this measure should be 0. |
||||||
Number of messages sent via the cluster node-to-node |
Indicates the number of messages sent via the cluster node-to-node binary bus. |
Number |
All the cluster nodes are connected using a TCP bus and a binary protocol, called the Redis Cluster Bus. Every node is connected to every other node in the cluster using the cluster bus. Nodes use a gossip protocol to propagate information about the cluster in order to discover new nodes, to send ping packets to make sure all the other nodes are working properly, and to send cluster messages needed to signal specific conditions. The cluster bus is also used in order to propagate Pub/Sub messages across the cluster and to orchestrate manual failovers when requested by users (manual failovers are failovers which are not initiated by the Redis Cluster failure detector, but by the system administrator directly). |
||||||
Number of messages received via the cluster node-to-node |
Indicates the number of messages received via the cluster node-to-node binary bus. |
Number |
|||||||
Cluster state |
Indicates the current state of the cluster. |
|
This measure can report any of the following values:
The numeric values that correspond to the measure values discussed above are as follows:
Note: This measure reports the Measure Values listed in the table above to indicate the cluster state. The graph of this measure however, indicates the same using the numeric equivalents only. |
||||||
Number of nodes in the cluster |
Indicates the number of nodes in the cluster. |
Number |
To know the details of the nodes in the cluster, use the detailed diagnosis of this measure. |
||||||
Fail over configuration version assigned to this node |
Indicates the local current Epoch variable. |
Number |
The epoch is a logical clock for the cluster: information with a higher epoch wins over information with a lower epoch. Epochs give incremental versioning to events, so that when multiple nodes provide conflicting information, another node can tell which state is the most up to date. The currentEpoch is a 64-bit unsigned number. At node creation, every Redis Cluster node, slave or master, sets the currentEpoch to 0. Every time a packet is received from another node, if the epoch of the sender (part of the cluster bus message header) is greater than the local node epoch, the currentEpoch is updated to the sender epoch. Because of these semantics, all the nodes will eventually agree on the greatest currentEpoch in the cluster. This information is used when the state of the cluster is changed and a node seeks agreement in order to perform some action. Currently this happens only during slave promotion. |
||||||
Number of hash slots assigned to cluster |
Indicates the total number of hash slots assigned to the cluster. |
Number |
|
||||||
Number of master nodes in cluster |
Indicates the number of master nodes in the cluster. |
Number |
|
||||||
Number of nodes added to the cluster |
Indicates the number of nodes added to the cluster. |
Number |
Use the detailed diagnosis of this measure to know which nodes were recently added to the cluster. |
||||||
Number of nodes deleted from the cluster |
Indicates the number of nodes deleted from the cluster. |
Number |
Use the detailed diagnosis of this measure to know which nodes were recently deleted from the cluster. |
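As a rough illustration of how the slot-related measures relate to one another, the sketch below parses text in the `key:value` format returned by the Redis `CLUSTER INFO` command and compares the OK slot count with the assigned slot count. That this test reads these particular fields is an assumption, the sample values are made up for illustration, and the helper names `parse_cluster_info` and `slot_health` are hypothetical.

```python
# Made-up sample in the key:value format of CLUSTER INFO (illustrative only).
SAMPLE_CLUSTER_INFO = """\
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:6
cluster_my_epoch:2
"""


def parse_cluster_info(text: str) -> dict:
    """Turn key:value lines into a dict, converting numeric values to int."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        info[key] = int(value) if value.isdigit() else value
    return info


def slot_health(info: dict) -> str:
    """Compare OK slots with assigned slots, as the interpretation suggests."""
    assigned = info["cluster_slots_assigned"]
    ok = info["cluster_slots_ok"]
    if ok == assigned:
        return "all assigned slots OK"
    return (f"{assigned - ok} slot(s) not OK "
            f"(pfail={info['cluster_slots_pfail']}, "
            f"fail={info['cluster_slots_fail']})")


if __name__ == "__main__":
    info = parse_cluster_info(SAMPLE_CLUSTER_INFO)
    print(info["cluster_state"], "-", slot_health(info))
```

A gap between the assigned and OK counts is the cue to check the PFAIL and FAIL slot measures next.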