RabbitMQ Nodes Test

A RabbitMQ cluster is a logical grouping of one or several Erlang nodes, each running the RabbitMQ application and sharing users, virtual hosts, queues, exchanges, bindings, and runtime parameters.

A client can connect to any node and perform any operation. Nodes will route operations to the queue master node transparently to clients. In case of a node failure, clients will be able to reconnect to a different node, recover their topology and continue operation. Regardless of which node is serving client requests, at any point in time, administrators should be able to tell the operational state of each node in the cluster, so that the failed nodes can be identified.

Moreover, client connections, channels, and queues are distributed across cluster nodes. This means that all nodes in a cluster should be sized with adequate resources such as memory, bandwidth, disk space, file/socket descriptors, and Erlang processes. Administrators should be able to track the usage of these critical resources on each node, and pinpoint the node that is under-sized.

Additionally, administrators should observe the reads from and writes to the queue index journals, message store, and disk of each node to gauge the level of activity on each node and measure a node's ability to handle these activity levels.

The RabbitMQ Nodes test enables administrators to do all of the above. This test auto-discovers the nodes in a target cluster. For each node, the test then reports the state of that node, its uptime, and how its memory, file descriptors, socket descriptors, bandwidth, and Erlang processes have been utilized. Nodes that are down and those that are running out of resources are revealed in the process. Furthermore, the test reports the rate at which reads, writes, seeks, and syncs were performed on the disk of each node, thus revealing the I/O processing ability of each node. The time taken by every node to perform these I/O operations is also reported, so that slow, high-latency nodes can be identified.

Target of the test : A RabbitMQ Cluster

Agent deploying the test : A remote agent

Outputs of the test : One set of results for each node in the monitored RabbitMQ Cluster

Configurable parameters for the test
Parameters Description

Test period

How often the test should be executed.

Host

The host for which the test is to be configured.

Port

The port at which the configured Host listens; by default, this is 15672.

Username, Password, and Confirm Password

The eG agent connects to the Management Interface of the rabbitmq-management plugin of the target node, and runs HTTP-based API commands on the node using the plugin to pull metrics of interest. To connect to the plugin and run the API commands, the eG agent requires the privileges of a user on the cluster who has been assigned the 'monitoring' tag. If such a user pre-exists, then configure this test with the Username and Password of that user. On the other hand, if no such user exists, then you will have to create a user for this purpose using the Management Interface. The steps for this have been detailed in How Does eG Enterprise Monitor a RabbitMQ Cluster? In this case, make sure you configure this test with the Username and Password of the new user. Finally, confirm the password by retyping it in the Confirm Password text box.

SSL

By default, this flag is set to No, as the target node is not SSL-enabled by default. If the node is SSL-enabled, then set this flag to Yes.
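
As a rough illustration of how the Host, Port, Username, Password, and SSL settings fit together, the sketch below builds (but does not send) an HTTP request for the management plugin's /api/nodes endpoint. The function name build_nodes_request, the host name, and the credentials are hypothetical placeholders; how the eG agent issues its API calls internally is not described in this document.

```python
import base64
import urllib.request


def build_nodes_request(host, port=15672, username="monitor",
                        password="secret", ssl=False):
    """Build a GET request for the management API's /api/nodes endpoint.

    All argument values are placeholders for illustration only.
    """
    # The SSL flag decides between plain HTTP and HTTPS.
    scheme = "https" if ssl else "http"
    url = f"{scheme}://{host}:{port}/api/nodes"

    # The management API accepts HTTP Basic authentication.
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")

    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


req = build_nodes_request("rabbit1.example.com", ssl=False)
print(req.full_url)
```

Passing such a request to urllib.request.urlopen against a reachable node would be expected to return a JSON array with one object per cluster node, provided the user carries the 'monitoring' tag.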

Measurements made by the test
Measurement Description Measurement Unit Interpretation

Status

Indicates the current state of this node.

Number

The values that this measure can report and their corresponding numeric values are listed in the table below:

Measure Value    Numeric Value
Running          1
Stopped          0

Note:

This test reports the Measure Values listed in the table above to indicate the current operational state of a node. In the graph of this measure, however, these states are represented using their numeric equivalents.
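
The mapping between the two representations can be sketched as follows. This is only an illustration: it assumes each node record exposes a boolean 'running' field (as node objects returned by the management API are understood to do), and the function name status_measure is hypothetical.

```python
# Numeric equivalents taken from the Measure Value table above.
STATUS_VALUES = {"Running": 1, "Stopped": 0}


def status_measure(node):
    """Return (measure value, numeric value) for a node record.

    Assumes a boolean 'running' field; a missing field is treated
    as Stopped.
    """
    label = "Running" if node.get("running", False) else "Stopped"
    return label, STATUS_VALUES[label]


print(status_measure({"name": "rabbit@node1", "running": True}))
print(status_measure({"name": "rabbit@node2", "running": False}))
```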

Uptime

Indicates the uptime of this node (in days).

Days

Compare the value of this measure across nodes to identify the node that has been running the longest. An unusually low value on one node indicates that the node was restarted recently.
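
A minimal sketch of the unit conversion and the cross-node comparison is given below. It assumes that uptime is available in milliseconds, which is how the management API's 'uptime' field is understood to report it; the helper names and sample values are made up for illustration.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def uptime_days(uptime_ms):
    """Convert an uptime in milliseconds to days."""
    return uptime_ms / MS_PER_DAY


def lowest_uptime_node(nodes):
    """Return the name of the node with the smallest uptime,
    i.e. the one that was (re)started most recently."""
    return min(nodes, key=lambda node: node["uptime"])["name"]


# Sample values are invented for illustration.
sample = [
    {"name": "rabbit@node1", "uptime": 3 * MS_PER_DAY},
    {"name": "rabbit@node2", "uptime": MS_PER_DAY // 2},
]
for node in sample:
    print(node["name"], uptime_days(node["uptime"]))
print(lowest_uptime_node(sample))
```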

Maximum file descriptors

Indicates the maximum number of file descriptors that this node can use.

Number

By default, a node can use up to a maximum of 1024 file descriptors.

A file descriptor (FD, less frequently fildes) is an abstract indicator (handle) used to access a file or other input/output resource, such as a pipe or network socket.

Used file descriptors

Indicates the number of file descriptors that have been utilized on this node.

Number

This count includes both file and socket descriptors.

A file descriptor (FD, less frequently fildes) is an abstract indicator (handle) used to access a file or other input/output resource, such as a pipe or network socket.

A socket is an abstraction of a communication endpoint. Just as they would use file descriptors to access a file, applications use socket descriptors to access sockets. Socket descriptors are implemented as file descriptors in the UNIX System.

File descriptor usage

Indicates what percentage of the maximum number of file descriptors configured for this node is currently in use.

Percent

A value close to 100% is a cause for concern, as it indicates that the node is running out of file handles. If the value reaches 100%, then the node will block all incoming connections. To avoid this, you may want to consider increasing the maximum file descriptor configuration of the node.
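
The percentage is simply the used count divided by the configured maximum. The sketch below shows that arithmetic and flags nodes approaching the limit. The field names 'fd_used' and 'fd_total' follow the management API's node objects as best understood, and the 90% threshold is an arbitrary assumption, not a product default.

```python
def usage_percent(used, maximum):
    """Return used/maximum as a percentage; 0.0 if no limit is reported."""
    if maximum <= 0:
        return 0.0
    return 100.0 * used / maximum


def nodes_near_fd_limit(nodes, threshold=90.0):
    """List (name, percent) for nodes at or above the assumed threshold."""
    flagged = []
    for node in nodes:
        pct = usage_percent(node["fd_used"], node["fd_total"])
        if pct >= threshold:
            flagged.append((node["name"], round(pct, 1)))
    return flagged


# Sample values are invented for illustration.
sample = [
    {"name": "rabbit@node1", "fd_used": 980, "fd_total": 1024},
    {"name": "rabbit@node2", "fd_used": 256, "fd_total": 1024},
]
print(nodes_near_fd_limit(sample))
```

The same calculation applies to socket descriptors and Erlang processes, using their respective used and maximum counts.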

Maximum socket descriptors

Indicates the maximum number of socket descriptors that this node can use.

Number

A socket is an abstraction of a communication endpoint. Just as they would use file descriptors to access a file, applications use socket descriptors to access sockets.

Used socket descriptors

Indicates the number of socket descriptors this node is using.

Number

 

Socket descriptors usage

Indicates what percentage of the maximum number of socket descriptors configured for this node is presently in use.

Percent

A value close to 100% is a cause for concern, as it indicates that the node is running out of socket descriptors. If a node exhausts socket descriptors, then that node will block all incoming connections. To avoid this, you may want to consider increasing the maximum socket descriptor configuration of the node.

Maximum erlang processes

Indicates the maximum number of Erlang processes configured for this node.

Number

 

Used erlang processes

Indicates the number of Erlang processes that this node is currently using.

Number

 

Erlang process usage

Indicates what percentage of the maximum number of Erlang processes configured for this node is currently in use.

Percent

Queues, connections, and channels are the main components of RabbitMQ that consume processes. This means that the higher the number of queues, connections, and channels, the higher the usage of Erlang processes will be.

If the value of this measure is close to 100% for a node, it implies that that node is running out of Erlang processes. This could cause databases to hang and messages to start piling up on RabbitMQ. To avoid this, you may want to consider increasing the maximum number of Erlang processes configured for the node.

Maximum memory

Indicates the maximum amount of memory this node can use.

MB

 

Used memory

Indicates the amount of memory in use on this node.

MB

 

Memory usage

Indicates the percentage of memory that this node is using.

Percent

A value close to 100% indicates that the node is running out of memory. If the situation is allowed to persist, then the node may soon exhaust its memory completely. This could bring messaging operations to a standstill. 

At this juncture, you can start by looking at the common memory consumers on a node, which are as follows:

  • Connections
  • Channels
  • Queue masters, indices, and messages kept in memory
  • Queue mirrors, indices, and messages kept in memory
  • Binaries containing message bodies and metadata
  • Plugins
  • Mnesia tables and other ETS tables that keep an in-memory copy of their data
  • Memory used by code and atoms

If any of the afore-mentioned factors are large in number, then memory consumption too is likely to increase. In such a situation, to conserve memory, see if you can control the memory consumption of the aforesaid components by decreasing their count. For instance, see if you can optimize the number of channels your applications typically use and bring that number down.

Alternatively, you may want to consider increasing the memory allocation to the node.
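
To make the memory measures concrete, the sketch below converts byte counts to MB, computes the usage percentage, and ranks a per-category breakdown to show which consumers to examine first. It assumes the used and maximum figures arrive in bytes and that 1 MB = 1,048,576 bytes; the breakdown keys and all numbers are hypothetical.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes


def memory_measures(mem_used, mem_limit):
    """Return (used MB, maximum MB, usage percent) from byte counts."""
    used_mb = mem_used / BYTES_PER_MB
    max_mb = mem_limit / BYTES_PER_MB
    percent = 100.0 * mem_used / mem_limit if mem_limit > 0 else 0.0
    return used_mb, max_mb, percent


def top_consumers(breakdown, count=3):
    """Return the names of the largest memory consumers, biggest first."""
    ranked = sorted(breakdown.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:count]]


# Invented sample figures (breakdown values in MB).
print(memory_measures(768 * BYTES_PER_MB, 1024 * BYTES_PER_MB))
breakdown = {"connections": 120, "channels": 300, "queues": 90,
             "binaries": 200, "plugins": 40}
print(top_consumers(breakdown, count=2))
```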

RAM-only Mnesia transactions

Indicates the rate at which RAM-only Mnesia transactions take place on this node.

Transactions/Sec

An Mnesia transaction is a mechanism by which a series of database operations can be executed as one functional block.

If the transaction is performed on data stored exclusively in memory, it is a RAM-only Mnesia transaction. An example of such a transaction is creation/deletion of transient queues.

If the transaction is performed on data stored on disk, it is a disk transaction. An example of such a transaction is creation/deletion of durable queues.

Disk Mnesia transactions

Indicates the rate at which Mnesia transactions take place on this node's disk.

Transactions/Sec
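
A per-second rate such as Transactions/Sec can be derived from two readings of a cumulative counter. The sketch below shows that calculation under the assumption that transaction counts are exposed as ever-increasing counters (for example, 'mnesia_ram_tx_count' and 'mnesia_disk_tx_count' in the management API's node objects); the helper name and sample values are hypothetical, and this is not necessarily how the test itself computes its rates.

```python
def per_second_rate(prev_count, curr_count, interval_secs):
    """Rate of change between two samples of a cumulative counter."""
    if interval_secs <= 0:
        raise ValueError("interval_secs must be positive")
    delta = curr_count - prev_count
    if delta < 0:
        # The counter went backwards, e.g. after a node restart;
        # treat the current reading as the change since the reset.
        delta = curr_count
    return delta / interval_secs


# Two invented samples taken 60 seconds apart.
prev = {"mnesia_ram_tx_count": 100, "mnesia_disk_tx_count": 40}
curr = {"mnesia_ram_tx_count": 160, "mnesia_disk_tx_count": 70}
for key in prev:
    print(key, per_second_rate(prev[key], curr[key], 60))
```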

Index journals

Indicates the rate at which message information is written to queue index journals on this node.

Messages/Sec

Each record in a queue index journal represents a message being published to a queue, delivered from a queue, or acknowledged in a queue.

If the value of this measure keeps increasing consistently, it could indicate that one or more queues reside on the target node and that there is a high level of messaging activity on that node.

Store reads

Indicates the rate at which messages are read from the message store on this node.

Messages/Sec

Messages (the body, and any properties and / or headers) can either be stored directly in the queue index, or written to the message store.

The message store is a key-value store for messages, shared among all queues in the server. There are technically two message stores (one for transient and one for persistent messages) but they are usually considered together as "the message store".

If the value of the Store writes measure increases consistently, it could indicate the presence of many lazy queues on the node with many messages. To accommodate all these messages in the message store, the node will have to be sized with sufficient disk space and file descriptors.

Store writes

Indicates the rate at which messages are written to the message store on this node.

Messages/Sec

Index reads

Indicates the rate at which segment files are read from the queue index on this node.

Messages/Sec

The queue index is responsible for maintaining knowledge about where a given message is in a queue, along with whether it has been delivered and acknowledged. There is therefore one queue index per queue.

If the values reported by the Index reads and Index writes measures are consistently high for a node, it could mean that one or more queues reside on the node and many messages are stored in the queues.

Index writes

Indicates the rate at which segment files are written to the queue index on this node.

Messages/Sec

Disk reads

Indicates the rate at which read operations are performed on the disk on this node.

IOPS

Compare the values of the Disk reads, Disk writes, Disk seeks, and Disk syncs measures across nodes to identify the node that is experiencing a high level of disk activity. In the process, you can also identify which type of I/O operation is most common on that node: reads, writes, seeks, or syncs.

Disk writes

Indicates the rate at which write operations are performed on the disk on this node.

IOPS

Disk seeks

Indicates the rate at which seek operations are performed on the disk on this node. A seek operation happens when the disk physically locates a piece of data on it, when reading/writing.

IOPS

Disk syncs

Indicates the rate at which the node invokes fsync() to ensure data is flushed to the disk.

IOPS
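
The cross-node comparison of the four disk activity measures can be sketched as below: pick the node with the highest total I/O rate, then the operation type that contributes most on that node. The data layout, function name, and figures are hypothetical and serve only to illustrate the comparison.

```python
def busiest_node(io_rates):
    """Return (node, dominant operation) for the node with the
    highest combined read/write/seek/sync rate.

    io_rates maps node name -> {"reads": r, "writes": w,
                                "seeks": s, "syncs": y} in IOPS.
    """
    node = max(io_rates, key=lambda name: sum(io_rates[name].values()))
    dominant = max(io_rates[node], key=io_rates[node].get)
    return node, dominant


# Invented sample rates in IOPS.
sample = {
    "rabbit@node1": {"reads": 10.0, "writes": 120.0, "seeks": 5.0, "syncs": 40.0},
    "rabbit@node2": {"reads": 30.0, "writes": 20.0, "seeks": 2.0, "syncs": 8.0},
}
print(busiest_node(sample))
```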

Read bandwidth

Indicates the rate at which data is read by this node.

MB/Sec

The Read bandwidth and Write bandwidth measures are good indicators of the bandwidth used by a node when reading/writing. You can compare the values of these measures across nodes to know which node is consuming the maximum bandwidth.

Write bandwidth

Indicates the rate at which data is written by this node.

MB/Sec

Disk read time

Indicates the average time taken by this node to perform a read operation.

Millisecs

Compare the values of the Disk read time, Disk write time, Disk seek time, and Disk sync time measures across nodes to identify the node with the highest I/O latency. In the process, you can also identify which type of I/O operation is taking the longest on each node: reads, writes, seeks, or syncs.

Disk write time

Indicates the average time taken by this node to perform a write operation.

Millisecs

Disk seek time

Indicates the average time taken by this node to perform a seek operation.

Millisecs

Disk sync time

Indicates the average time taken by this node to complete an fsync().

Millisecs
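
One way to compare the four I/O time measures across nodes is to find each node's slowest operation type and then order the nodes by that worst-case time. The sketch below illustrates this; the data layout, function name, and figures are assumptions made for the example.

```python
def rank_by_latency(latency_ms):
    """Return [(node, slowest operation, avg ms)] ordered from the
    slowest node to the fastest.

    latency_ms maps node name -> {"read": ms, "write": ms,
                                  "seek": ms, "sync": ms}.
    """
    rows = []
    for node, times in latency_ms.items():
        slowest_op = max(times, key=times.get)
        rows.append((node, slowest_op, times[slowest_op]))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Invented average times in milliseconds.
sample = {
    "rabbit@node2": {"read": 0.3, "write": 0.9, "seek": 0.1, "sync": 2.0},
    "rabbit@node1": {"read": 0.4, "write": 1.2, "seek": 0.1, "sync": 6.5},
}
for row in rank_by_latency(sample):
    print(row)
```

The first row of the result identifies the node to investigate first and the operation type responsible for its delay.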