RabbitMQ Queues Test

A queue is best defined as a buffer that stores messages. The basic architecture of a message queue is simple: client applications called producers create messages and deliver them to the broker (the message queue). Other applications, called consumers, connect to the queue and subscribe to the messages to be processed. An application can be a producer, a consumer, or both a consumer and a producer of messages. Messages placed onto the queue are stored until the consumer retrieves them.

Since the contents of a queue are typically located on a single node in the cluster, failure of that node can deny consumers access to that queue and its contents. Administrators should therefore continuously track the status of the queues, identify queues that are down, and get the down queues up and running before users complain. Also, too many messages in a queue can cause contention for memory resources, which in turn can slow down message processing. To avoid this, administrators should track the length of each queue, promptly identify the queues that are consistently growing in length, and rapidly initiate measures to curb queue growth. The RabbitMQ Queues test helps administrators with all of the above!

This test auto-discovers the queues on a target node. For each queue, the test then reports the status of the queue, the total number of messages in the queue, and the type of messages (e.g., unacknowledged, published, confirmed) in each queue, so administrators can precisely pinpoint queues that are growing in length at an alarming rate, know what type of messages are in such queues, and accordingly decide how to control the growth in queue length. Additionally, the test reports the memory usage of each queue, so you can easily assess the impact of queue length on memory. The test also reports the rate at which the different types of messages in a queue are delivered to consumers, so that any bottleneck in delivery can be proactively detected and promptly fixed.

Target of the test : A RabbitMQ Cluster

Agent deploying the test : A remote agent

Outputs of the test : One set of results for each queue in the target node

First-level descriptor: Node name

Second-level descriptor: Queue name

Configurable parameters for the test
Parameters Description

Test period

How often the test should be executed.

Host

The host for which the test is to be configured.

Port

The port at which the configured Host listens; by default, this is 15672

Username, Password, and Confirm Password

The eG agent connects to the Management Interface of the rabbitmq-management plugin of the target node, and runs HTTP-based API commands on the node using the plugin to pull metrics of interest. To connect to the plugin and run the API commands, the eG agent requires the privileges of a user on the cluster who has been assigned the 'monitoring' tag. If such a user pre-exists, then configure this test with the Username and Password of that user. On the other hand, if no such user exists, then you will have to create a user for this purpose using the Management Interface. The steps for this have been detailed in How Does eG Enterprise Monitor a RabbitMQ Cluster? In this case, make sure you configure this test with the Username and Password of the new user. Finally, confirm the password by retyping it in the Confirm Password text box.
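As a rough illustration of the kind of HTTP call described above, the sketch below builds a Basic-authenticated request for the `/api/queues` endpoint of the management plugin. The host name, the credentials, and the `build_queues_request` helper are hypothetical, and this is not necessarily how the eG agent itself issues the request.

```python
import base64
import urllib.request


def build_queues_request(host, port=15672, username="monitor", password="secret", use_ssl=False):
    """Build (but do not send) a GET request for the management API's queue list."""
    scheme = "https" if use_ssl else "http"
    url = f"{scheme}://{host}:{port}/api/queues"
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical host and credentials; the user is assumed to carry the 'monitoring' tag.
req = build_queues_request("rabbit-node-1.example.com", username="eg_monitor", password="changeme")
print(req.full_url)
# To actually fetch the data against a live node:
#   json.load(urllib.request.urlopen(req, timeout=10))
```

The request is only constructed here, so the sketch runs without a broker; sending it requires a reachable node with the management plugin enabled.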

SSL

By default, this flag is set to No, as the target node is not SSL-enabled by default. If the node is SSL-enabled, then set this flag to Yes.

Num DD Messages

By default, this parameter is set to 10. This means that, by default, the detailed diagnosis of this test will report the details of the top-10 queues in terms of the number of messages in each queue. To view the details of more queues as part of detailed metrics, increase the value of this parameter. Likewise, to view the details of fewer than 10 queues, reduce the value of this parameter.
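The top-N selection described above can be pictured with a minimal sketch such as the following. The `top_queues_by_messages` helper and the sample data are made up for illustration; the `name` and `messages` keys are assumed to mirror the fields of the queue objects returned by the management API.

```python
def top_queues_by_messages(queues, limit=10):
    """Return the `limit` queues holding the most messages, largest first."""
    return sorted(queues, key=lambda q: q.get("messages", 0), reverse=True)[:limit]


# Made-up sample data for illustration only.
sample = [
    {"name": "orders", "messages": 120},
    {"name": "emails", "messages": 4500},
    {"name": "audit", "messages": 0},
]

top_two = top_queues_by_messages(sample, limit=2)
print([q["name"] for q in top_two])
```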

Individual Queues

If you want the test to report metrics for every queue, then set this flag to Yes. In this case, each queue process will be a descriptor of this test. On the other hand, if you want to receive an overview of the queue load on the cluster, you can set this flag to No. In this case, the test will report metrics for a Summary descriptor alone.
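To make the difference between per-queue descriptors and the Summary descriptor concrete, the sketch below collapses a list of per-queue records into a single summary. The `summarize` helper is hypothetical, the `state`, `messages`, and `messages_unacknowledged` keys are assumed to match the management API's queue objects, and treating every state other than "running" as idle/down is a simplifying assumption.

```python
def summarize(queues):
    """Collapse per-queue figures into one cluster-wide summary record."""
    running = sum(1 for q in queues if q.get("state") == "running")
    return {
        "queues": len(queues),
        "running_queues": running,
        # Assumption: anything that is not 'running' counts as idle/down.
        "idle_or_down_queues": len(queues) - running,
        "messages": sum(q.get("messages", 0) for q in queues),
        "unacknowledged": sum(q.get("messages_unacknowledged", 0) for q in queues),
    }


# Made-up sample data for illustration only.
per_queue = [
    {"name": "orders", "state": "running", "messages": 120, "messages_unacknowledged": 20},
    {"name": "emails", "state": "idle", "messages": 30, "messages_unacknowledged": 0},
]

summary = summarize(per_queue)
print(summary)
```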

Queue group

Provide a comma-separated list of queue groups you want monitored.

Detailed Diagnosis

To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, choose the Off option.

The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

  • The eG manager license should allow the detailed diagnosis capability
  • Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
Measurements made by the test
Measurement Description Measurement Unit Interpretation

State

Indicates the state of this queue.

The values that this measure can take and their corresponding numeric values are as follows:

Measure Value    Numeric Value
Running          0
Idle             1

Note:

This test reports the Measure Values listed in the table above to indicate the queue status. In the graph of this measure however, the same will be represented using the numeric equivalents.

This measure is not available for the Summary descriptor.

Running queues

Indicates the current count of running queues.

Number

This measure is reported only for the Summary descriptor.

Use the detailed diagnosis of this measure to know which queues in the cluster are running.

Idle/down queues

Indicates the current count of idle/down queues.

Number

This measure is reported only for the Summary descriptor.

Use the detailed diagnosis of this measure to know which queues in the cluster are not running. Ideally, the value of this measure should be 0. If one or more queues go down frequently, you may want to consider mirroring the queues across multiple nodes in a cluster to ensure high availability of the queues.

Each mirrored queue consists of one master and one or more mirrors. The master is hosted on one node, commonly referred to as the master node. Each queue has its own master node. All operations for a given queue are first applied on the queue's master node and then propagated to mirrors. This involves enqueueing publishes, delivering messages to consumers, tracking acknowledgements from consumers, and so on.

Messages published to the queue are replicated to all mirrors. Consumers are connected to the master regardless of which node they connect to, with mirrors dropping messages that have been acknowledged at the master. Queue mirroring therefore enhances availability, but does not distribute load across nodes (all participating nodes each do all the work).

If the node that hosts the queue master fails, the oldest mirror will be promoted to the new master as long as it is synchronised. Unsynchronised mirrors can be promoted too, depending on queue mirroring parameters.
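Mirroring is typically switched on through a policy. The sketch below builds the JSON body and the management API URL for such a policy, assuming a RabbitMQ version that still supports classic queue mirroring. The `ha-mode`, `ha-params`, and `ha-sync-mode` keys are the classic mirroring policy keys; the host, policy name, queue-name pattern, and both helper functions are hypothetical.

```python
import json
from urllib.parse import quote


def build_mirroring_policy(pattern, replicas=2):
    """Policy body that mirrors matching queues onto `replicas` nodes in total."""
    return {
        "pattern": pattern,
        "apply-to": "queues",
        "definition": {
            "ha-mode": "exactly",
            "ha-params": replicas,
            "ha-sync-mode": "automatic",
        },
    }


def policy_url(host, vhost, name, port=15672):
    """URL for PUT /api/policies/{vhost}/{name}; the vhost must be percent-encoded."""
    return f"http://{host}:{port}/api/policies/{quote(vhost, safe='')}/{quote(name, safe='')}"


# Hypothetical values for illustration only.
payload = build_mirroring_policy(r"^orders\.", replicas=2)
url = policy_url("rabbit-node-1.example.com", "/", "ha-orders")
print(url)
print(json.dumps(payload, indent=2))
```

Applying the policy would mean sending `payload` as the body of an authenticated PUT to `url`, by a user that is permitted to manage policies.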

Unacknowledged messages

Indicates the number of messages in this queue that are waiting for acknowledgement.

For the Summary descriptor, this measure will report the total number of messages across all queues that are waiting for acknowledgement.

Number

A low value is desired for this measure. This is because all unacknowledged messages have to reside in RAM on the servers. If you have too many unacknowledged messages, you will run out of memory. An efficient way to limit unacknowledged messages is to limit how many messages your clients prefetch.
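A minimal sketch of limiting prefetch on the consumer side follows, assuming a pika-style channel object that exposes `basic_qos(prefetch_count=...)`. The `limit_prefetch` helper and the `FakeChannel` stand-in are hypothetical and exist only so that the example runs without a broker; the value 20 is arbitrary.

```python
def limit_prefetch(channel, prefetch_count=50):
    """Cap the number of unacknowledged deliveries the broker keeps in flight."""
    if prefetch_count < 1:
        raise ValueError("prefetch_count must be at least 1")
    channel.basic_qos(prefetch_count=prefetch_count)
    return prefetch_count


class FakeChannel:
    """Stand-in for a real client channel; records the requested prefetch."""

    def __init__(self):
        self.prefetch = None

    def basic_qos(self, prefetch_count=0):
        self.prefetch = prefetch_count


channel = FakeChannel()
limit_prefetch(channel, prefetch_count=20)
print(channel.prefetch)
```

With a real client library, the same call would be made on the consumer's channel before it starts consuming.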

To know which queue attached to which node in the cluster has the maximum number of unacknowledged messages, use the detailed diagnosis of this measure.

Messages in queue

Indicates the total number of messages currently in queue.

For the Summary descriptor, this measure indicates the total number of messages across all queues in the cluster.

Number

The value of this measure indicates the current queue length.

Ideally, the value of this measure should be small for any queue, because short queues are the fastest. When a queue is empty and has consumers ready to receive messages, a message goes straight out to the consumer as soon as the queue receives it.

Many messages in a queue can put a heavy load on RAM. When this happens, RabbitMQ will start flushing (paging out) messages to disk in order to free up RAM, and queueing speeds will deteriorate as a result.

Some common problems with long queues are as follows:

  • Small messages are embedded in the queue index
  • Long queues take a long time to synchronise between nodes
  • Starting a server that holds many messages is time-consuming
  • The RabbitMQ management interface collects and stores statistics for all queues

There are many ways by which you can limit queue size. For starters, you can limit the maximum length of a queue to a set number of messages, or a set number of bytes (the total of all message body lengths, ignoring message properties and any overheads), or both. When a maximum queue length or size is set and the maximum is reached, the default behaviour of RabbitMQ is to drop or dead-letter messages from the front of the queue (i.e., the oldest messages in the queue).
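The length limits described above are usually expressed as optional queue arguments (or as the equivalent policy keys). The sketch below assembles such an argument dictionary using the `x-max-length`, `x-max-length-bytes`, and `x-overflow` arguments; the `length_limit_arguments` helper and the chosen limits are hypothetical.

```python
ALLOWED_OVERFLOW = {"drop-head", "reject-publish", "reject-publish-dlx"}


def length_limit_arguments(max_messages=None, max_bytes=None, overflow="drop-head"):
    """Build optional queue arguments that cap queue length and/or size."""
    if overflow not in ALLOWED_OVERFLOW:
        raise ValueError(f"unsupported overflow behaviour: {overflow}")
    arguments = {}
    if max_messages is not None:
        arguments["x-max-length"] = int(max_messages)
    if max_bytes is not None:
        arguments["x-max-length-bytes"] = int(max_bytes)
    if arguments:
        # 'drop-head' corresponds to the default of discarding the oldest messages.
        arguments["x-overflow"] = overflow
    return arguments


# Hypothetical limits: at most 10,000 messages or 50 MB of message bodies.
args = length_limit_arguments(max_messages=10_000, max_bytes=50 * 1024 * 1024)
print(args)
```

A dictionary like this would be passed as the `arguments` of a queue declaration in whichever client library is in use.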

Queue size can also be limited using a Time-To-Live (TTL) extension. RabbitMQ allows you to set TTL (time to live) for both messages and queues.

When TTL is set for a queue, any message that has been in the queue for longer than the configured TTL is said to be dead. The server guarantees that dead messages will not be delivered using basic.deliver (to a consumer) or included in a basic.get-ok response (for one-off fetch operations). Further, the server will try to remove messages at or shortly after their TTL-based expiry.

A TTL can be specified on a per-message basis, by setting the expiration field in the basic AMQP class when sending a basic.publish. The value of the expiration field describes the TTL period in milliseconds. The same constraints as for x-message-ttl apply. Since the expiration field must be a string, the broker will (only) accept the string representation of the number.
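The two ways of setting a TTL can be sketched as follows: a per-queue TTL goes into the queue arguments as an integer number of milliseconds under `x-message-ttl`, while a per-message TTL goes into the `expiration` property as a string. Both helper functions and the 60-second value are hypothetical.

```python
def queue_ttl_arguments(message_ttl_ms):
    """Queue argument that applies one TTL to every message in the queue."""
    if not isinstance(message_ttl_ms, int) or message_ttl_ms < 0:
        raise ValueError("TTL must be a non-negative integer (milliseconds)")
    return {"x-message-ttl": message_ttl_ms}


def expiration_property(ttl_ms):
    """Per-message TTL: the expiration field carries the number as a string."""
    ttl_ms = int(ttl_ms)
    if ttl_ms < 0:
        raise ValueError("TTL must be non-negative")
    return str(ttl_ms)


# Hypothetical 60-second TTL expressed both ways.
queue_args = queue_ttl_arguments(60_000)
expiration = expiration_property(60_000)
print(queue_args, repr(expiration))
```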

Published message rate

Indicates the rate at which publishers are publishing messages to this queue.

For the Summary descriptor, this measure indicates the average rate at which publishers are publishing messages across queues.

Messages/Sec

Publisher confirmation rate

Indicates the rate at which the receipt of a message into this queue is confirmed to a publisher.

For the Summary descriptor, this measure indicates the average rate at which the receipt of messages across all queues is confirmed to a publisher.

Messages/Sec

A 'Publish Confirm' is an acknowledgement sent by the cluster to a publisher confirming the receipt of a message from that publisher. Publish Confirms have a performance impact, so the lower the value of this measure, the better. However, keep in mind that a Publish Confirm is required if the publisher needs at-least-once processing of messages.

Manually acknowledged message delivery rate

Indicates the rate at which this queue delivers messages to consumers that use manual acknowledgements.

For the Summary descriptor, this measure indicates the average rate at which the queues in the cluster deliver messages to consumers that use manual acknowledgements.

Messages/Sec

Messages in transit might get lost in the event of a connection failure, and such messages might need to be retransmitted. Acknowledgements let the server and clients know when to retransmit messages.

A manual acknowledgement is an 'explicit' acknowledgement that is received from the consumer. Manually sent acknowledgements can be positive or negative. Positive acknowledgements simply instruct RabbitMQ to record a message as delivered, so that it can be discarded. Negative acknowledgements with basic.reject have the same effect. The difference is primarily in the semantics: positive acknowledgements assume a message was successfully processed, while their negative counterparts suggest that a delivery was not processed but should still be deleted.

Whether positive or negative, manual acknowledgements deliver low throughput and hence should be avoided where possible. A low value is therefore desired for this measure.
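The positive and negative acknowledgement paths can be sketched with a consumer callback such as the one below. It assumes a pika-style channel exposing `basic_ack` and `basic_reject`; the `handle_delivery` helper, the `process` callable, and the fake channel and method objects are hypothetical stand-ins so that the example runs without a broker.

```python
def handle_delivery(channel, method, properties, body, process):
    """Acknowledge a delivery positively on success, negatively on failure."""
    try:
        process(body)
    except Exception:
        # Negative acknowledgement without requeueing: the broker discards
        # (or dead-letters) the message instead of redelivering it.
        channel.basic_reject(delivery_tag=method.delivery_tag, requeue=False)
        return False
    channel.basic_ack(delivery_tag=method.delivery_tag)
    return True


class FakeChannel:
    """Records acknowledgement calls instead of talking to a broker."""

    def __init__(self):
        self.calls = []

    def basic_ack(self, delivery_tag=0, multiple=False):
        self.calls.append(("ack", delivery_tag))

    def basic_reject(self, delivery_tag, requeue=True):
        self.calls.append(("reject", delivery_tag, requeue))


class FakeMethod:
    def __init__(self, delivery_tag):
        self.delivery_tag = delivery_tag


def failing_processor(body):
    raise ValueError("cannot process message")


channel = FakeChannel()
handle_delivery(channel, FakeMethod(1), None, b"ok", process=lambda body: None)
handle_delivery(channel, FakeMethod(2), None, b"bad", process=failing_processor)
print(channel.calls)
```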

Auto-acknowledged message delivery rate

Indicates the rate at which this queue delivers messages to consumers that use automatic acknowledgements.

For the Summary descriptor, this measure indicates the average rate at which the queues in the cluster deliver messages to consumers that use automatic acknowledgements.

Messages/Sec

Consumer acknowledgement rate

Indicates the rate at which messages in this queue are being acknowledged by consumers.

For the Summary descriptor, this measure indicates the average rate at which the messages across queues are being acknowledged by consumers.

Messages/Sec

If the Delivery rate of messages requiring acknowledgement measure registers an abnormally low value, you may want to check the value of this measure at around the same time to determine whether a delay by consumers in acknowledging the messages caused the delivery delay.

Message redelivery rate

Indicates the rate at which this queue delivers messages with the 'redelivered' flag set.

For the Summary descriptor, this measure reports the average rate at which the queues in the cluster deliver messages with the 'redelivered' flag set.

Messages/Sec

Delivery rate of messages requiring acknowledgement

Indicates the rate at which this queue delivers messages requiring acknowledgement in response to basic.get.

For the Summary descriptor, this measure reports the average rate at which the queues in the cluster deliver messages requiring acknowledgement in response to basic.get.

Messages/Sec

Compare the value of this measure with that of the Delivery rate of messages not requiring acknowledgement measure to figure out what type of messages are being delivered much slower than the rest.

Delivery rate of messages not requiring acknowledgement

Indicates the rate at which messages not requiring acknowledgement are being delivered by this queue in response to basic.get.

For the Summary descriptor, this measure reports the average rate at which the queues in the cluster deliver messages not requiring acknowledgement in response to basic.get.

Messages/Sec

Compare the value of this measure with that of the Delivery rate of messages requiring acknowledgement measure to figure out what type of messages are being delivered much slower than the rest.

Message return rate

Indicates the rate at which this queue sent unroutable messages with the 'mandatory' flag set to 'true', to publishers.

For the Summary descriptor, this measure reports the average rate at which the queues in the cluster sent unroutable messages with the 'mandatory' flag set to 'true', to publishers.

Messages/Sec

An unroutable message is a message without a destination, for example, a message sent to an exchange without any bound queue.

If the 'mandatory' flag is set to 'true', then the cluster returns an unroutable message to the producer with a `basic.return` AMQP method.

Disk read rate

Indicates the rate at which this queue reads messages from disk.

For the Summary descriptor, this measure reports the average rate at which the queues in the cluster read messages from disk.

Messages/Sec

A high value could indicate that messages are frequently read from the disk and not from the RAM. This could be owing to high memory pressure, which may have forced RabbitMQ to move messages from RAM to disk.

Disk write rate

Indicates the rate at which this queue wrote messages to disk.

For the Summary descriptor, this measure reports the average rate at which the queues in the cluster wrote messages to disk.

Messages/Sec

A high value for this measure could indicate any of the following:

  • Many messages have been published in such a way that they must be written to disk
  • Very high memory pressure on RabbitMQ has caused the cluster to move the majority of messages from RAM to disk
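The rate measures in this table resemble the per-second rates that the management API exposes under a queue's `message_stats` object. The sketch below pulls those rates out of one queue record. The pairing of each measure name with a `message_stats` field is an assumption made for illustration, not a documented mapping of this test, and the sample record is made up.

```python
# Assumed pairing of measure names with management API message_stats fields.
RATE_FIELDS = {
    "Published message rate": "publish_details",
    "Publisher confirmation rate": "confirm_details",
    "Manually acknowledged message delivery rate": "deliver_details",
    "Auto-acknowledged message delivery rate": "deliver_no_ack_details",
    "Consumer acknowledgement rate": "ack_details",
    "Message redelivery rate": "redeliver_details",
    "Delivery rate of messages requiring acknowledgement": "get_details",
    "Delivery rate of messages not requiring acknowledgement": "get_no_ack_details",
    "Message return rate": "return_unroutable_details",
    "Disk read rate": "disk_reads_details",
    "Disk write rate": "disk_writes_details",
}


def extract_rates(queue):
    """Return messages/sec per measure; rates missing from the record become 0.0."""
    stats = queue.get("message_stats", {})
    return {
        measure: float(stats.get(field, {}).get("rate", 0.0))
        for measure, field in RATE_FIELDS.items()
    }


# Made-up queue record for illustration only.
sample_queue = {
    "name": "orders",
    "message_stats": {
        "publish_details": {"rate": 12.5},
        "ack_details": {"rate": 11.0},
        "disk_writes_details": {"rate": 3.2},
    },
}

rates = extract_rates(sample_queue)
print(rates["Published message rate"], rates["Disk read rate"])
```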

Queues

Indicates the total number of queues currently in the cluster.

Number

This measure is reported only for the Summary descriptor.

Use the detailed diagnosis of the measure to know which are the queues in the target cluster.

Used memory

Indicates the total memory used by this queue.

For the Summary descriptor, this measure reports the total memory used by all queues in the cluster.

MB

Ideally, the value of this measure should be low. A consistent increase in this value could indicate excessive memory usage, and could hint at potential memory contention. The key factor impacting memory usage is message queue length. Time-correlate the changes in the Messages in queue measure with those in the Used memory measure to figure out whether the queue length is indeed increasing the memory pressure. If so, you may want to initiate measures to limit the queue size.
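One simple way to time-correlate the two measures is to compute the Pearson correlation coefficient over samples taken at the same instants. The sketch below does this with made-up sample values; the `pearson` helper is hypothetical, and a coefficient close to 1 would only suggest, not prove, that queue length is driving memory usage.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation information
    return cov / math.sqrt(var_x * var_y)


# Made-up samples of 'Messages in queue' and 'Used memory' (MB) at the same times.
messages_in_queue = [100, 200, 300, 400, 500]
used_memory_mb = [12.0, 22.0, 32.0, 42.0, 52.0]

r = pearson(messages_in_queue, used_memory_mb)
print(round(r, 3))
```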

To know which queue attached to which node in the cluster has the maximum number of unacknowledged messages, use the detailed diagnosis of the Unacknowledged messages measure. The state of the queue is also revealed as part of detailed diagnostics. If your cluster is running out of memory, you can use the detailed statistics to identify which queue attached to which node is draining the memory by holding many unacknowledged messages.

Figure 1 : The detailed diagnosis of the Unacknowledged messages measure

To know which queues returned the maximum number of messages to publishers, use the detailed diagnosis of the Message return rate measure.

Figure 2 : The detailed diagnosis of the Message return rate measure

Using the detailed diagnosis of the Disk read rate and Disk write rate measures, you can easily identify the exact queue that is performing many reads from the disk and/or numerous writes to the disk, and which node that queue is attached to. Excessive read-write activity on disk is a sign of memory contention in the cluster. Under such circumstances, you can use the detailed diagnostics to identify the node and the queue that could be contributing to the contention.

Figure 3 : The detailed diagnosis of the Disk read rate measure

Figure 4 : The detailed diagnosis of the Disk write rate measure