Mongo Replication Throughput Test

Replication provides redundancy and increases data availability. With multiple copies of data on different database servers, replication provides a level of fault tolerance against the loss of a single database server.

A replica set is a group of mongod instances that maintain the same data set. A replica set contains several data-bearing nodes and, optionally, one arbiter node. Of the data-bearing nodes, one and only one member is deemed the primary node, while the other nodes are deemed secondary nodes. The primary node receives all write operations; a replica set can have only one primary capable of confirming writes with { w: "majority" } write concern. The primary records all changes to its data sets in its operation log, i.e. the oplog. The oplog is a fixed-size (capped) collection that keeps track of all the write operations. Secondary members replicate this log and apply the operations to their own data sets.
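For illustration, the sketch below, which assumes Python with the pymongo driver and hypothetical host names and replica set name, lists the members of a replica set together with the role each one currently holds, using the replSetGetStatus command.

    from pymongo import MongoClient

    # Hypothetical hosts and replica set name; adjust to your deployment.
    client = MongoClient("mongodb://node1:27017,node2:27017,node3:27017/?replicaSet=rs0")

    # replSetGetStatus reports every member of the replica set along with its
    # current state (PRIMARY, SECONDARY, ARBITER, ...).
    status = client.admin.command("replSetGetStatus")
    for member in status["members"]:
        print(f'{member["name"]}: {member["stateStr"]}')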

If a secondary is unable to apply changes as fast as they are written to the primary's oplog, those changes will be lost if the primary crashes. Similarly, if the oplog is not sized appropriately, it will not be able to hold enough changes, again causing significant data loss in the event of a primary failure. This is why administrators should constantly measure the level of activity on a replica set, check whether the oplog is sized for this workload, and ensure that there is little to no time lag in data replication between the primary and the secondaries of the replica set. This can be achieved using the Mongo Replication Throughput test.

This test tracks the operations performed on the replica set and, in the process, reveals the load on the replica set. The test also tracks the usage of the oplog and alerts administrators if the oplog is about to run out of space for recording changes. Additionally, the test keeps an eye out for long gaps between when a change is recorded in the primary's oplog and when it is actually applied on the secondary, and promptly notifies administrators of the same. This way, the test brings inconsistencies in data replication to the immediate attention of administrators and averts the data loss that might occur if the primary crashes.

Target of the test : A MongoDB server

Agent deploying the test : An internal/remote agent

Outputs of the test : One set of results for the Mongo database server being monitored.

Configurable parameters for the test
Parameter Description

Test period

How often should the test be executed?

Host

The host for which the test is to be configured.

Port

The port number at which the specified host listens.

Database Name

The test connects to a specific Mongo database to run API commands and pull metrics of interest. Specify the name of this database here. The default value of this parameter is admin.

Username and Password

If the target MongoDB instance is access control enabled, the eG agent has to be configured with the credentials of a user who has the privileges required to monitor that instance. To know how to create such a user, refer to How to monitor access control enabled MongoDB database?. If the target MongoDB instance is not access control enabled, then specify none against the Username and Password parameters.
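As an illustration only (the linked article remains the authoritative reference for the privileges the eG agent needs), the built-in clusterMonitor role is typically sufficient for the status commands this test relies on. The sketch below, assuming Python with the pymongo driver and hypothetical administrator credentials, shows one way such a monitoring user could be created.

    from pymongo import MongoClient

    # Connect as an existing administrative user (hypothetical credentials).
    client = MongoClient("mongodb://admin:adminpassword@mongodb-host:27017/?authSource=admin")

    # Create a dedicated monitoring user with the built-in clusterMonitor role,
    # which grants read-only access to monitoring commands such as
    # serverStatus and replSetGetStatus.
    client.admin.command(
        "createUser",
        "eg_monitor",
        pwd="a-strong-password",
        roles=[{"role": "clusterMonitor", "db": "admin"}],
    )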

Confirm Password

Confirm the password by retyping it here.

Authentication Mechanism

MongoDB typically supports multiple authentication mechanisms that users can use to verify their identity. In environments where multiple authentication mechanisms are used, this list box enables you to select the authentication mechanism of interest. By default, this is set to None. However, you can modify this setting as per your requirement.
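For example, if SCRAM-SHA-256 were the mechanism in use, a connection using it could be sketched as follows (Python with pymongo, hypothetical credentials; the mechanism chosen must match what your MongoDB deployment is configured to accept).

    from pymongo import MongoClient

    # The authMechanism option selects how the supplied credentials are verified;
    # SCRAM-SHA-256 is used here purely as an example.
    client = MongoClient(
        "mongodb://mongodb-host:27017/",
        username="eg_monitor",
        password="a-strong-password",
        authSource="admin",
        authMechanism="SCRAM-SHA-256",
    )
    print(client.admin.command("ping"))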

SSL

By default, the SSL flag is set to No, indicating that the target MongoDB server is not SSL-enabled. To enable the test to connect to an SSL-enabled MongoDB server, set the SSL flag to Yes.

CA File

A certificate authority (CA) file contains root and intermediate certificates that are electronically signed to affirm that a public key belongs to the owner named in the certificate. If you are looking to monitor the certificates contained within a CA file, then provide the full path to this file in the CA File text box. For example, the location of this file may be: C:\cert\rootCA.pem. If you do not want to monitor the certificates in a CA file, set this parameter to none.

Certificate Key File

A Certificate Key File specifies the path on the server where your private key is stored. If you are looking to monitor the Certificate Key File, then provide the full path to this file in the Certificate Key File text box. For example, the location of this file may be: C:\cert\mongodb.pem. If you do not want to monitor the Certificate Key File, set this parameter to none.
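To illustrate how the SSL, CA File and Certificate Key File settings relate, the sketch below (Python with pymongo, reusing the hypothetical file paths from the examples above) opens a TLS connection that validates the server against the CA file and presents the certificate from the certificate key file.

    from pymongo import MongoClient

    # tls=True corresponds to setting the SSL flag to Yes; the CA file and
    # certificate key file paths mirror the example locations given above.
    client = MongoClient(
        "mongodb://mongodb-host:27017/",
        tls=True,
        tlsCAFile=r"C:\cert\rootCA.pem",
        tlsCertificateKeyFile=r"C:\cert\mongodb.pem",
    )
    print(client.admin.command("ping"))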

Measurements made by the test
Measurement Description Measurement Unit Interpretation

Insert operations

Indicates the rate at which replicated insert operations are performed on the target server.

Inserts/Sec

A consistent increase in the value of these measures could indicate a high level of activity on the replica set. A sketch showing how such per-second rates can be derived from MongoDB's own counters appears after the Command operations measure below.

Query operations

Indicates the rate at which replicated query operations are performed on the target server.

Queries/Sec

Update operations

Indicates the rate at which replicated update operations are performed on the target server.

Updates/Sec

Delete operations

Indicates the rate at which replicated delete operations are performed on the target server.

Deletes/Sec

Get more operations

Indicates the rate at which replicated get more operations are performed on the target server.

Getmores/Sec

The value of this measure can be high even if the query count is low. Secondary nodes send getMore operations as part of the replication process.

Command operations

Indicates the rate at which replicated commands are issued to the target server.

Commands/Sec
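These per-second figures correspond to the replicated-operation counters that MongoDB itself exposes (the opcountersRepl section of the serverStatus output). The sketch below, assuming Python with pymongo and a hypothetical connection string, shows how such rates could be derived by sampling those cumulative counters twice; it is an independent illustration, not the test's actual implementation.

    import time
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongodb-host:27017/")  # hypothetical connection string

    def replicated_opcounters(client):
        # serverStatus exposes cumulative counts of replicated operations,
        # broken down by operation type, under "opcountersRepl".
        return client.admin.command("serverStatus")["opcountersRepl"]

    INTERVAL = 60  # seconds between the two samples
    first = replicated_opcounters(client)
    time.sleep(INTERVAL)
    second = replicated_opcounters(client)

    for op in ("insert", "query", "update", "delete", "getmore", "command"):
        rate = (second[op] - first[op]) / INTERVAL
        print(f"{op}: {rate:.2f}/sec")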

 

Replication lag

Indicates how far a secondary is behind a primary.

Secs

Ideally, the value of this measure should be 0. If it is very high, then the integrity of your data set might be compromised in case of failover (secondary member taking over as the new primary because the current primary is unavailable). A high value also implies that write operations are not immediately propagated to secondaries; in this case, related changes might be lost if the primary fails.

A high replication lag can be due to any of the following (a sketch for measuring the lag per secondary follows this list):

  • A networking issue between the primary and secondary, making nodes unreachable
  • A secondary node applying data more slowly than the primary writes it
  • Insufficient write capacity, in which case you should add more shards
  • Slow operations on the primary node blocking replication
  • Heavy write operations on the primary node, or an under-provisioned secondary; the latter can be prevented by scaling up the secondary to match the primary's capacity
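The documentation does not state how the test derives this value, but MongoDB reports, for every replica set member, the timestamp of the last oplog entry it has applied (optimeDate in replSetGetStatus). The sketch below, assuming Python with pymongo and a hypothetical connection string, estimates each secondary's lag as the difference between its optimeDate and the primary's.

    from pymongo import MongoClient

    client = MongoClient("mongodb://mongodb-host:27017/")  # hypothetical connection string

    status = client.admin.command("replSetGetStatus")
    primary = next(m for m in status["members"] if m["stateStr"] == "PRIMARY")

    for member in status["members"]:
        if member["stateStr"] == "SECONDARY":
            # optimeDate is the wall-clock time of the last applied oplog entry.
            lag = (primary["optimeDate"] - member["optimeDate"]).total_seconds()
            print(f'{member["name"]}: {lag:.0f} seconds behind the primary')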

Oplog window

Indicates the interval of time between the oldest and the latest entries in the oplog.

Secs

If a secondary is down for longer than this oplog window, it will not be able to catch up unless it completely resyncs all data from the primary. The amount of time it takes to fill the oplog varies: during heavy traffic periods, the window will shrink, since the oplog receives more operations per second. If the oplog window for a primary node is getting too short, you should consider increasing the size of your oplog.
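The oplog window can also be read directly from the oplog itself, which lives in the capped collection local.oplog.rs. The sketch below (Python with pymongo, hypothetical connection string) takes the difference between the timestamps of the newest and oldest oplog entries.

    from pymongo import MongoClient

    client = MongoClient("mongodb://mongodb-host:27017/")  # hypothetical connection string

    # Every oplog entry carries a "ts" timestamp; the window is the span between
    # the oldest and newest entries in local.oplog.rs.
    oplog = client.local["oplog.rs"]
    oldest = oplog.find_one(sort=[("$natural", 1)])
    newest = oplog.find_one(sort=[("$natural", -1)])

    window_secs = newest["ts"].time - oldest["ts"].time
    print(f"Oplog window: {window_secs} seconds ({window_secs / 3600:.1f} hours)")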

Replication head room

Indicates the time difference between the primary’s oplog window and the replication lag of the secondary.

Secs

If the replication headroom is rapidly shrinking and is about to become negative, the replication lag is getting higher than the oplog window. In that case, write operations recorded in the oplog will be overwritten before secondary nodes have time to replicate them, and MongoDB will constantly have to resync the entire data set on this secondary, which takes much longer than just fetching new changes from the oplog. Properly monitoring and alerting on the Replication lag and Oplog window measures should allow you to prevent this.
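In other words, the headroom is simply the oplog window minus the replication lag. A minimal sketch, reusing window and lag values computed as in the two previous sketches (the argument names and the one-hour threshold are assumptions here), could flag a shrinking headroom like this:

    def replication_headroom(oplog_window_secs, worst_lag_secs, warn_below_secs=3600):
        # Headroom = oplog window - replication lag of the slowest secondary.
        # A negative value means oplog entries are being overwritten before
        # that secondary has had a chance to replicate them.
        headroom = oplog_window_secs - worst_lag_secs
        if headroom < warn_below_secs:
            print(f"WARNING: replication headroom down to {headroom:.0f} seconds")
        return headroom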

Total oplog size

Indicates the amount of space allocated to the oplog.

MB

 

Used oplog size

Indicates the amount of space currently used by operations stored in the oplog.

MB

If this value grows closer to the Total oplog size, it is a cause for concern, because it implies that the oplog may soon not have enough space to record any more changes. To avoid this, you may want to consider resizing your oplog. Before that, check the level of activity on the replica set and figure out the ideal size setting for the oplog, so that it is able to capture all the changes that occur on the replica set. A sketch that checks oplog usage and resizes the oplog appears after the workload list below.

Before mongod creates an oplog, you can specify its size with the oplogSizeMB option. If you can predict your replica set’s workload to resemble one of the following patterns, then you might want to create an oplog that is larger than the default. Conversely, if your application predominantly performs reads with a minimal amount of write operations, a smaller oplog may be sufficient.

The following workloads might require a larger oplog size.

  • Updates to multiple documents at once: the oplog must translate multi-updates into individual operations in order to maintain idempotency. This can use a great deal of oplog space without a corresponding increase in data size or disk use.
  • Deletions of roughly the same amount of data as is inserted: the database will not grow significantly in disk use, but the size of the operation log can be quite large.
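As a rough illustration of both points, the sketch below (Python with pymongo, hypothetical connection details) reads the used and total oplog sizes from collStats on local.oplog.rs and, if usage is high, resizes the oplog with the replSetResizeOplog admin command, which recent MongoDB releases support on a running member. The 80% threshold and the doubled size are arbitrary example values; the oplogSizeMB option mentioned above applies only when the oplog is first created.

    from pymongo import MongoClient

    client = MongoClient("mongodb://mongodb-host:27017/")  # hypothetical connection string

    # collStats on local.oplog.rs reports the space used by oplog entries ("size")
    # and the cap on the collection ("maxSize"), both in bytes.
    stats = client.local.command("collStats", "oplog.rs")
    used_mb = stats["size"] / (1024 * 1024)
    total_mb = stats["maxSize"] / (1024 * 1024)
    print(f"Oplog usage: {used_mb:.0f} MB of {total_mb:.0f} MB")

    if used_mb / total_mb > 0.8:  # arbitrary example threshold
        # replSetResizeOplog changes the oplog size (in MB) of the member the
        # client is connected to; run it on each member whose oplog should grow.
        client.admin.command("replSetResizeOplog", 1, size=total_mb * 2)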