Mailbox Databases Replication Test

To protect Exchange Server 2013/2016 mailbox databases and the data they contain, Mailbox servers and databases can be configured for high availability and site resilience. A DAG (Database Availability Group) is the base component of the high availability and site resilience framework built into Exchange 2013/2016. A DAG is a group of up to 16 Mailbox servers that hosts a set of databases and provides automatic, database-level recovery from failures that affect individual databases, networks, or servers. Once a DAG is created, administrators can create up to 16 copies of an Exchange 2013/2016 mailbox database on multiple Mailbox servers within this DAG. One of these database copies is the active copy, while the others are passive copies. Each DAG member hosting a copy of a given mailbox database participates in a process of continuous replication to keep the copies consistent. Database replication occurs between Exchange Server 2013/2016 DAG members using two different methods:

File mode replication - Each transaction log is fully written (a 1MB log file) and then copied from the DAG member hosting the active database copy to each DAG member that hosts a passive copy of that database.

The other DAG members then replay the transaction log file into their own passive copy of the database to update it.

Block mode replication - In this case, each database transaction is written to the log buffer on the active server and also sent to the log buffer of each DAG member hosting a passive copy of the database. As the log buffer becomes full, each of these DAG members builds its own transaction log file for replay into its passive database copy.

Latencies in replication can cause the active and passive mailbox database copies to go out of sync, resulting in inconsistencies in mailbox data in the event of a failure. To avert such anomalies, Exchange administrators should keep a close watch on database replication activity, spot potential delays in replication, identify where the replication process is stalling, and clear the bottleneck quickly, so that no data is lost when a server or database fails. To achieve this, administrators can use the Mailbox Databases Replication test. This test auto-discovers the mailbox databases on the Mailbox server and, for each database, reports the replication mode, the number of log files that are pending copying, inspection, and replay, and the health of the database copies and content index. This way, the test captures replication bottlenecks, the stage at which the bottleneck occurs (copying, inspection, or replay), and any abnormal state of the database copies. In addition, the test reports how each database uses the disk space on the Mailbox server, thus pinpointing databases that consume too much space and could hence be candidates for migration to other servers in the DAG. In the process, the test turns the spotlight on a potential space crunch on the server that could cause replication to fail.

This test is disabled by default. To enable the test, go to the enable / disable tests page using the menu sequence: Agents -> Tests -> Enable/Disable, pick Microsoft Exchange 2013/2016 as the Component type, set Performance as the Test type, choose the test from the disabled tests list, and click the >> button to move the test to the ENABLED TESTS list. Finally, click the Update button.

Target of the test : A Microsoft Exchange 2013/2016 server

Agent deploying the test : An internal/remote agent

Outputs of the test : One set of results for each ESE database on the Exchange 2013/2016 server being monitored

Configurable parameters for the test
Test period - How often should the test be executed.

Host - The host for which the test is to be configured.

port - The port at which the host listens.

xchgextensionshellpath - The Exchange Management Shell is a command-line management interface, built on Windows PowerShell, which enables you to administer every part of Microsoft Exchange. This test uses the Exchange Management Shell to run scripts and collect the desired performance metrics from the Exchange server. By default, the test auto-discovers the location of the Exchange Management Shell and automatically loads the Exchange Management Shell snap-in (exshell.psc1) for script execution. This is why the xchgextensionshellpath is set to none by default.

Detailed Diagnosis - To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option.

The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

The eG manager license should allow the detailed diagnosis capability;
Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.

Measurements made by the test

Measurement

Description

Measurement Unit

Interpretation

Continuous replication:

Indicates the mode of continuous replication in which this database is currently running.

Number

This measure reports the value 1 if the database is running in continuous replication - block mode.

This measure reports the value 0 if the database is running in continuous replication - file mode.

Generation of the last log file number:

Indicates the log generation number of the last log file which has been copied to this database.

Number

When a log file reaches a certain size, it's renamed with the next generation sequence number, and a new log file is created. This measure reports the sequence number of the last log file that was copied to a database for inspection and replay.

Generation of last log file copied notification:

Indicates the log generation number of the last log file copied to this database, which the copier knows about.

Number

Compare the value of this measure with that of the Generation of the last log file number measure to check for discrepancies. A discrepancy, if one exists, could indicate a problem in copying. Further investigation may be required to determine the reason for this anomaly.

Copy queue length:

Indicates the number of log generations for this database that are waiting to be both copied and inspected successfully.

Number

A high value for this measure could indicate a delay in copying, and may warrant an investigation.

Generation of last log file inspected:

Indicates the log generation number of the last log file related to this database that was inspected successfully.

Number

By comparing the value of this measure with that of the Generation of last log file copied notification measure, you can determine whether the sequence number of the last log file inspected is far behind that of the last log file copied. If so, inspection could be taking too long.

Log copy rate:

Indicates the amount of logged data related to this database that was copied per second.

KB/Sec

A consistent drop in this value could indicate that log files are being copied slowly. This in turn can impact how quickly database replication is carried out.
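One way to check for such a consistent drop is to compare successive samples of the measure. The following is a minimal sketch, not part of eG Enterprise: the function name `is_consistent_drop` and the `min_samples` threshold are hypothetical, and the samples are assumed to be Log copy rate values in KB/Sec collected at successive test periods.

```python
from typing import Sequence


def is_consistent_drop(samples_kb_per_sec: Sequence[float], min_samples: int = 3) -> bool:
    """Return True when every sample is lower than the one before it.

    min_samples is an assumed guard so that one or two readings
    are never treated as a trend.
    """
    if len(samples_kb_per_sec) < min_samples:
        return False
    pairs = zip(samples_kb_per_sec, samples_kb_per_sec[1:])
    return all(later < earlier for earlier, later in pairs)


# Hypothetical readings taken over four consecutive test periods.
print(is_consistent_drop([900.0, 640.0, 410.0, 120.0]))
```

A strict "every sample lower than the last" rule is deliberately simple; a production check would more likely tolerate small fluctuations.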

Is log copy falling behind?:

Indicates whether or not log copying and inspection are able to keep up with log generation for this database.

If log copying and inspection are lagging behind, then this measure will return the value True. If copying and inspection are able to keep up with log generation, this measure will return the value False.

The numeric values that correspond to the above-mentioned measure values are as follows:

Measure Value	Numeric Value
True	1
False	0

If this measure reports the value True, it is a cause for concern as it could indicate one of the following:

Log generation rate is high;
Log copying and inspection is very slow.

Log generation rates can increase owing to:

Corruption of the database copy;
The presence of a number of large messages in the database;
Many mailbox moves.

Note:

By default, this measure reports one of the Measure Values listed in the table above. In the graph of this measure however, only the corresponding numeric values are plotted.

Is log replay falling behind:

Indicates whether or not log replay is able to keep up with log copying and inspection for this database.

If log replay is lagging behind, then this measure will return the value True. If replay is able to keep up with log copying and inspection, this measure will return the value False.

The numeric values that correspond to the above-mentioned measure values are as follows:

Measure Value	Numeric Value
True	1
False	0

If this measure reports the value True, it is an indicator that log replay is too slow. One of the reasons for this is the configuration of a high replay lag time. Replay lag time is a property of a mailbox database copy that specifies the amount of time, in minutes, to delay log replay for the database copy. If a high value is set for this property, then a delay in log replay becomes inevitable. To speed up log replay, reduce the value of this property.

Note:

By default, this measure reports one of the Measure Values listed in the table above. In the graph of this measure however, only the corresponding numeric values are plotted.
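The numeric values of the two falling-behind measures can be read together to see which stage of replication is lagging. The sketch below is illustrative only: the function name `lagging_stages` is hypothetical, and it assumes the numeric values shown in the tables above (1 for True, 0 for False).

```python
from typing import List


def lagging_stages(copy_falling_behind: int, replay_falling_behind: int) -> List[str]:
    """List the replication stages that are lagging for one database.

    Each argument is the numeric value of the corresponding measure:
    1 means True (lagging), 0 means False (keeping up).
    """
    stages = []
    if copy_falling_behind == 1:
        stages.append("copy/inspection")
    if replay_falling_behind == 1:
        stages.append("replay")
    return stages


# Copying keeps up, but replay does not.
print(lagging_stages(0, 1))  # ['replay']
```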

Generation of last log file replayed:

Indicates the log generation number of the last log file related to this database that was replayed.

Number

Comparing the value of this measure with that of the Generation of last log file inspected measure sheds light on the gap between the inspected and the replayed logs, thus revealing how many logs are pending replay. This way, administrators can figure out if the replication process is spending too much time on log replay.
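The comparisons suggested for the four generation measures can be combined into one calculation. This is a rough sketch under stated assumptions: the class `LogGenerations` and the functions `pending_counts` and `slowest_stage` are hypothetical names, and the gap between two generation numbers is taken as the count of logs pending at that stage.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LogGenerations:
    """Generation numbers reported by the test for one database."""
    last_log: int             # Generation of the last log file number
    copied_notification: int  # Generation of last log file copied notification
    inspected: int            # Generation of last log file inspected
    replayed: int             # Generation of last log file replayed


def pending_counts(g: LogGenerations) -> Dict[str, int]:
    """Return the gap at each stage; negative gaps are clamped to 0."""
    return {
        "copy": max(g.last_log - g.copied_notification, 0),
        "inspection": max(g.copied_notification - g.inspected, 0),
        "replay": max(g.inspected - g.replayed, 0),
    }


def slowest_stage(g: LogGenerations) -> Optional[str]:
    """Name the stage with the largest gap, or None if nothing is pending."""
    counts = pending_counts(g)
    stage = max(counts, key=counts.get)
    return stage if counts[stage] > 0 else None


# Hypothetical sample values for one database.
sample = LogGenerations(last_log=1050, copied_notification=1048, inspected=1047, replayed=1020)
print(pending_counts(sample))  # {'copy': 2, 'inspection': 1, 'replay': 27}
print(slowest_stage(sample))   # replay
```

Here most of the backlog sits between inspection and replay, which would point the investigation at log replay rather than at copying.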

Replay lag:

Indicates the percentage of actual lag in replay of the log files related to this database, relative to the configured lag.

Percent

This measure is a good indicator of how much of the configured replay lag a database copy is actually realizing. Replay lag time is a property of a mailbox database copy that specifies the amount of time, in minutes, to delay log replay for the database copy.

A high value for this measure is a cause for concern as it indicates that log file replaying is delayed beyond the permitted limit. This implies that a high replay lag configuration is not the reason for replaying to slow down. The real reasons for the delay should hence be investigated and determined.
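As a worked illustration of a lag expressed relative to the configured lag, the sketch below divides the actual lag by the configured lag. How the test itself computes the measure is not documented here, so the formula, the function name `replay_lag_percent`, and the use of minutes for both inputs are assumptions.

```python
def replay_lag_percent(actual_lag_minutes: float, configured_lag_minutes: float) -> float:
    """Express the actual replay lag as a percentage of the configured lag.

    Assumed formula: actual / configured * 100. A result above 100
    means replay is delayed beyond the configured replay lag time.
    """
    if configured_lag_minutes <= 0:
        raise ValueError("configured replay lag must be greater than 0")
    return actual_lag_minutes / configured_lag_minutes * 100.0


# A copy configured with a 60-minute lag that is actually 90 minutes behind.
print(replay_lag_percent(90, 60))  # 150.0
```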

Replay queue length:

Indicates the number of log generations pertaining to this database that are waiting to be replayed.

Number

A high value for this measure could indicate a delay in replaying and may warrant an investigation.

Status:

Indicates the health and status of this database copy.

The values that this measure can take and their corresponding numeric values are as follows:

Measure Value	Numeric Value
ServiceDown	0
Suspended	1
ActivationSuspended	2
Failed	3
FailedAndSuspended	4
DisconnectedAndHealthy	5
DisconnectedAndResynchronizing	6
Dismounted	7
Dismounting	8
Resynchronizing	9
SinglePageRestore	10
Seeding	11
SeedingSource	12
Initializing	13
Mounting	14
Mounted	15
Healthy	16

Note:

By default, this measure reports one of the Measure Values listed in the table above. In the graph of this measure however, the database status is indicated by the corresponding numeric equivalents only.
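When reading the graph, the plotted numbers have to be translated back into status names. The lookup below mirrors the table above. The helper names `describe_status` and `needs_attention` are hypothetical, and treating only Mounted and Healthy as normal states is an assumption of this sketch, not something the test defines.

```python
# Numeric value -> Measure Value, copied from the Status table above.
COPY_STATUS = {
    0: "ServiceDown", 1: "Suspended", 2: "ActivationSuspended", 3: "Failed",
    4: "FailedAndSuspended", 5: "DisconnectedAndHealthy",
    6: "DisconnectedAndResynchronizing", 7: "Dismounted", 8: "Dismounting",
    9: "Resynchronizing", 10: "SinglePageRestore", 11: "Seeding",
    12: "SeedingSource", 13: "Initializing", 14: "Mounting", 15: "Mounted",
    16: "Healthy",
}

# Assumption: an active copy is normally Mounted and a passive copy Healthy.
NORMAL_STATES = {"Mounted", "Healthy"}


def describe_status(value: int) -> str:
    """Translate a plotted numeric value into its status name."""
    return COPY_STATUS.get(value, "Unknown")


def needs_attention(value: int) -> bool:
    """Return True for any status outside the assumed normal states."""
    return describe_status(value) not in NORMAL_STATES


print(describe_status(4), needs_attention(4))    # FailedAndSuspended True
print(describe_status(16), needs_attention(16))  # Healthy False
```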

Content index state:

Indicates the current state of the content index of this database.

The values that this measure can take and their corresponding numeric values are as follows:

Measure Value	Numeric Value
ServiceDown	0
Suspended	1
ActivationSuspended	2
Failed	3
FailedAndSuspended	4
DisconnectedAndHealthy	5
DisconnectedAndResynchronizing	6
Dismounted	7
Dismounting	8
Resynchronizing	9
SinglePageRestore	10
Seeding	11
SeedingSource	12
Initializing	13
Mounting	14
Mounted	15
Healthy	16

Note:

By default, this measure reports one of the Measure Values listed in the table above. In the graph of this measure however, the content index state is indicated by the corresponding numeric equivalents only.

Percentage of disk free space:

Indicates the percentage of free space in this mailbox database.

Percent

A high value is desired for this measure. If this value drops dangerously close to 0, it indicates depletion of disk space. Compare the value of this measure across databases to know which database does not have enough free space. You may want to allocate more space to this database.

Disk free space:

Indicates the amount of space unused in this database.

Ideally, the value of this measure should be high. A steady decrease in this value indicates depletion of disk space. Compare the value of this measure across databases to know which database does not have enough free space. You may want to allocate more space to this database.

Disk total space:

Indicates the total disk capacity of this database.
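The three disk space measures are related: the percentage of free space follows from the free and total values. The sketch below shows that arithmetic, assuming both values are reported in the same unit; the function name `disk_free_percent` is hypothetical.

```python
def disk_free_percent(free_space: float, total_space: float) -> float:
    """Compute free space as a percentage of total space.

    Both arguments must be in the same unit (assumed here, since the
    unit is whatever the test reports for these measures).
    """
    if total_space <= 0:
        raise ValueError("total space must be greater than 0")
    return free_space / total_space * 100.0


# Hypothetical values: 50 units free out of 200.
print(disk_free_percent(50, 200))  # 25.0
```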