Exchange DAG Health Summary Test

A database availability group (DAG) is the base component of the Mailbox server high availability and site resilience framework built into Microsoft Exchange Server 2013/2016. A DAG is a group of up to 16 Mailbox servers that hosts a set of databases and provides automatic database-level recovery from failures that affect individual servers or databases.

After the DAG is created, Mailbox servers can be added to the DAG. When the first server is added to the DAG, a cluster is formed for use by the DAG. DAGs make use of Windows failover clustering technology, such as the cluster heartbeat, cluster networks, and the cluster database (for storing data that changes, such as database state changes from active to passive or vice versa, or from mounted to dismounted and vice versa). As each subsequent server is added to the DAG, it's joined to the underlying cluster, the cluster's quorum model is automatically adjusted by Exchange, and the server is added to the DAG object in Active Directory.

After Mailbox servers are added to a DAG, you can configure a variety of DAG properties, such as whether to use network encryption or network compression for database replication within the DAG. You can also configure DAG networks and create additional DAG networks.

After you add members to a DAG and configure the DAG, the active mailbox databases on each server can be replicated to the other DAG members. Each DAG member hosting a copy of a given mailbox database participates in a process of continuous replication to keep the copies consistent. Database replication occurs between Exchange Server 2013/2016 DAG members using two different methods:

  • File mode replication – each transaction log is fully written (a 1 MB log file) and then copied from the DAG member hosting the active database copy to each DAG member that hosts a passive copy of that database. The other DAG members then replay the transaction log file into their own passive copy of the database to update it.
  • Block mode replication – each database transaction is written to the log buffer on the active server and is also sent to the log buffer of each DAG member hosting a passive copy of the database. When the log buffer becomes full, each of those DAG members builds its own transaction log file for replay into its passive database copy.
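
The difference between the two modes can be illustrated with a small simulation. This is only a sketch under stated assumptions, not how Exchange implements replication: the PassiveCopy class, both function names, and the use of plain byte strings as transactions are hypothetical, and "replay" is reduced to appending a finished log file to a list.

```python
from dataclasses import dataclass, field

LOG_FILE_SIZE = 1024 * 1024  # 1 MB log file, as described above


@dataclass
class PassiveCopy:
    """Hypothetical stand-in for a DAG member hosting a passive copy."""
    buffer: bytearray = field(default_factory=bytearray)
    replayed_logs: list = field(default_factory=list)

    def receive_log_file(self, log_file: bytes) -> None:
        # File mode: a complete log file arrives and is replayed.
        self.replayed_logs.append(log_file)

    def receive_transaction(self, record: bytes) -> None:
        # Block mode: records arrive as they are written; once the local
        # buffer holds a full log's worth, this member builds its own file.
        self.buffer.extend(record)
        while len(self.buffer) >= LOG_FILE_SIZE:
            log_file = bytes(self.buffer[:LOG_FILE_SIZE])
            del self.buffer[:LOG_FILE_SIZE]
            self.replayed_logs.append(log_file)


def replicate_file_mode(records, passives) -> int:
    """Ship only fully written log files; return bytes not yet shipped."""
    active_log = bytearray()
    for record in records:
        active_log.extend(record)
        while len(active_log) >= LOG_FILE_SIZE:
            log_file = bytes(active_log[:LOG_FILE_SIZE])
            del active_log[:LOG_FILE_SIZE]
            for passive in passives:
                passive.receive_log_file(log_file)
    return len(active_log)


def replicate_block_mode(records, passives) -> None:
    """Send every record to each passive copy as soon as it is written."""
    for record in records:
        for passive in passives:
            passive.receive_transaction(record)
```

In this model, after 1.5 MB of transactions both modes have replayed one log file, but only in block mode does the passive member already hold the remaining 512 KB in its own buffer.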

Regardless of the replication mode, data consistency and integrity can be maintained only if the active and passive copies of the database are in sync and in good health at all times, the indexes are properly built, and the copy and replay queues are healthy. To ensure this, administrators must periodically run health checks on the database copies, indexes, and queues in the DAG. This can be achieved using the Exchange DAG Health Summary test. This test monitors the state of the database copies, queues, and indexes in a DAG, and reports the count of unhealthy database copies, queues, and indexes. These counts indicate how healthy and reliable the DAG is.

Target of the test : A Microsoft Exchange 2013/2016 server

Agent deploying the test : An internal agent

Outputs of the test : One set of results for the Microsoft Exchange 2013/2016 server being monitored

Configurable parameters for the test
  1. Test period - How often the test should be executed.
  2. Host - The host for which the test is to be configured.
  3. Port - The port at which the host listens.

Measurements made by the test
Measurement, Description, Measurement Unit, Interpretation

Preference:

Indicates the activation preference number configured for the DAG.

Number

When creating a mailbox database copy, you can specify the activation preference number, which is used as part of Active Manager's best copy selection process. It's also used to redistribute active mailbox databases throughout the DAG when using the RedistributeActiveDatabases.ps1 script. The value for the activation preference is a number equal to or greater than one, where one is at the top of the preference order. The position number cannot be larger than the number of mailbox database copies.
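
The range rule above can be expressed as a short check. The following is a minimal sketch; the function names and the server names in the usage note are hypothetical, and the ordering helper only illustrates that one is at the top of the preference order. It does not reproduce Active Manager's best copy selection, which uses the preference as just one input.

```python
def validate_activation_preference(preference: int, copy_count: int) -> None:
    """Raise ValueError if the preference breaks the rules stated above."""
    if copy_count < 1:
        raise ValueError("a database needs at least one copy")
    if preference < 1:
        raise ValueError("activation preference must be 1 or greater")
    if preference > copy_count:
        raise ValueError(
            "activation preference cannot exceed the number of copies"
        )


def order_by_preference(copies: dict) -> list:
    """Return server names sorted so that preference 1 comes first.

    `copies` maps a (hypothetical) server name to its preference number.
    """
    return sorted(copies, key=copies.get)
```

For example, order_by_preference({"EX2": 2, "EX1": 1, "EX3": 3}) lists EX1 first, and validate_activation_preference(4, 3) raises an error because a database with three copies cannot have a copy at position four.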

Total copies:

Indicates the total number of mailbox database copies in the DAG.

Number

 

Healthy copies:

Indicates the number of database copies in the DAG that are healthy.

Number

If a mailbox database copy is in the Healthy state, it is successfully copying and replaying log files, or it has successfully copied and replayed all available log files.

A high value is desired for this measure.

Unhealthy copies:

Indicates the number of database copies in the DAG that are in an unhealthy state.

Number

A mailbox database copy is said to be unhealthy if it is in the Failed, Suspended, or Failed and Suspended state.

A mailbox database copy is in the Failed state when it isn't suspended and it isn't able to copy or replay log files. While the copy is in the Failed state and not suspended, the system periodically checks whether the problem that caused the copy status to change to Failed has been resolved. After the system detects that the problem is resolved, and barring any other issues, the copy status automatically changes to Healthy.

The mailbox database copy is in a Suspended state as a result of an administrator manually suspending the database copy by running the Suspend-MailboxDatabaseCopy cmdlet.

The Failed and Suspended states are set simultaneously by the system when a failure is detected and resolving that failure explicitly requires administrator intervention. An example is when the system detects unrecoverable divergence between the active mailbox database and a database copy. Unlike in the Failed state, the system won't periodically check whether the problem has been resolved and recover automatically. Instead, an administrator must intervene to resolve the underlying cause of the failure before the database copy can be transitioned to a healthy state.
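
The classification described above can be sketched as a simple tally over copy status values. This is an illustration only, not the test's actual logic: the status strings (including the spelling "FailedAndSuspended"), the function names, and the "other" bucket for states this page does not discuss are all assumptions.

```python
from collections import Counter

# States this page treats as unhealthy; the exact strings are assumed.
UNHEALTHY_STATES = {"Failed", "Suspended", "FailedAndSuspended"}


def summarize_copies(statuses) -> dict:
    """Count healthy and unhealthy copies from an iterable of status names."""
    statuses = list(statuses)
    counts = Counter()
    for status in statuses:
        if status == "Healthy":
            counts["healthy"] += 1
        elif status in UNHEALTHY_STATES:
            counts["unhealthy"] += 1
        else:
            # Any state not covered on this page is kept separate.
            counts["other"] += 1
    return {
        "total": len(statuses),
        "healthy": counts["healthy"],
        "unhealthy": counts["unhealthy"],
        "other": counts["other"],
    }


def recovers_automatically(status: str) -> bool:
    """Only a plain Failed copy is rechecked and healed by the system."""
    return status == "Failed"
```

A summary built this way makes it easy to separate copies that may heal on their own (Failed) from those that wait for an administrator (Suspended, or Failed and Suspended).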

Healthy queues:

Indicates the number of healthy queues in the DAG.

Number

A high value is ideal for this measure.

 

Unhealthy queues:

Indicates the number of unhealthy queues in the DAG.

Number

A low value is desired for this measure.

Lagged queues:

Indicates the number of lagged queues in the DAG.

Number

A lagged database copy is a passive database copy in a database availability group that has a delayed log replay time configured.

Normally a passive database copy will replay the transaction log data into the database immediately, so that the passive database copy is as up to date as possible.

With a lagged database copy the administrator sets a delay on the log replay, so that the database copy “lags” behind the others in terms of the latest database changes. This lag interval specifies the amount of time between when a transaction log file is generated and when it is replayed into the passive database copy. The default lag interval is 0 and the maximum lag interval is 14 days.

If the value of this measure is very high, it implies that many database copy lags are occurring. You may want to consider increasing the lag interval to minimize the queue count.
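
The effect of the lag interval on log replay can be shown with a short calculation. This is a minimal sketch, assuming a log file becomes eligible for replay once its age reaches the configured lag; the function names are hypothetical, and only the 0 default and 14-day maximum come from the description above.

```python
from datetime import datetime, timedelta

DEFAULT_LAG = timedelta(0)    # default lag interval stated above
MAX_LAG = timedelta(days=14)  # maximum lag interval stated above


def validate_lag(lag: timedelta) -> None:
    """Raise ValueError if the lag is outside the 0 to 14 day range."""
    if lag < timedelta(0) or lag > MAX_LAG:
        raise ValueError("lag interval must be between 0 and 14 days")


def eligible_for_replay(generated_at: datetime, now: datetime,
                        lag: timedelta = DEFAULT_LAG) -> bool:
    """Return True once the log file is at least `lag` old."""
    validate_lag(lag)
    return now - generated_at >= lag
```

With a lag of seven days, a log file generated one day ago is still held back, so it stays in the queue of the lagged copy; with the default lag of 0 the same file is eligible immediately.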

Healthy indexes:

Indicates the number of healthy indexes in the DAG.

Number

 

Unhealthy indexes:

Indicates the number of unhealthy indexes in the DAG.

Number

A low value is desired for this measure.