Recovery Service Vaults Test

A Recovery Services vault is a storage entity in Azure that houses data. The data is typically copies of data, or configuration information for virtual machines (VMs), workloads, servers, or workstations. You can use Recovery Services vaults to hold backup data for various Azure services such as IaaS VMs (Linux or Windows) and Azure SQL databases. The vault also stores recovery points created over time and backup policies associated with protected virtual machines. Recovery Services vaults support System Center DPM, Windows Server, Azure Backup Server, and more.

If backup jobs keep failing or take too long to complete, then when disaster strikes, a recent backup will not be available in the vault to enable seamless recovery. This can result in the loss of critical business or configuration information. Administrators should therefore keep a close watch on the progress of backup jobs, and rapidly detect job delays and failures. Proactive detection of potential job failures is also essential, as it can help avert such irrecoverable data losses.

Azure Backup automatically handles storage for the vault. It is important to know the storage replication type set for the vault, and how much of the redundant storage space is consumed by the backed up data. This will help administrators assess the storage requirement of the backups. Without this usage insight, there could come a time when there is not enough space in Azure storage for backups. In such situations, there is bound to be significant data loss.

To avoid backup failures, latencies, and storage space contentions in a Recovery Services Vault, administrators can periodically run the Recovery Service Vaults test. This test monitors all the Recovery Services Vaults configured for every resource group of a target Azure subscription. For each vault, the test monitors the status of that vault, and alerts administrators if any errors/abnormalities are noticed in the vault. Additionally, the test notifies administrators if backup/recovery jobs fail, and also if VMs/protected items in any vault are in a Critical/Warning state. Moreover, the test also measures the storage space consumed by each vault in local and geo-redundant storage. In the process, the test points you to vaults that may be over-utilizing redundant storage. Furthermore, the test also draws administrator attention to backup jobs with critical issues, so that administrators can quickly troubleshoot the issues and avert backup job failures.

Target of the Test: A Microsoft Azure Subscription

Agent deploying the test: A remote agent

Output of the test: One set of results for every recovery services vault configured for each resource group of the target subscription
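The sketch below illustrates how one result set per vault per resource group could be derived from Azure Resource Manager resource IDs. It is not the eG agent's implementation. The function name and the sample IDs are hypothetical, and it assumes vault IDs follow the standard form /subscriptions/{id}/resourceGroups/{group}/providers/Microsoft.RecoveryServices/vaults/{name}.

```python
from collections import defaultdict


def group_vaults_by_resource_group(resource_ids):
    """Group vault names by resource group, based on ARM resource IDs.

    Hypothetical helper for illustration only.
    """
    grouped = defaultdict(list)
    for rid in resource_ids:
        parts = rid.strip("/").split("/")
        # A resource ID alternates key/value segments; lower-case the keys
        # so that "resourceGroups" and "resourcegroups" are treated alike.
        fields = {parts[i].lower(): parts[i + 1]
                  for i in range(0, len(parts) - 1, 2)}
        grouped[fields["resourcegroups"]].append(fields["vaults"])
    return dict(grouped)


# Sample (made-up) resource IDs
sample_ids = [
    "/subscriptions/0000/resourceGroups/rg-prod/providers/Microsoft.RecoveryServices/vaults/vault-a",
    "/subscriptions/0000/resourceGroups/rg-prod/providers/Microsoft.RecoveryServices/vaults/vault-b",
    "/subscriptions/0000/resourceGroups/rg-dev/providers/Microsoft.RecoveryServices/vaults/vault-c",
]
vaults = group_vaults_by_resource_group(sample_ids)
```

Each (resource group, vault) pair in the result corresponds to one descriptor for which measures would be reported.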

Configurable parameters for the test
Parameters Description

Test Period

How often should the test be executed.

Host

The host for which the test is to be configured.

Subscription ID

Specify the GUID which uniquely identifies the Microsoft Azure Subscription to be monitored. To know the ID that maps to the target subscription, do the following:

  1. Log in to the Microsoft Azure Portal.

  2. When the portal opens, click on the Subscriptions option (as indicated by Figure 1).

    Figure 1 : Clicking on the Subscriptions option

  3. Figure 2 that appears next will list all the subscriptions that have been configured for the target Azure AD tenant. Locate the subscription that is being monitored in the list, and check the value displayed for that subscription in the Subscription ID column.

    Figure 2 : Determining the Subscription ID

  4. Copy the Subscription ID in Figure 2 to the text box corresponding to the SUBSCRIPTION ID parameter in the test configuration page.
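Because the Subscription ID is a GUID, a copied value can be sanity-checked before it is pasted into the test configuration page. The following is a minimal sketch. The function name is hypothetical, and it assumes the ID is expected in the canonical hyphenated form shown in the portal.

```python
import uuid


def is_valid_subscription_id(value):
    """Return True if value is a GUID in canonical hyphenated form.

    Hypothetical helper; surrounding whitespace and letter case are ignored.
    """
    candidate = value.strip().lower()
    try:
        # uuid.UUID accepts several spellings; comparing the round-tripped
        # string enforces the 8-4-4-4-12 hyphenated layout.
        return str(uuid.UUID(candidate)) == candidate
    except ValueError:
        return False
```

For example, a value with a character missing from the last group, as can happen with a partial copy, would be rejected.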

Tenant ID

Specify the Directory ID of the Azure AD tenant to which the target subscription belongs. To know how to determine the Directory ID, refer to Configuring the eG Agent to Monitor the Microsoft Azure App Service.

Client ID and Client Password

The eG agent communicates with the target Microsoft Azure Subscription using Java API calls. To collect the required metrics, the eG agent requires an Access token in the form of an Application ID and the client secret value. To know how to determine the Application ID and the client secret value, refer to Configuring the eG Agent to Monitor the Microsoft Azure App Service. Specify the Application ID of the created Application in the Client ID text box and the client secret value in the Client Password text box.
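For context, an Application ID and client secret are commonly exchanged for an access token using the OAuth 2.0 client credentials flow of the Microsoft identity platform. The sketch below only builds such a request and does not send it. It is an illustration in Python, not the eG agent's Java implementation. The function name is hypothetical, and the scope value assumes the token is intended for Azure Resource Manager.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(tenant_id, client_id, client_secret):
    """Build (but do not send) a client credentials token request.

    Hypothetical helper; the scope assumes Azure Resource Manager access.
    """
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,          # Application ID (Client ID parameter)
        "client_secret": client_secret,  # secret value (Client Password parameter)
        "scope": "https://management.azure.com/.default",
    }).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Placeholder values, not real credentials
req = build_token_request("tenant-guid", "app-id", "s3cret")
```

Passing the returned object to urllib.request.urlopen would perform the call. This sketch deliberately stops short of that, so no credentials leave the machine.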

Proxy Host

In some environments, all communication with the Azure cloud may be routed through a proxy server. In such environments, you should make sure that the eG agent connects to the cloud via the proxy server and collects metrics. To enable metrics collection via a proxy, specify the IP address of the proxy server and the port at which the server listens against the Proxy Host and Proxy Port parameters. By default, these parameters are set to none, indicating that the eG agent is not configured to communicate via a proxy.

Proxy Username, Proxy Password and Confirm Password

If the proxy server requires authentication, then, specify a valid proxy user name and password in the Proxy Username and Proxy Password parameters, respectively. Then, confirm the password by retyping it in the Confirm Password text box.
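The way the proxy parameters combine can be sketched as follows. This is an illustration only. The function name is hypothetical, and it assumes an HTTP proxy. Treating none in the user name as "no authentication" is also an assumption, since the source states the none default only for Proxy Host and Proxy Port.

```python
from urllib.parse import quote


def build_proxy_url(host, port, username=None, password=None):
    """Return a proxy URL, or None when no proxy is configured.

    Hypothetical helper; assumes an HTTP proxy.
    """
    # "none" is the documented default meaning "do not use a proxy"
    if not host or str(host).strip().lower() == "none":
        return None
    credentials = ""
    if username and username.strip().lower() != "none":
        # Percent-encode credentials so characters like '@' or ' ' are safe
        credentials = quote(username, safe="")
        if password:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    return f"http://{credentials}{host}:{port}"
```

A return value of None signals a direct connection. Any other value can be passed to an HTTP client's proxy setting.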

Detailed Diagnosis

To make diagnosis more efficient and accurate, the eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option.

The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

  • The eG manager license should allow the detailed diagnosis capability
  • Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
Measures made by the test:
Measurement Description Measurement Unit Interpretation

Status

Indicates the current status of this recovery services vault.

 

The values reported by this measure and its numeric equivalents are mentioned in the table below:

Measure Value Numeric Value
Succeeded 1
Updating 2
Error 3

Note:

By default, this measure reports the Measure Values listed in the table above to indicate the current status of the recovery services vault. In the graph of this measure, however, the status is represented using the numeric equivalents only.

Use the detailed diagnosis of this measure to know the location, tier, and recovery service type of the vault.
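When reading graph data for the Status measure, the numeric equivalents can be translated back to their Measure Values. The mapping below mirrors the table above. The function name and the "Unknown" fallback for unlisted values are assumptions made for illustration.

```python
# Measure Value -> Numeric Value, as listed in the table for the Status measure
STATUS_TO_NUMERIC = {"Succeeded": 1, "Updating": 2, "Error": 3}

# Reverse lookup for interpreting values plotted in the graph
NUMERIC_TO_STATUS = {num: label for label, num in STATUS_TO_NUMERIC.items()}


def status_label(numeric_value):
    """Translate a graphed numeric value into its Measure Value.

    Returns "Unknown" (an assumed fallback) for values not in the table.
    """
    return NUMERIC_TO_STATUS.get(numeric_value, "Unknown")
```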

Backup management servers

Indicates the number of backup management servers available in this vault.

Number

 

Backup items

Indicates the number of items backed up in this vault.

Number

 

Virtual machines

Indicates the number of VMs in this vault.

Number

 

Protected items in critical state

Indicates the number of protected items in this vault that are in Critical state.

Number

These measures represent the replication health of protected items - i.e., items that are replication-enabled - in the vault.

If an item is in the Critical state, it implies that one or more critical replication error symptoms have been detected in that item. These error symptoms are typically indicators that replication is stuck, or is not progressing as fast as the data change rate.

If an item is in the Warning state, it implies that one or more warning symptoms that might impact replication are detected in that item.

Ideally therefore, the value of these measures should be 0.

Protected items in warning state

Indicates the number of protected items in this vault that are in Warning state.

Number

Virtual machines in critical state

Indicates the number of VMs in this vault that are in Critical state.

Number

These measures represent the replication health of VMs in the vault.

If a VM is in the Critical state, it implies that one or more critical replication error symptoms have been detected in that VM. These error symptoms are typically indicators that replication is stuck, or is not progressing as fast as the data change rate.

If a VM is in the Warning state, it implies that one or more warning symptoms that might impact replication are detected in that VM.

Ideally therefore, the value of these measures should be 0.

Virtual machines in warning state

Indicates the number of VMs in this vault that are in Warning state.

Number

Backup files and folders

Indicates the number of files and folders backed up to this vault.

Number

 

Data protection manager

Indicates the number of data protection managers registered with this vault.

Number

System Center Data Protection Manager (DPM) is a robust enterprise backup and recovery system that contributes to your BCDR strategy by facilitating the backup and recovery of enterprise data.

With DPM running on a physical server or on-premises VM, you can back up data to a Recovery Services vault in Azure, in addition to disk and tape backup. You can deploy DPM on an Azure VM, and can back up data to Azure disks attached to the VM, or back up the data to a Recovery Services vault.

Backup server

Indicates the number of backup servers in this vault.

Number

 

In progress

Indicates the number of backup jobs that are in progress in this vault.

Number

If the value of this measure grows consistently, it could imply that the vault is taking longer than usual to process backup jobs. This could warrant an investigation.
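One way to make "grows consistently" concrete is to check whether the most recent samples of the measure increase at every step. The following is a minimal sketch. The function name and the default window of five samples are assumptions, not thresholds used by the test.

```python
def is_consistently_growing(samples, window=5):
    """Return True if the last `window` samples are strictly increasing.

    Hypothetical helper; the window size is an arbitrary illustrative default.
    """
    recent = samples[-window:]
    if len(recent) < window:
        # Not enough history to call it a trend
        return False
    # Compare each sample with its successor
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

The same check applies to any measure in this test whose interpretation refers to a consistently growing value.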

Failed backup jobs

Indicates the number of backup jobs in this vault that failed.

Number

Ideally, the value of this measure should be 0.

Cloud - GRS

Indicates the amount of space that has been used by this vault in geo-redundant storage in the cloud.

MB

Geo-redundant storage (GRS) copies your data synchronously three times within a single physical location in the primary region using LRS. It then copies your data asynchronously to a single physical location in a secondary region that is hundreds of miles away from the primary region.

Compare the value of this measure with that of the Cloud - LRS measure to know which type of redundant storage is excessively utilized by the vault.

Cloud - LRS

Indicates the amount of space that has been used by this vault in locally-redundant storage in the cloud.

MB

Locally redundant storage (LRS) replicates your data three times within a single data center in the primary region.

Compare the value of this measure with that of the Cloud - GRS measure to know which type of redundant storage is excessively utilized by the vault.
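The comparison between the two storage measures can be sketched as a small helper that reports which measure is larger and each one's share of the total. The function name and the returned keys are hypothetical. Inputs are the values of the Cloud - GRS and Cloud - LRS measures in MB.

```python
def compare_redundant_storage(grs_mb, lrs_mb):
    """Compare the Cloud - GRS and Cloud - LRS measures (both in MB).

    Hypothetical helper; returns the larger measure and each one's share.
    """
    total = grs_mb + lrs_mb
    if total == 0:
        # Nothing stored yet, so there is nothing to compare
        return {"larger": None, "grs_share": 0.0, "lrs_share": 0.0}
    if grs_mb > lrs_mb:
        larger = "Cloud - GRS"
    elif lrs_mb > grs_mb:
        larger = "Cloud - LRS"
    else:
        larger = None  # equal usage
    return {
        "larger": larger,
        "grs_share": grs_mb / total,
        "lrs_share": lrs_mb / total,
    }
```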

Protected instances

Indicates the number of managed instances in this vault.

Number

 

Deduplication - GRS

Indicates the amount of data that has been deduplicated from the geo-redundant storage used by this vault.

MB

Data Deduplication, often called Dedup for short, is a feature that can help reduce the impact of redundant data on storage costs. When enabled, Data Deduplication optimizes free space on a volume by examining the data on the volume and looking for duplicated portions. Duplicated portions of the volume's dataset are stored once and are (optionally) compressed for additional savings.

If the values of these measures are low, while the values of the Cloud - GRS and Cloud - LRS measures are consistently growing, it could mean that not enough data has been deduplicated.

Deduplication - LRS

Indicates the amount of data that has been deduplicated from the locally-redundant storage used by this vault.

MB

Backup engines disk usage

Indicates the amount of disk space used by the backup engine.

MB

A high value is indicative of excessive disk space usage by the backup engine.

Replicated items

Indicates the number of replicated items in this vault.

Number

 

Recovery plan

Indicates the number of recovery plans in this vault.

Number

A recovery plan gathers machines into recovery groups for the purpose of failover. A recovery plan helps you to define a systematic recovery process, by creating small independent units that you can fail over. A unit typically represents an app in your environment.

A recovery plan defines how machines fail over, and the sequence in which they start after failover. Recovery plans can be used for both failover to and failback from Azure.

Unhealthy servers

Indicates the number of unhealthy servers in this vault.

Number

Ideally, the value of this measure should be 0.

Updates available

Indicates the number of servers registered with this vault that have updates available.

Number

If this measure reports a non-zero value, it could mean that one or more servers in the vault are missing some important updates. In such a case, it would be wise to update the servers without delay, as outdated servers can cause backup/recovery failures.

Unsupported servers

Indicates the number of unsupported servers in this vault.

Number

 

Supported servers

Indicates the number of supported servers in this vault.

Number

 

Events

Indicates the number of events generated during recovery jobs in this vault.

Number

 

Failed recovery jobs

Indicates the number of recovery jobs in this vault that failed.

Number

Ideally, the value of this measure should be 0.

Recovery jobs in progress

Indicates the number of recovery jobs that are in progress in this vault.

Number

If the value of this measure grows consistently, it could imply that the vault is taking longer than usual to process recovery jobs. This could warrant an investigation.

Jobs waiting for input

Indicates the number of recovery jobs in this vault that are waiting for input.

Number

 

Registered servers

Indicates the number of servers registered with this vault.

Number

 

Providers auth type

Indicates the number of authentication types provided by this vault.

Number

 

Replicating protected items

Indicates the number of protected items in this vault that are replicating currently.

Number

You perform a failover as part of your business continuity and disaster recovery (BCDR) strategy.

As a first step in your BCDR strategy, you replicate your on-premises items to Azure on an ongoing basis. Users access workloads and apps running on the on-premises sources.

If the need arises, for example if there's an outage on-premises, you fail the replicating items over to Azure.

Failed over protected items

Indicates the number of items that were failed over to this vault.

Number

Test failover applicable

Indicates the number of items in this vault that were failed over for test failover.

Number

You run a test failover to validate your replication and disaster recovery strategy, without any data loss or downtime. A test failover does not impact ongoing replication, or your production environment. You can run a test failover on a specific virtual machine (VM), or on a recovery plan containing multiple VMs.

HyperV to Azure

Indicates the number of HyperV VMs replicated to this vault.

Number

 

VMM to Azure

Indicates the number of VMM VMs replicated to this vault.

Number

 

VMware to Azure

Indicates the number of VMware VMs replicated to this vault.

Number

 

Azure to Azure

Indicates the number of Azure VMs replicated to this vault.

Number

 

Critical

Indicates the number of backup/recovery jobs in this vault that lead to the generation of a Critical alert.

Number

In principle, any backup or recovery failure (scheduled or user-triggered) generates an alert that is shown as a Critical alert. Destructive operations, such as deleting a backup, also generate Critical alerts.

Ideally therefore, the value of this measure should be 0.

Warning

Indicates the number of backup/recovery jobs in this vault that lead to the generation of a Warning alert.

Number

If a backup/recovery operation succeeds but with a few warnings, those warnings are listed as Warning alerts.

Ideally, the value of this measure should be 0.

Use the detailed diagnosis of the Status measure to know the location, tier, and recovery service type of the vault.

Figure 3 : The detailed diagnosis of the Status measure reported by the Recovery Service Vaults test