Recovery Service Vaults Test
A Recovery Services vault is a storage entity in Azure that houses data. The data is typically copies of data, or configuration information for virtual machines (VMs), workloads, servers, or workstations. You can use Recovery Services vaults to hold backup data for various Azure services such as IaaS VMs (Linux or Windows) and Azure SQL databases. The vault also stores recovery points created over time and backup policies associated with protected virtual machines. Recovery Services vaults support System Center DPM, Windows Server, Azure Backup Server, and more.
If backup jobs keep failing or take too long to complete, then, when disaster strikes, a recent backup will not be available in the vault to enable seamless recovery. This can result in the loss of critical business or configuration information. Administrators should therefore keep a close watch on the progress of backup jobs, and rapidly detect job delays and failures. Proactive detection of potential job failures is also essential, as it can help avert such irrecoverable data loss.
Azure Backup automatically handles storage for the vault. It is important to know the storage replication type set for the vault, and how much of the redundant storage space is consumed by the backed up data. This will help administrators assess the storage requirement of the backups. Without this usage insight, there could come a time when there is not enough space in Azure storage for backups. In such situations, there is bound to be significant data loss.
To avoid backup failures, latencies, and storage space contention in a Recovery Services vault, administrators can periodically run the Recovery Service Vaults test. This test monitors all the Recovery Services vaults configured for every resource group of a target Azure subscription. For each vault, the test monitors the status of that vault and alerts administrators if any errors or abnormalities are noticed. The test also notifies administrators if backup or recovery jobs fail, and if VMs or protected items in any vault are in a Critical or Warning state. It measures the storage space consumed by each vault in locally-redundant and geo-redundant storage, and so points you to vaults that may be over-utilizing redundant storage. Finally, the test draws administrator attention to backup jobs with critical issues, so that administrators can quickly troubleshoot the issues and avert backup job failures.
Target of the Test: A Microsoft Azure Subscription
Agent deploying the test: A remote agent
Output of the test: One set of results for every recovery services vault configured for each resource group of the target subscription
Parameters | Description |
---|---|
Test Period | How often the test should be executed. |
Host | The host for which the test is to be configured. |
Subscription ID | Specify the GUID which uniquely identifies the Microsoft Azure Subscription to be monitored. To know the ID that maps to the target subscription, do the following: |
Tenant ID | Specify the Directory ID of the Azure AD tenant to which the target subscription belongs. To know how to determine the Directory ID, refer to Configuring the eG Agent to Monitor a Microsoft Azure Subscription Using Azure ARM REST API. |
Client ID, Client Password, and Confirm Password | To connect to the target subscription, the eG agent requires an Access token in the form of an Application ID and the client secret value. For this purpose, you should register a new application with the Azure AD tenant. To know how to create such an application and determine its Application ID and client secret, refer to Configuring the eG Agent to Monitor a Microsoft Azure Subscription Using Azure ARM REST API. Specify the Application ID of the created Application in the Client ID text box and the client secret value in the Client Password text box. Confirm the Client Password by retyping it in the Confirm Password text box. |
Proxy Host and Proxy Port | In some environments, all communication with the Azure cloud may have to be routed through a proxy server. In such environments, you should make sure that the eG agent connects to the cloud via the proxy server and collects metrics. To enable metrics collection via a proxy, specify the IP address of the proxy server and the port at which the server listens against the Proxy Host and Proxy Port parameters. By default, these parameters are set to none, indicating that the eG agent is not configured to communicate via a proxy. |
Proxy Username, Proxy Password and Confirm Password | If the proxy server requires authentication, specify a valid proxy user name and password in the Proxy Username and Proxy Password parameters, respectively. Then, confirm the password by retyping it in the Confirm Password text box. |
Detailed Diagnosis | To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option. The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled: |
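The Tenant ID, Client ID, Client Password, and proxy parameters above can be pictured as inputs to an OAuth 2.0 client-credentials token request. The following is a minimal sketch only, not the eG agent's actual implementation: it assumes the Azure AD v2.0 token endpoint and the Azure Resource Manager default scope, and the function names `build_token_request` and `build_proxies` are hypothetical. It builds the request pieces without sending anything over the network.

```python
from urllib.parse import quote, urlencode

# Assumptions (not stated in this document): the Azure AD v2.0 token endpoint
# and the Azure Resource Manager ".default" scope.
TOKEN_URL = "https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
ARM_SCOPE = "https://management.azure.com/.default"


def build_token_request(tenant_id, client_id, client_password):
    """Return (url, form_body) for a client-credentials token request."""
    url = TOKEN_URL.format(tenant_id=tenant_id)
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,            # Application ID (Client ID parameter)
        "client_secret": client_password,  # client secret (Client Password parameter)
        "scope": ARM_SCOPE,
    })
    return url, body


def build_proxies(proxy_host="none", proxy_port="none",
                  proxy_username=None, proxy_password=None):
    """Return a scheme-to-URL proxy mapping, or {} when host/port are 'none'."""
    if str(proxy_host).lower() == "none" or str(proxy_port).lower() == "none":
        return {}  # default: no proxy configured
    auth = ""
    if proxy_username:
        # Percent-encode credentials so characters like '@' do not break the URL.
        user = quote(proxy_username, safe="")
        password = quote(proxy_password or "", safe="")
        auth = f"{user}:{password}@"
    proxy_url = f"http://{auth}{proxy_host}:{proxy_port}"
    return {"http": proxy_url, "https": proxy_url}
```

The mapping returned by `build_proxies` has the shape accepted by `urllib.request.ProxyHandler`; an empty mapping corresponds to the default of none for Proxy Host and Proxy Port.
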
Measurement | Description | Measurement Unit | Interpretation | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Status |
Indicates the current status of this recovery services vault. |
|
The values reported by this measure and its numeric equivalents are mentioned in the table below:
Note: By default, this measure reports the Measure Values listed in the table above to indicate the current status of the recovery services vault. In the graph of this measure, however, the same is represented using the numeric equivalents only. Use the detailed diagnosis of this measure to know the location, tier, and recovery service type of the vault. |
||||||||
Backup management servers |
Indicates the number of backup management servers available in this vault. |
Number |
|
||||||||
Backup items |
Indicates the number of items backed up in this vault. |
Number |
|
||||||||
Virtual machines |
Indicates the number of VMs in this vault. |
Number |
|
||||||||
Protected items in critical state |
Indicates the number of protected items in this vault that are in Critical state. |
Number |
These measures represent the replication health of protected items - i.e., items that are replication-enabled - in the vault. If an item is in the Critical state, it implies that one or more critical replication error symptoms have been detected in that item. These error symptoms typically indicate that replication is stuck, or is not progressing as fast as the data change rate. If an item is in the Warning state, it implies that one or more warning symptoms that might impact replication have been detected in that item. Ideally, therefore, the value of these measures should be 0. |
||||||||
Protected items in warning state |
Indicates the number of protected items in this vault that are in Warning state. |
Number |
|||||||||
Virtual machines in critical state |
Indicates the number of VMs in this vault that are in Critical state. |
Number |
These measures represent the replication health of VMs in the vault. If a VM is in the Critical state, it implies that one or more critical replication error symptoms have been detected in that VM. These error symptoms typically indicate that replication is stuck, or is not progressing as fast as the data change rate. If a VM is in the Warning state, it implies that one or more warning symptoms that might impact replication have been detected in that VM. Ideally, therefore, the value of these measures should be 0. |
||||||||
Virtual machines in warning state |
Indicates the number of VMs in this vault that are in Warning state. |
Number |
|||||||||
Backup files and folders |
Indicates the number of files and folders backed up to this vault. |
Number |
|
||||||||
Data protection manager |
Indicates the number of data protection managers registered with this vault. |
Number |
System Center Data Protection Manager (DPM) is a robust enterprise backup and recovery system that contributes to your BCDR strategy by facilitating the backup and recovery of enterprise data. With DPM running on a physical server or on-premises VM, you can back up data to a Recovery Services vault in Azure, in addition to disk and tape backup. You can deploy DPM on an Azure VM, and can back up data to Azure disks attached to the VM, or back up the data to a Recovery Services vault. |
||||||||
Backup server |
Indicates the number of backup servers in this vault. |
Number |
|
||||||||
In progress |
Indicates the number of backup jobs that are in progress in this vault. |
Number |
If the value of this measure grows consistently, it could imply that the vault is taking longer than usual to process backup jobs. This could warrant an investigation. |
||||||||
Failed backup jobs |
Indicates the number of backup jobs in this vault that failed. |
Number |
Ideally, the value of this measure should be 0. |
||||||||
Cloud - GRS |
Indicates the amount of space that has been used by this vault in geo-redundant storage in the cloud. |
MB |
Geo-redundant storage (GRS) copies your data synchronously three times within a single physical location in the primary region using LRS. It then copies your data asynchronously to a single physical location in a secondary region that is hundreds of miles away from the primary region. Compare the value of this measure with that of the Cloud - LRS measure to know which type of redundant storage is excessively utilized by the vault. |
||||||||
Cloud - LRS |
Indicates the amount of space that has been used by this vault in locally-redundant storage in the cloud. |
MB |
Locally redundant storage (LRS) replicates your data three times within a single data center in the primary region. Compare the value of this measure with that of the Cloud - GRS measure to know which type of redundant storage is excessively utilized by the vault. |
||||||||
Protected instances |
Indicates the number of managed instances in this vault. |
Number |
|
||||||||
Deduplication - GRS |
Indicates the amount of data that has been deduplicated from the geo-redundant storage used by this vault. |
MB |
Data Deduplication, often called Dedup for short, is a feature that can help reduce the impact of redundant data on storage costs. When enabled, Data Deduplication optimizes free space on a volume by examining the data on the volume and looking for duplicated portions. Duplicated portions of the volume's dataset are stored once and are (optionally) compressed for additional savings. If the values of these measures are low, while the values of the Cloud - GRS and Cloud - LRS measures are consistently growing, it could mean that not enough data has been deduplicated. |
||||||||
Deduplication - LRS |
Indicates the amount of data that has been deduplicated from the locally-redundant storage used by this vault. |
MB |
|||||||||
Backup engines disk usage |
Indicates the amount of disk space used by the backup engine. |
MB |
A high value is indicative of excessive disk space usage by the backup engine. |
||||||||
Replicated items |
Indicates the number of replicated items in this vault. |
Number |
|
||||||||
Recovery plan |
Indicates the number of recovery plans in this vault. |
Number |
A recovery plan gathers machines into recovery groups for the purpose of failover. A recovery plan helps you to define a systematic recovery process, by creating small independent units that you can fail over. A unit typically represents an app in your environment. A recovery plan defines how machines fail over, and the sequence in which they start after failover. Recovery plans can be used for both failover to and failback from Azure. |
||||||||
Unhealthy servers |
Indicates the number of unhealthy servers in this vault. |
Number |
Ideally, the value of this measure should be 0. |
||||||||
Updates available |
Indicates the number of servers registered with this vault that have updates available. |
Number |
If this measure reports a non-zero value, it could mean that one or more servers in the vault are missing some important updates. In such a case, it would be wise to update the servers without delay, as outdated servers can cause backup/recovery failures. |
||||||||
Unsupported servers |
Indicates the number of unsupported servers in this vault. |
Number |
|
||||||||
Supported servers |
Indicates the number of supported servers in this vault. |
Number |
|
||||||||
Events |
Indicates the number of events generated during recovery jobs in this vault. |
Number |
|
||||||||
Failed recovery jobs |
Indicates the number of recovery jobs in this vault that failed. |
Number |
Ideally, the value of this measure should be 0. |
||||||||
Recovery jobs in progress |
Indicates the number of recovery jobs that are in progress in this vault. |
Number |
If the value of this measure grows consistently, it could imply that the vault is taking longer than usual to process recovery jobs. This could warrant an investigation. |
||||||||
Jobs waiting for input |
Indicates the number of recovery jobs in this vault that are waiting for input. |
Number |
|
||||||||
Registered servers |
Indicates the number of servers registered with this vault. |
Number |
|
||||||||
Providers auth type |
Indicates the number of authentication types provided by this vault. |
Number |
|
||||||||
Replicating protected items |
Indicates the number of protected items in this vault that are replicating currently. |
Number |
You perform a failover as part of your business continuity and disaster recovery (BCDR) strategy. As a first step in your BCDR strategy, you replicate your on-premises items to Azure on an ongoing basis. Users access workloads and apps running on the on-premises sources. If the need arises, for example if there's an outage on-premises, you fail the replicating items over to Azure. |
||||||||
Failed over protected items |
Indicates the number of items that were failed over to this vault. |
Number |
|||||||||
Test failover applicable |
Indicates the number of items in this vault that were failed over for test failover. |
Number |
You run a test failover to validate your replication and disaster recovery strategy, without any data loss or downtime. A test failover does not impact ongoing replication, or your production environment. You can run a test failover on a specific virtual machine (VM), or on a recovery plan containing multiple VMs. |
||||||||
HyperV to Azure |
Indicates the number of Hyper-V VMs replicated to this vault. |
Number |
|
||||||||
VMM to Azure |
Indicates the number of VMM VMs replicated to this vault. |
Number |
|
||||||||
VMware to Azure |
Indicates the number of VMware VMs replicated to this vault. |
Number |
|
||||||||
Azure to Azure |
Indicates the number of Azure VMs replicated to this vault. |
Number |
|
||||||||
Critical |
Indicates the number of backup/recovery jobs in this vault that led to the generation of a Critical alert. |
Number |
In principle, any backup or recovery failure (scheduled or user-triggered) generates an alert that is shown as a Critical alert, as do destructive operations such as deleting a backup. Ideally, therefore, the value of this measure should be 0. |
||||||||
Warning |
Indicates the number of backup/recovery jobs in this vault that led to the generation of a Warning alert. |
Number |
If a backup/recovery operation succeeds but with a few warnings, those warnings are listed as Warning alerts. Ideally, the value of this measure should be 0. |
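The interpretations in the table above reduce to three simple checks: measures that should ideally be 0, a comparison of the Cloud - GRS and Cloud - LRS measures, and a job count that grows consistently. The sketch below illustrates those checks in Python. It is not part of eG Enterprise: the dictionary of measure values, the function names, and the rule that at least three readings are needed to call a trend consistent are all assumptions made for illustration.

```python
# Measures whose interpretation above says the ideal value is 0.
ZERO_IDEAL_MEASURES = (
    "Protected items in critical state",
    "Protected items in warning state",
    "Virtual machines in critical state",
    "Virtual machines in warning state",
    "Failed backup jobs",
    "Unhealthy servers",
    "Failed recovery jobs",
    "Critical",
    "Warning",
)


def nonzero_measures(sample):
    """Return {measure: value} for zero-ideal measures reporting a value above 0."""
    return {name: sample[name] for name in ZERO_IDEAL_MEASURES
            if sample.get(name, 0) > 0}


def larger_redundant_storage(sample):
    """Return the measure (Cloud - GRS or Cloud - LRS) with more MB used, or None on a tie."""
    grs = sample.get("Cloud - GRS", 0)
    lrs = sample.get("Cloud - LRS", 0)
    if grs == lrs:
        return None
    return "Cloud - GRS" if grs > lrs else "Cloud - LRS"


def grows_consistently(history):
    """True if each reading exceeds the previous one (assumes >= 3 readings are needed)."""
    return len(history) >= 3 and all(b > a for a, b in zip(history, history[1:]))


# Hypothetical readings for one vault, keyed by the measure names in the table.
sample = {"Failed backup jobs": 2, "Unhealthy servers": 0,
          "Cloud - GRS": 51200, "Cloud - LRS": 1024}
print(nonzero_measures(sample))
print(larger_redundant_storage(sample))
print(grows_consistently([1, 3, 4, 7]))  # e.g. successive In progress readings
```

The same `grows_consistently` check applies equally to the In progress and Recovery jobs in progress measures, since both carry the same interpretation.
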
Use the detailed diagnosis of the Status measure to know the location, tier, and recovery service type of the vault.
Figure 3 : The detailed diagnosis of the Status measure reported by the Recovery Service Vaults test