AWS Backup Jobs Test

AWS Backup is a fully managed backup service that makes it easy to centralize and automate the backing up of data across AWS services.

With AWS Backup, you can create backup policies called backup plans. You can use these plans to define your backup requirements, such as how frequently to back up your data and how long to retain those backups. AWS Backup lets you apply backup plans to your AWS resources by simply tagging them. AWS Backup then automatically backs up your AWS resources according to the backup plan that you defined. With backup plans, you can even copy backups to multiple AWS accounts or AWS Regions on demand or automatically. These copies ensure that your backups are safe and secure. A backup vault is where AWS Backup stores all the resource backups. The content that is backed up at a specified time and stored in the vault is also referred to as a recovery point. Using AWS Backup, you can even restore from a specific recovery point, without any hassles.

Owing to such robust backup and restore capabilities, many organizations rely heavily on AWS Backup to implement their disaster recovery (DR) practices. In such organizations, if the backup, copy, or restore jobs run by the AWS Backup service fail frequently or run slowly, data integrity may be severely compromised. Similarly, the existence of partial recovery points in the vault can also compromise data integrity. As a result, the organization may not have reliable, up-to-date backups to restore from when disaster strikes, and data loss after recovery becomes likely. To avoid such adverse outcomes, administrators should track the status of backup / copy / restore jobs and recovery points, quickly spot failed or slow jobs and partial recovery points, and resolve issues before they undermine the disaster recovery framework. This is exactly where the AWS Backup Jobs test helps!

This test monitors the status of the jobs run by the AWS Backup service, and alerts administrators if any backup / copy / restore job fails or is slow. The detailed diagnostics of this test point you to the problematic jobs and share crucial job information with you, thus helping you rapidly troubleshoot the failure / slowness. Likewise, the test also tracks the state of recovery points, and sends out notifications if any recovery point is incomplete. Using the detailed metrics reported by the test, you can swiftly identify the partial recovery points, and investigate why those recovery points are not fully formed. This way, the test provides useful pointers for fine-tuning backup plans and ensuring error-free, fail-safe backups and restorations.

Target of the test: Amazon Cloud

Agent deploying the test: A remote agent

Output of the test: One set of results for each ResourceType or VaultName, depending upon the AWS Backup Filter Name chosen
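
As a rough illustration of how such per-descriptor results could be derived, the sketch below tallies job records by resource type or by vault name. It assumes that job records are shaped like the entries returned by the AWS Backup ListBackupJobs API (fields State, ResourceType, and BackupVaultName). The mapping from job states to the measures of this test is an assumption made for illustration, and is not taken from the eG agent's implementation.

```python
from collections import Counter, defaultdict

# Assumed mapping from AWS Backup job states to the backup-job measures
# of this test (illustrative only).
STATE_TO_MEASURE = {
    "CREATED": "Backup jobs created",
    "PENDING": "Backup jobs about to run",
    "RUNNING": "Backup jobs currently running",
    "ABORTED": "Backup jobs cancelled by users",
    "COMPLETED": "Backup jobs that finished",
    "FAILED": "Backup jobs that failed",
    "EXPIRED": "Backup jobs that expired",
}


def tally_jobs(jobs, filter_name="ResourceType"):
    """Count jobs per measure, grouped by ResourceType or BackupVaultName."""
    if filter_name not in ("ResourceType", "BackupVaultName"):
        raise ValueError("filter_name must be ResourceType or BackupVaultName")
    results = defaultdict(Counter)
    for job in jobs:
        measure = STATE_TO_MEASURE.get(job.get("State"))
        if measure is None:
            continue  # states without a matching measure are ignored
        results[job.get(filter_name, "unknown")][measure] += 1
    return {descriptor: dict(counts) for descriptor, counts in results.items()}
```

Calling tally_jobs(jobs, "BackupVaultName") groups the same records by vault instead of by resource type.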

Configurable parameters for the test
Parameter Description

Test Period

How often should the test be executed.

Host

The host for which the test is to be configured.

Access Type

eG Enterprise monitors the AWS cloud using the AWS API. By default, the eG agent accesses the AWS API using a valid AWS account ID, which is assigned a special role that is specifically created for monitoring purposes. Accordingly, the Access Type parameter is set to Role by default. Furthermore, to enable the eG agent to use this default access approach, you will have to configure the eG tests with a valid AWS Account ID to Monitor and the special AWS Role Name you created for monitoring purposes.

Some AWS cloud environments, however, may not support the role-based approach. Instead, they may allow cloud API requests only if such requests are signed by a valid Access Key and Secret Key. When monitoring such a cloud environment, you should change the Access Type to Secret. Then, you should configure the eG tests with a valid AWS Access Key and AWS Secret Key.

Note that the Secret option may not be ideal when monitoring high-security cloud environments. This is because such environments may issue a security mandate that requires administrators to change the Access Key and Secret Key often. Because keys change so frequently in the key-based approach, Amazon recommends the Role-based approach for accessing the AWS API.

AWS Account ID to Monitor

This parameter appears only when the Access Type parameter is set to Role. Specify the AWS Account ID that the eG agent should use for connecting and making requests to the AWS API. To determine your AWS Account ID, follow the steps below:

  • Log in to the AWS management console with your credentials.

  • Click on your IAM user/role on the top right corner of the AWS Console. You will see a drop-down menu containing the Account ID (see Figure 1).

    Figure 1 : Identifying the AWS Account ID

AWS Role Name

This parameter appears when the Access Type parameter is set to Role. Specify the name of the role that you have specifically created on the AWS cloud for monitoring purposes. The eG agent uses this role and the configured Account ID to connect to the AWS Cloud and pull the required metrics. To know how to create such a role, refer to Creating a New Role.
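
In role-based access, a role is typically identified by an ARN that combines the account ID and the role name. The sketch below shows that combination, and validates that the account ID has the 12-digit form used by AWS. How the eG agent uses these two values internally is not described here, so treat this only as an illustration; the role name used in the usage note is hypothetical.

```python
import re


def build_role_arn(account_id: str, role_name: str, partition: str = "aws") -> str:
    """Combine a 12-digit AWS account ID and a role name into an IAM role ARN."""
    account_id = account_id.strip()
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account IDs are 12-digit numbers")
    role_name = role_name.strip()
    if not role_name:
        raise ValueError("role name must not be empty")
    return f"arn:{partition}:iam::{account_id}:role/{role_name}"
```

For example, build_role_arn("123456789012", "eg-monitoring-role") yields arn:aws:iam::123456789012:role/eg-monitoring-role, which is the form expected by the AWS STS AssumeRole operation.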

AWS Access Key, AWS Secret Key, Confirm AWS Access Key, Confirm AWS Secret Key

These parameters appear only when the Access Type parameter is set to Secret. To monitor an Amazon instance, the eG agent has to be configured with the access key and secret key of a user with a valid AWS account. For this purpose, we recommend that you create a special user on the AWS cloud, obtain the access and secret keys of this user, and configure this test with these keys. The procedure for this has been detailed in the Obtaining an Access key and Secret key topic. Make sure you reconfirm the access and secret keys you provide here by retyping them in the corresponding Confirm text boxes.

Proxy Host and Proxy Port

In some environments, all communication with the AWS cloud and its regions could be routed through a proxy server. In such environments, you should make sure that the eG agent connects to the cloud via the proxy server and collects metrics. To enable metrics collection via a proxy, specify the IP address of the proxy server and the port at which the server listens against the Proxy Host and Proxy Port parameters. By default, these parameters are set to none, indicating that the eG agent is not configured to communicate via a proxy.

Proxy User Name, Proxy Password, and Confirm Password

If the proxy server requires authentication, specify a valid proxy user name and password in the Proxy User Name and Proxy Password parameters, respectively. Then, confirm the password by retyping it in the Confirm Password text box. By default, these parameters are set to none, indicating that the proxy server does not require authentication.
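
The sketch below illustrates how the proxy parameters described above could be combined into a single proxy URL, with none treated as "not configured". The function name, the http scheme, and the URL form are assumptions for illustration; they do not describe how the eG agent itself builds its proxy connection, and NTLM settings are not covered.

```python
from urllib.parse import quote


def build_proxy_url(host="none", port="none", user="none", password="none"):
    """Return a proxy URL, or None when no proxy is configured ('none')."""
    if str(host).lower() == "none" or str(port).lower() == "none":
        return None  # default: communicate without a proxy
    credentials = ""
    if str(user).lower() != "none" and str(password).lower() != "none":
        # Percent-encode credentials so characters such as '@' stay unambiguous.
        credentials = f"{quote(user, safe='')}:{quote(password, safe='')}@"
    return f"http://{credentials}{host}:{int(port)}"
```

With the defaults, the function returns None. With a host and port only, it returns a URL without credentials.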

Proxy Domain and Proxy Workstation

If a Windows NTLM proxy is to be configured for use, you will additionally have to specify the Windows domain name and the Windows workstation name it requires against the Proxy Domain and Proxy Workstation parameters. If the environment does not support a Windows NTLM proxy, set these parameters to none.

Exclude Region

Here, you can provide a comma-separated list of region names or patterns of region names that you do not want to monitor. For instance, to exclude regions with names that contain 'east' and 'west' from monitoring, your specification should be: *east*,*west*
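
To see which regions a given specification would exclude, you can approximate the pattern matching with shell-style wildcards. The sketch below assumes that patterns behave like glob patterns and are matched case-insensitively; the exact matching rules used by the eG agent are not stated here, and the region list is only sample data.

```python
from fnmatch import fnmatchcase


def regions_to_monitor(all_regions, exclude_spec):
    """Drop every region whose name matches any comma-separated pattern."""
    patterns = [p.strip().lower() for p in exclude_spec.split(",") if p.strip()]
    return [
        region
        for region in all_regions
        if not any(fnmatchcase(region.lower(), pattern) for pattern in patterns)
    ]


regions = ["us-east-1", "us-west-2", "eu-west-1", "eu-central-1", "ap-southeast-1"]
remaining = regions_to_monitor(regions, "*east*,*west*")
```

Note that a pattern such as *east* also matches ap-southeast-1, because 'southeast' contains 'east'. In this sample, only eu-central-1 remains.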

Detailed Diagnosis

To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option.

The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

  • The eG manager license should allow the detailed diagnosis capability.
  • Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.

Measures reported by the test:

Measurement

Description

Measurement Unit

Interpretation

Backup jobs created

Indicates the number of backup jobs that AWS Backup created for this resource type/vault (depending upon the option chosen from the AWS Backup Filter Name drop-down).

Number

Use the detailed diagnosis of this measure to know which backup jobs have been created, and to track their status and progress.

Backup jobs about to run

Indicates the number of backup jobs that are about to run for this resource type/vault (depending upon the option chosen from the AWS Backup Filter Name drop-down).

Number

To view the details of the pending backup jobs, use the detailed diagnosis of this measure.

Backup jobs currently running

Indicates the number of backup jobs that are currently running for this resource type/vault (depending upon the option chosen from the AWS Backup Filter Name drop-down).

Number

Use the detailed diagnosis of this measure to identify the jobs that are currently running, and to continuously track their status and progress.
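
As one possible way of spotting slow jobs among those currently running, the sketch below flags running jobs whose expected completion time has already passed. It assumes job records shaped like ListBackupJobs entries (State, BackupJobId, PercentDone, ExpectedCompletionDate) with timezone-aware datetime values. The "overdue" criterion is an assumption for illustration, not the logic this test uses.

```python
from datetime import datetime, timezone


def overdue_running_jobs(jobs, now=None):
    """Return (job id, percent done) for RUNNING jobs past their expected completion."""
    now = now or datetime.now(timezone.utc)
    overdue = []
    for job in jobs:
        expected = job.get("ExpectedCompletionDate")
        if job.get("State") == "RUNNING" and expected is not None and expected < now:
            overdue.append((job.get("BackupJobId"), job.get("PercentDone")))
    return overdue
```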

Backup jobs cancelled by users

Indicates the number of backup jobs related to this resource type/vault (depending upon the option chosen from the AWS Backup Filter Name drop-down) that were cancelled by users.

Number

To know which jobs were cancelled / aborted by users, use the detailed diagnosis of this measure.

Backup jobs that finished

Indicates the number of backup jobs related to this resource type/vault (depending upon the option chosen from the AWS Backup Filter Name drop-down) that AWS Backup successfully finished.

Number

For the complete list of backup jobs that AWS Backup successfully finished, use the detailed diagnosis of this measure.

Backup jobs that failed

Indicates the number of backup jobs related to this resource type/vault (depending upon the option chosen from the AWS Backup Filter Name drop-down) that failed.

Number

Ideally, the value of this measure should be 0. A non-zero value is a cause for concern, as it implies that one or more backup jobs have failed. You can use the detailed diagnosis of this measure to identify the jobs that failed and troubleshoot the failure.

Some of the common causes for backup job failures are as follows:

  • Scheduling a backup job within 1 hour before or during the maintenance window or automated backup window of a database resource, or within 4 hours before or during that of an Amazon FSx file system, without using AWS Backup to perform continuous backups for point-in-time restores;

  • Creating backups for DynamoDB tables while the tables are being created;

  • Attempting to create a backup for an Amazon EFS file system while a previous one is still in progress;

  • Creating backup jobs for Amazon EBS after its soft quota of 100,000 backups per AWS Region per account is exceeded;

  • Not using AWS Backup to manage both Amazon RDS snapshots and continuous backups with point-in-time recovery;

  • Initiating a backup job from the Amazon RDS console, which can conflict with an Aurora cluster backup job;

  • Scheduling backup jobs for Amazon RDS without granting the KMS access that scheduled jobs require.
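
When matching failed jobs against the causes listed above, it helps to group failures by resource and read the status message recorded for each job. The sketch below assumes job records shaped like ListBackupJobs entries (State, ResourceArn, BackupJobId, StatusMessage); the function name is hypothetical.

```python
from collections import defaultdict


def failed_jobs_by_resource(jobs):
    """Map each resource ARN to the (job id, status message) of its FAILED jobs."""
    report = defaultdict(list)
    for job in jobs:
        if job.get("State") == "FAILED":
            report[job.get("ResourceArn", "unknown")].append(
                (job.get("BackupJobId", "unknown"),
                 job.get("StatusMessage", "no status message"))
            )
    return dict(report)
```

A resource that appears with several failed jobs and the same status message usually points to a configuration issue, such as a schedule that overlaps a maintenance window, rather than a one-off error.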

Backup jobs that expired

Indicates the number of backup jobs related to this resource type/vault (depending upon the option chosen from the AWS Backup Filter Name drop-down) that expired.

Number

When creating a backup plan in AWS Backup, you need to configure a backup 'lifecycle'. The lifecycle defines when a backup is transitioned to cold storage and when it expires. AWS Backup transitions and expires backups automatically according to the lifecycle that you define.

AWS Backup transitions data that is no longer referenced by warm backups from high-cost warm storage to low-cost cold storage. Backups that are transitioned to cold storage must be stored there for a minimum of 90 days. You can, however, specify how long, beyond the 90-day minimum, AWS Backup should retain your backups in cold storage. To save storage costs, AWS Backup automatically deletes your backups from cold storage at the end of this period. If, for some reason, AWS Backup fails to automatically delete a backup even after it has exceeded its retention period, the backup status changes to 'expired'.

Ideally, therefore, the value of this measure should be 0. A non-zero value implies that one or more backups in cold storage could not be auto-deleted by the AWS Backup lifecycle rule. In such situations, use the detailed diagnosis of this measure to know which recovery points / backups could not be deleted. The most common reason for 'expired' backups is the absence of the permissions necessary for backup deletion. During the lifecycle run, if AWS Backup finds that the original AWS Identity and Access Management (IAM) role that created the recovery point was deleted, or if that IAM role does not have permissions to delete the backup / recovery point, then AWS Backup will fail the deletion and move the backup to the 'expired' state. Expired backups can only be manually deleted using the AWS management console.
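
The 90-day cold storage minimum described above can be checked before a lifecycle is applied. The sketch below assumes a lifecycle dictionary that uses the AWS Backup field names MoveToColdStorageAfterDays and DeleteAfterDays; the function name and the message text are hypothetical.

```python
def validate_lifecycle(lifecycle):
    """Return a list of problems found in a backup rule lifecycle (empty if none)."""
    problems = []
    move_after = lifecycle.get("MoveToColdStorageAfterDays")
    delete_after = lifecycle.get("DeleteAfterDays")
    # A backup moved to cold storage must stay there for at least 90 days,
    # so deletion cannot be scheduled earlier than the transition plus 90 days.
    if move_after is not None and delete_after is not None:
        if delete_after < move_after + 90:
            problems.append(
                f"DeleteAfterDays ({delete_after}) must be at least "
                f"{move_after + 90} when MoveToColdStorageAfterDays is {move_after}"
            )
    return problems
```

For instance, a lifecycle that moves backups to cold storage after 30 days must not delete them before day 120.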

Copy jobs that AWS backup created

Indicates the number of cross-account and cross-Region copy jobs that AWS Backup created for this resource type/vault (depending upon the option chosen from the AWS Backup Filter Name drop-down).

Number

Use the detailed diagnosis of this measure to know which copy jobs have been created, and to track their status and progress.

Copy jobs currently running

Indicates the number of cross-account and cross-Region copy jobs currently running in AWS Backup for this resource type/vault (depending upon the option chosen from the AWS Backup Filter Name drop-down).

Number

Use the detailed diagnosis of this measure to know which copy jobs are currently running, and to track their status and progress.

Copy jobs that finished

Indicates the number of cross-account and cross-Region copy jobs for this resource type/vault (depending upon the option chosen from the AWS Backup Filter Name drop-down) that AWS Backup successfully finished.

Number

For the complete list of copy jobs that AWS Backup successfully finished, use the detailed diagnosis of this measure.

Copy jobs that failed

Indicates the number of cross-account and cross-Region copy jobs for this resource type/vault (depending upon the option chosen from the AWS Backup Filter Name drop-down) that failed.

Number

Ideally, the value of this measure should be 0. A non-zero value is a cause for concern, as it implies that one or more copy jobs have failed. You can use the detailed diagnosis of this measure to identify the jobs that failed and troubleshoot the failure.

To troubleshoot a failing cross-account copy, verify the following configurations:

  • Verify that your source and destination accounts belong to the same AWS Organization.

  • Verify that the resource type supports cross-account copying in the specified AWS Regions.

  • Verify the encryption criteria for your source account backup.

  • Verify that the source AWS Key Management Service (AWS KMS) key policy allows the destination account.

  • Verify that the destination vault access policy allows the source account.

  • Verify the AWS Organization tag policy configuration.
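
One item in the checklist above, namely whether the destination vault access policy allows the source account, can be roughly checked offline against the policy JSON. The sketch below looks for an Allow statement that grants backup:CopyIntoBackupVault to the source account. It is a simplified illustration under stated assumptions: it ignores Deny statements and Condition blocks (for example, organization-based conditions), and it handles only the aws partition, so a real policy evaluation may differ.

```python
import json

COPY_ACTION = "backup:copyintobackupvault"


def _as_list(value):
    """Policy fields may hold a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def policy_allows_copy_from(policy_json, source_account_id):
    """Simplified check: does any Allow statement let the account copy into the vault?"""
    policy = json.loads(policy_json)
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        actions = [a.lower() for a in _as_list(statement.get("Action"))]
        if not any(a in ("*", "backup:*", COPY_ACTION) for a in actions):
            continue
        principal = statement.get("Principal")
        if principal == "*":
            return True  # Condition blocks are NOT evaluated in this sketch
        aws_principals = _as_list(principal.get("AWS")) if isinstance(principal, dict) else []
        root_arn = f"arn:aws:iam::{source_account_id}:root"
        if any(p in ("*", source_account_id, root_arn) for p in aws_principals):
            return True
    return False
```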

Restore jobs that are about to run

Indicates the number of restore jobs about to run in AWS Backup for this resource type/vault (depending upon the option chosen from the AWS Backup Filter Name drop-down).

Number

Use the detailed diagnosis of this measure to identify the pending restore jobs, and track their status and progress.

Restore jobs currently running

Indicates the number of restore jobs currently running in AWS Backup for this resource type/vault (depending upon the option chosen from the AWS Backup Filter Name drop-down).

Number

Use the detailed diagnosis of this measure to know which restore jobs are currently running, and to track their status and progress.

Restore jobs that finished

Indicates the number of restore jobs for this resource type/vault (depending upon the option chosen from the AWS Backup Filter Name drop-down) that AWS Backup successfully finished.

Number

For the complete list of restore jobs that AWS Backup successfully finished, use the detailed diagnosis of this measure.

Restore jobs that failed

Indicates the number of restore jobs for this resource type/vault (depending upon the option chosen from the AWS Backup Filter Name drop-down) that failed.

Number

Ideally, the value of this measure should be 0. A non-zero value is a cause for concern, as it implies that one or more restore jobs have failed. You can use the detailed diagnosis of this measure to identify the jobs that failed and troubleshoot the failure.

Recovery points created

Indicates the number of recovery points created for this resource type / in this vault (depending upon the option chosen from the AWS Backup Filter Name drop-down).

Number

For the complete list of recovery points that are fully created, use the detailed diagnosis of this measure.

Partial recovery points

Indicates the number of partial recovery points created for this resource type / in this vault (depending upon the option chosen from the AWS Backup Filter Name drop-down).

Number

A 'Partial' recovery point is created if AWS Backup is unable to fully create the recovery point before the backup window closes. Ideally, there should not be any partial recovery points in a vault. This means that 0 is the desired value for this measure. If the measure reports a non-zero value, it indicates that one or more partial recovery points exist. In such a case, you can use the detailed diagnosis of this measure to identify these recovery points and delete them.

It is good practice to avoid the creation of partial recovery points. This can be achieved by increasing the backup plan window. You can use either the API or the console to update your backup plan.
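
If you update the plan through the API, the change amounts to raising the completion window of the affected rule. The sketch below works on a backup plan document that uses the AWS Backup field names Rules, RuleName, and CompletionWindowMinutes, and returns a modified copy. The function name and the values in the usage note are hypothetical, and the sketch itself makes no AWS calls.

```python
import copy


def widen_completion_window(plan, rule_name, new_minutes):
    """Return a copy of the plan whose named rule has a completion window of at least new_minutes."""
    updated = copy.deepcopy(plan)  # leave the caller's plan untouched
    for rule in updated.get("Rules", []):
        if rule.get("RuleName") == rule_name:
            current = rule.get("CompletionWindowMinutes", 0)
            rule["CompletionWindowMinutes"] = max(current, new_minutes)  # never shrink
            return updated
    raise KeyError(f"no rule named {rule_name!r} in backup plan")
```

For example, widen_completion_window(plan, "daily-rule", 480) returns a plan in which the rule named daily-rule has at least 480 minutes to complete. The returned document can then be submitted as the BackupPlan argument of the UpdateBackupPlan API.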

Recovery points that expired

Indicates the number of recovery points created for this resource type / in this vault (depending upon the option chosen from the AWS Backup Filter Name drop-down) that expired.

Number

When creating a backup plan in AWS Backup, you need to configure a backup 'lifecycle'. The lifecycle defines when a backup / recovery point is transitioned to cold storage and when it expires. AWS Backup transitions and expires backups / recovery points automatically according to the lifecycle that you define.

AWS Backup transitions data that is no longer referenced by warm backups from high-cost warm storage to low-cost cold storage. Backups / recovery points that are transitioned to cold storage must be stored there for a minimum of 90 days. You can, however, specify how much longer, beyond the 90-day minimum, AWS Backup should retain your recovery points in cold storage. To save storage costs, AWS Backup automatically deletes your recovery points from cold storage at the end of this period. If, for some reason, AWS Backup fails to automatically delete a recovery point even after it has exceeded its retention period, the recovery point status changes to 'expired'.

Ideally, therefore, the value of this measure should be 0. A non-zero value implies that one or more recovery points in cold storage could not be auto-deleted by the AWS Backup lifecycle rule. In such situations, use the detailed diagnosis of this measure to know which recovery points could not be deleted. The most common reason for 'expired' recovery points is the absence of the permissions necessary for deletion. During the lifecycle run, if AWS Backup finds that the original AWS Identity and Access Management (IAM) role that created the recovery point was deleted, or if that IAM role does not have permissions to delete the backup / recovery point, then AWS Backup will fail the deletion and move the recovery point to the 'expired' state. Expired recovery points can only be manually deleted using the AWS management console.

Recovery points that AWS backup is deleting

Indicates the number of recovery points created for this resource type / in this vault (depending upon the option chosen from the AWS Backup Filter Name drop-down) that AWS Backup is currently deleting.

Number

Use the detailed diagnosis of this measure to know which recovery points are being deleted.
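
The recovery point measures described so far correspond to the status of each recovery point. The sketch below counts recovery points per vault by status, assuming records shaped like the entries returned by the ListRecoveryPointsByBackupVault API (fields Status and BackupVaultName). The mapping of statuses to measure names is an assumption for illustration.

```python
from collections import Counter, defaultdict

# Assumed mapping from recovery point status to the measures of this test.
STATUS_TO_MEASURE = {
    "COMPLETED": "Recovery points created",
    "PARTIAL": "Partial recovery points",
    "EXPIRED": "Recovery points that expired",
    "DELETING": "Recovery points that AWS backup is deleting",
}


def tally_recovery_points(recovery_points):
    """Count recovery points per vault and per measure."""
    results = defaultdict(Counter)
    for point in recovery_points:
        measure = STATUS_TO_MEASURE.get(point.get("Status"))
        if measure is not None:
            results[point.get("BackupVaultName", "unknown")][measure] += 1
    return {vault: dict(counts) for vault, counts in results.items()}
```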

Recovery points that AWS backup tiered to cold storage

Indicates the number of recovery points created for this resource type / in this vault (depending upon the option chosen from the AWS Backup Filter Name drop-down) that AWS Backup transitioned to cold storage.

Number

When creating a backup plan in AWS Backup, you need to configure a backup 'lifecycle'. The lifecycle defines when a backup / recovery point is transitioned to cold storage and when it expires.

AWS Backup transitions data that is no longer referenced by warm backups from high-cost warm storage to low-cost cold storage. This is done to save storage costs.

If this measure reports a non-zero value, it means that one or more recovery points have become obsolete, as they are no longer referenced. Use the detailed diagnosis of this measure to identify the stale recovery points.