AWS DynamoDB Replication Region Availability Test

A global table is a collection of one or more replica tables, all owned by a single AWS account. A replica table (or replica, for short) is a single DynamoDB table that functions as part of a global table. Each replica stores the same set of data items. A global table can have only one replica table per AWS region.

If an AWS region becomes unavailable, the replica table in that region also becomes unavailable to the application that performs reads and writes on it. The application can then redirect to a different region and perform reads and writes against a different replica table. You can apply custom business logic to determine when to redirect requests to other regions.

If a region becomes isolated or degraded, DynamoDB keeps track of any writes that have been performed but have not yet been propagated to all of the replica tables. When the region comes back online, DynamoDB resumes propagating any pending writes from that region to the replica tables in other regions. It also resumes propagating writes from other replica tables to the region that is now back online. However, prolonged unavailability or high response times of AWS regions can cause replication latency which, if left unattended, can lead to throttling and loss of critical data.
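The redirect behavior described above can be illustrated with a minimal sketch. It assumes the application wraps each read or write in a callable that takes a region name and raises an exception when the replica in that region cannot be reached. The names run_with_failover, AllRegionsUnavailable and fake_read are hypothetical and are not part of DynamoDB or of eG Enterprise; the "redirect on any error" policy is only one possible piece of custom business logic.

```python
from typing import Callable, Dict, Iterable, Tuple, TypeVar

T = TypeVar("T")


class AllRegionsUnavailable(Exception):
    """Raised when the operation failed against every candidate region."""


def run_with_failover(
    regions: Iterable[str],
    operation: Callable[[str], T],
) -> Tuple[str, T]:
    """Try `operation` against each region in order and return the first success.

    `operation` receives a region name and is expected to raise an exception
    if the replica table in that region cannot be used.
    """
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return region, operation(region)
        except Exception as exc:  # assumed policy: any error triggers a redirect
            errors[region] = exc
    raise AllRegionsUnavailable(errors)


# Stand-in for a real read; an actual application might wrap an AWS SDK call here.
def fake_read(region: str) -> dict:
    if region == "us-east-1":
        raise ConnectionError("region unreachable")
    return {"pk": "order#1", "served_by": region}


region, item = run_with_failover(["us-east-1", "eu-west-1"], fake_read)
print(region, item["served_by"])
```

Here the first region fails, so the read is served by the second region in the list. A production policy would typically be more selective about which errors justify a redirect.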

This test monitors each AWS region to which the DynamoDB tables are replicated and reports the availability and responsiveness of that region. The test also reports the connection time and query execution time for each replication region. This way, administrators are promptly alerted to any region unavailability and can act before applications are affected.

Target of the test : An AWS DynamoDB server

Agent deploying the test : A remote agent

Outputs of the test : One set of results for each AWS region to which the monitored DynamoDB tables are replicated.

Configurable parameters for the test
Parameter

Description

Test Period

How often the test should be executed.

Host

The IP address of the AWS DynamoDB server that is being monitored.

AWS Region

This test uses the AWS SDK to interact with AWS DynamoDB and pull relevant metrics. To enable the test to connect to AWS, you need to configure the test with the name of the region to which all requests for metrics should be routed by default. Specify the name of this AWS region in this text box.

AWS Access Key ID, AWS Secret Access Key and Confirm Password

To monitor AWS DynamoDB, the eG agent has to be configured with the access key and secret key of a user with a valid AWS account. For this purpose, we recommend that you create a special user on the AWS cloud, obtain the access and secret keys of this user, and configure this test with these keys. The procedure for this is detailed in the Obtaining an Access key and Secret key topic. Make sure you confirm the access and secret keys you provide here by retyping them in the corresponding Confirm Password text box.

Timeout Seconds

Specify the maximum duration (in seconds) for which the test will wait for a response from the server. The default is 120 seconds.

Detailed Diagnosis

To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, choose the Off option.

The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

  • The eG manager license should allow the detailed diagnosis capability.
  • Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.

Measurements made by the test

Measurement

Description

Measurement Unit

Interpretation

Region availability

Indicates whether this region is available or not.

Percent

The availability is 100% when the region responds to a request and 0% when it does not. This measure reports that the region is unavailable if either the connection to the region is unavailable or a query to the region fails. In this case, you can check the values of the Region connection availability and Query processor availability measures to find out exactly why the region is not responding to requests: is it a connection unavailability, or is it a query failure?

The detailed diagnosis of this measure shows the exact error message received while connecting to the region, in the Details of connection availability field.

Region response time

Indicates the time taken by this region to respond to a user query. This includes both connection time and query execution time.

Seconds

A sudden increase in response time is indicative of a bottleneck with the target region.

Region connection availability

Indicates whether the connection to this region is available or not.

Percent

If this measure reports the value 100, it indicates that the connection to this region is available. The value 0, on the other hand, indicates that the connection to this region is unavailable. If the Region availability measure reports the value 0, you can check the value of this measure to determine whether or not the cause is the unavailability of a connection to the region.

Connection time to replication region

Indicates the time taken to connect to this region.

Seconds

A high value could indicate a connection bottleneck. Whenever the Region response time measure soars, you may want to check the value of this measure to determine whether connection latency is causing the poor responsiveness of the region.

Query processor availability

Indicates whether the query processor is available or not.

Percent

If this measure reports the value 100, it indicates that the query processor is available and the query executed successfully. The value 0, on the other hand, indicates that the query processor is unavailable and the query failed. If the Region availability measure reports the value 0, check the value of this measure to figure out whether the failed query is the reason why that measure reported the region as unavailable.

Query execution time on replication region

Indicates the time taken to execute a database query on this region.

Seconds

A high value could indicate that one or more queries to the region are taking too long to execute. Inefficient or badly designed queries often run for long periods. If the value of this measure is higher than that of the Connection time to replication region measure, long-running queries are the likely cause of the region's poor responsiveness.

Records fetched

Indicates the number of records fetched from this region.

Number

The value 0 indicates that no records were fetched from the region.
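How the measures above relate to one another can be sketched as follows. This is not the eG agent's implementation; it is a minimal illustration that assumes a probe made of two injected callables, connect and query, which raise an exception on failure. The names probe_region and RegionResult are hypothetical, and reporting the query processor as unavailable when the connection itself fails is an assumption of this sketch.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence


@dataclass
class RegionResult:
    region_availability: int          # Percent: 100 or 0
    connection_availability: int      # Percent: 100 or 0
    query_availability: int           # Percent: 100 or 0
    connection_time: Optional[float]  # Seconds; None if the connection failed
    query_time: Optional[float]       # Seconds; None if no query completed
    response_time: Optional[float]    # Seconds; connection time + query time
    records_fetched: int              # Number
    error: Optional[str]              # Message captured for diagnosis


def probe_region(
    connect: Callable[[], Any],
    query: Callable[[Any], Sequence[Any]],
    clock: Callable[[], float] = time.monotonic,
) -> RegionResult:
    """Connect to one region, run one query, and derive the measures."""
    start = clock()
    try:
        conn = connect()
    except Exception as exc:
        # Connection failed: the region is unavailable. Assumption: the
        # query processor is also reported as unavailable in this case.
        return RegionResult(0, 0, 0, None, None, None, 0, str(exc))
    connected = clock()
    conn_time = connected - start

    try:
        records = query(conn)
    except Exception as exc:
        # Connection succeeded but the query failed: the region is still
        # reported as unavailable.
        return RegionResult(0, 100, 0, conn_time, None, None, 0, str(exc))
    query_time = clock() - connected

    return RegionResult(
        100, 100, 100,
        conn_time, query_time, conn_time + query_time,
        len(records), None,
    )


# Usage with stand-in callables; real ones might wrap AWS SDK calls.
result = probe_region(lambda: "connection", lambda conn: ["item-1", "item-2"])
print(result.region_availability, result.records_fetched)
```

Region availability is 100 only when both the connection and the query succeed, which is why a value of 0 is followed up by checking the two component availability measures.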