ApsaraDB SQL Engine Monitoring - RDS Test

ApsaraDB for RDS is a stable, reliable, and scalable online database service. It is designed based on the Apsara Distributed File System and high-performance SSD storage media of Alibaba Cloud. It supports five database engines: MySQL, SQL Server, PostgreSQL, PPAS (compatible with Oracle), and MariaDB.

SQL Server supports the high-availability architecture and provides the capability to restore data to a specific point in time. This allows SQL Server to run on various enterprise applications. In addition, SQL Server is provided with a Microsoft-issued license. This relieves the need to purchase a license. ApsaraDB RDS for SQL Server also provides the advanced features and functions such as ApsaraDB for MyBase dedicated cluster, Read-only RDS instances to which read requests can be offloaded in times of an overload, Read/write splitting, and disk encryption.

To ensure a satisfactory user experience with a SQL server instance, the administrator should first measure how quickly the instance processes queries. This will reveal query processing bottlenecks, if any. Since poor buffer usage is a common cause for slow execution of queries, the administrator should then proceed to monitor buffer usage. While at it, if the administrator finds that the SQL buffers are not servicing too many queries, then he/she should investigate the reason for the same and fix it, so that queries execute faster.

In addition, the administrator should also ascertain whether/not the server instance is capable of bouncing back from a memory failure, with no data loss. For this, the administrator should track checkpoint writes to disk, and isolate slowness (if any) in the process. The locking activity should also be monitored, so that unhealthy locking behavior - eg., deadlock conditions and lock waits - are brought to administrator attention quickly. It is also important that administrators are forewarned of a storage space contention on an instance, so that they can see how they can make more space on that instance before it is too late!

To monitor all the aforesaid performance aspects of a SQL server instance and capture anomalies (if any) on-the-fly, the administrator can use the ApsaraDB SQL Engine Monitoring - RDS test.

For each SQL server instance that is configured using ApsaraDB for RDS, this test reports the rate at which queries and transactions are processed by that instance, thus shedding light on real/potential bottlenecks in query/transaction processing. If query processing is slow, then the administrator can use the buffer usage metric reported by the test to determine if poor buffer cache usage is the reason for query performance to suffer. Using the test, administrator can also check if each instance is keeping full table scans at a minimum. To reveal how resilient a SQL server instance is to power failures, the test also measures the speed with which checkpoints are written to disk by every instance; if the writes are performed slowly, the likelihood of data loss at the time of a power failure becomes high. The test also monitors the locking activity on each instance, and alerts administrators to frequent deadlock conditions and lock waits. Additionally, the test monitors the space usage at the instance level, and points administrators to those instances that are running out of space.

Target of the test : An Alibaba Cloud Account

Agent deploying the test : A remote agent

Outputs of the test : One set of results for every RDS instance for SQL Server

Configurable parameters for the test
Parameters	Description
Test period	How often should the test be executed
Host	The host for which the test is to be configured.
Alibaba Access Key and Alibaba Secret Key	This test makes REST API requests to the Alibaba cloud to pull the metrics. For this purpose, the test needs to be configured with an AccessKey pair. An AccessKey pair is typically used to call an operation of an Alibaba Cloud service. It is also used to initiate an API request or use a cloud service SDK to manager cloud resources. An AccessKey pair is characterized by an AccessKey ID and an AccessKey Secret. The AccessKey ID is used to identify a user/cloud account. The AccessKey Secret is used to verify a user/cloud account. The first step to configuring the eG agent with an AccessKey pair is to create an AccessKey pair for the target cloud acount. To achieve this, follow the steps below: Log on to the RAM console by using an Alibaba Cloud account. In the left-side navigation pane, click Users under Identities. On the Users page, click the username of the RAM user for which you want to create an AccessKey pair in the User Logon Name/Display Name column. On the page that appears, click Create AccessKey in the User AccessKeys section. Note: You must enter a verification code if you create an AccessKey pair for the first time. Click Close. Note: The AccessKey secret is displayed only when you create an AccessKey pair. If the AccessKey pair is leaked or lost, you must create a new one. You can create a maximum of two AccessKey pairs. Make note of the AccessKey ID and AccessKey secret, once they are displayed. Then, configure the Alibaba Access Key parameter of the test with the AccessKey ID, and the Alibaba Secret Key parameter with the AccessKey Secret you made note of. If you failed to make note of the AccessKey ID and AccessKey Secret at the time of creating the AccessKey pair, then you can obtain the same at a later point in time. Similarly, if an AccessKey pair pre-exists for the target cloud account, then you do not have to create another one. Instead, you can obtain the AccessKey ID and AccessKey Secret of the existing AccessKey pair and configure the eG agent with the same. For this, follow the steps below: Use an Alibaba Cloud account to log on to the Alibaba Cloud Management console. Move the pointer over the profile picture in the upper-right corner, and click AccessKey. In the Security Tips message that appears, click Continue to manage AccessKey. AccessKey ID and AccessKey Secret are displayed. Make note of the displayed ID and secret. Then, configure the Alibaba Access Key parameter of the test with the AccessKey ID, and the Alibaba Secret Key parameter with the AccessKey Secret you made note of.
Detailed Diagnosis	To make diagnosis more efficient and accurate, the eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option. The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled: The eG manager license should allow the detailed diagnosis capability Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.

Measurements made by the test
Measurement	Description	Measurement Unit	Interpretation
Average query rate	Indicates the rate at which this instance processes queries.	Queries/Sec	A high value is desired for this measure. A low value signifies slowness in query processing. Compare the value of this measure across RDS instances to know which instance is processing queries slowly.
Average transactions rate	Indicates the rate at which transactions are processed by this instance.	Transactions/Sec	A high value is desired for this measure. A low value signifies slowness in transaction processing. Compare the value of this measure across RDS instances to know which instance is processing transactions slowly.
Buffer hit rate	Indicates the percentage of read requests to this instance that were serviced by the buffer cache.	Percent	Ideally, the value of this measure should be over 80%. A low value is indicative of poor cache usage. This in turn can cause queries to be served by directly accessing the disk, thereby degrading query performance and overall database health. To avoid this, the buffer cache should be well-sized, so that it can store many objects in memory. This will reduce the likelihood of cache misses, thereby improving query performance significantly.
Full table scan rate	Indicates the rate at which this instance performs full table scans.	Scans/Sec	When figuring out how best to access data within SQL Server, the query optimizer takes the query that you submit to it and performs some analysis of the objects involved. It tries to determine what indexes (just like the numbering of pages in a book) are available and if they are useful or not to your query, it tries to determine if the criteria in the where clause of the query allows it to use any of the indexes in an efficient manner. When there is no other efficient way to access data within the tables involved in the query, the query processor will process all of the rows in the table one at a time to give you the results. This is referred to as a Full Table Scan. This is the equivalent of having a book that doesn’t have any page numbers, but being told that somewhere in the book is some piece of information that you need to know. So, the only way to access this information is to look at every page one at a time searching for the information you need. Evidently, such an operation is both time-consuming and resource-intensive, and will adversely impact the performance of the SQL server! This is why, full table scans have to be avoided. Therefore, ideally, the value of this measure should be very low.
SQL compilation rate	Indicates the rate at which SQL queries are compiled by this instance.	Compilations/Sec
Checkpoints written rate	Indicates the rate at which checkpoints are written to disk by this instance.	Writes/Sec	A checkpoint creates a known good point from which the SQL Server Database Engine can start applying changes contained in the log during recovery after an unexpected shutdown or crash. For performance reasons, the Database Engine performs modifications to database pages in memory-in the buffer cache-and does not write these pages to disk after every change. Rather, the Database Engine periodically issues a checkpoint on each database. A checkpoint writes the current in-memory modified pages (known as dirty pages) and transaction log information from memory to disk and, also records the information in the transaction log. If the checkpoint writes modified pages to disk very slowly, then at any given point in time, the disk may not be fully updated with recent changes to data. If at this juncture, a power failure occurs wiping out the pages in memory, the changes will be lost. To avoid this, the checkpoint should write pages to disk at a steady pace. Ideally therefore, the value of this measure should be high.
Logins rate	Indicates the rate of logins to this SQL server instance.	Logins/Sec
Lock rate	Indicates the number of locks held by this instance that timed out each second.	Locks/Sec	A lock wait timeout results when one user gets a lock on some data and holds it while another user tries to access it. If the first user doesn't unlock the data, the second one will time out after a while. The database will respond to the second user with an error message saying their lock wait was too long.
Deadlock rate	Indicates the rate at which deadlocks occurred on this instance.	Locks/Sec	A deadlock happens when multiple lock waits happen in such a manner that none of the users can do any further work. For example, the first user and second user both lock some data. Then each of them tries to access each other’s locked data. There’s a cycle in the locking: user A is waiting on B, and B is waiting on A. Neither can proceed. Deadlocks cause wasteful work delay, and are hence considered to be 'unproductive'. This is why, the value of this measure should be 0 or very low.
Lock waits rate	Indicates the rate of locks waits in this instance.	Locks/Sec	Lock wait event occurs when a user requests for a resource that is already locked by another user, forcing the former to wait until the latter releases the lock. Lock wait events on a database need to be minimal. If a lock is held on a resource for too long a time, all other requests will be denied access to that resource, thereby causing critical operations to fail. Moreover, if the number of lock waits grows over time, it will consequently increase the length of the pending requests queue; a long request queue may not only cause the unnecessary erosion of valuable server resources, it may also choke the server instance, thereby significantly impacting the quality of the user experience with that instance. This is why, the value of this measure should be 0 or very low for an instance.
Space usage	Indicates the amount of database space that is used by this instance.	MB	Compare the value of this measure across SQL server instances to identify the instance that is consuming space excessively.