Oracle RAC MTTR Test

Instance recovery, which is the process of recovering the redo thread from the failed instance, is a critical component affecting availability. When using Oracle RAC, the SMON process in one surviving instance performs instance recovery of the failed instance. The sooner this happens and lesser the I/O that is consumed during recovery, the better will be the user experience with the Oracle RAC.

Mean time to recovery (MTTR) is the average time that the Oracle server will take to recover from any failure. In order to limit recovery I/O and optimize cluster performance, you need to understand the MTTR target your system is currently achieving and what your potential MTTR target could be, given the I/O capacity. This test provides you with this understanding by reporting the target and estimated MTTR, and by monitoring the key factors affecting MTTR such as the redo log size and the number of redo blocks to be processed.

This test is disabled by default. To enable the test, go to the enable / disable tests page using the menu sequence : Agents -> Tests -> Enable/Disable, pick the Oracle RAC as desired Component type, set Performance as the Test type, choose the test from the disabled tests list, and click on the >> button to move the test to the ENABLED TESTS list. Finally, click the Update button.

Target of the test : Oracle RAC

Agent deploying the test : An internal/remote agent

Outputs of the test : One set of results for every instance in the monitored Oracle RAC.

Configurable parameters for the test
  1. TEST PERIOD - How often should the test be executed.
  2. Host – The host for which the test is to be configured.
  3. Port - The port on which the server is listening.
  4. orasid - The variable name of the oracle instance.
  5. service name - A ServiceName exists for the entire Oracle RAC system. When clients connect to an Oracle cluster using the ServiceName, then the cluster routes the request to any available database instance in the cluster. By default, the service name is set to none. In this case, the test connects to the cluster using the orasid and pulls out the metrics from that database instance which corresponds to that orasid. If a valid service name is specified instead, then, the test will connect to the cluster using that service name, and will be able to pull out metrics from any available database instance in the cluster.

    To know the ServiceName of a cluster, execute the following query on any node in the target cluster:

    select name, value from v$parameter where name =’service_names’

  6. User – In order to monitor an Oracle database server, a special database user account has to be created in every Oracle database instance that requires monitoring. A Click here hyperlink is available in the test configuration page, using which a new oracle database user can be created. Alternatively, you can manually create the special database user. When doing so, ensure that this user is vested with the select_catalog_role and create session privileges.

    The sample script we recommend for user creation (in Oracle database server versions before 12c) for eG monitoring is:

    create user oraeg identified by oraeg ;

    create role oratest;

    grant create session to oratest;

    grant select_catalog_role to oratest;

    grant oratest to oraeg;

    The sample script we recommend for user creation (in Oracle database server 12c) for eG monitoring is:

    alter session set container=<Oracle_service_name>;

    create user <user_name>identified by <user_password> container=current default tablespace <name_of_default_tablespace> temporary tablespace <name_of_temporary_tablespace>;

    Grant create session to <user_name>;                                 

    Grant select_catalog_role to <user_name>;

    The name of this user has to be specified here.

  7. Password – Password of the specified database user
  8. Confirm password – Confirm the password by retyping it here.
  9. ISPASSIVE – If the value chosen is yes, then the Oracle server under consideration is a passive server in an Oracle cluster. No alerts will be generated if the server is not running. Measures will be reported as “Not applicable’ by the agent if the server is not up.
Measurements made by the test
Measurement Description Measurement Unit Interpretation

Target MTTR:

Indicates the effective mean time to recover (MTTR) this instance.

Secs

Usually, the value of this measure should be equal to the value of the FAST_START_MTTR_TARGET initialization parameter. FAST_START_MTTR_TARGET specifies a target for the expected mean time to recover (MTTR), that is, the time (in seconds) that it should take to start up the instance and perform cache recovery.

After FAST_START_MTTR_TARGET is set, the database manages incremental checkpoint writes in an attempt to meet that target.

If FAST_START_MTTR_TARGET is set to such a small value that it is impossible to do a recovery within its time frame, then the the value of this measure will be larger than FAST_START_MTTR_TARGET. If FAST_START_MTTR_TARGET is set to such a high value that even in the worst-case (the whole buffer cache is dirty) recovery would not take that long, then the value of this measure will be the same as the Estimated MTTR.

If FAST_START_MTTR_TARGET is not specified, then again, the value of this measure will be the same as the value of the Estimated MTTR measure.

Estimated MTTR:

Indicates the current estimated mean time to recover (MTTR).

Secs

This measure is calculated based on the number of dirty buffers and log blocks (0 if FAST_START_MTTR_TARGET is not specified). Basically, this value tells you how long you could expect recovery of the instance to take place based on the work your system is doing at the time of testing.

This measure reports the estimated mean time to recovery based on the current state of the running database. If the database has just opened, the system may contain only a few dirty buffers, so not much cache recovery would be required if the instance failed at this moment. That is why the value of this measure can, for the moment, be lower than the minimum possible Target MTTR.

Recovery Estimated IOs:

Indicates the estimated number of dirty buffers in the buffer cache of this instance.

Number

 

Target Redo blocks:

Indicates the target number of redo blocks that must be processed while recovering this instance.

Number

Instance recovery is nothing more than using the contents of the online log files to rebuild the database buffer cache to the state it was in before the crash. This will replay all changes extracted from the redo logs that refer to blocks that had not been written to disk at the time of the crash. Though instance recovery guarantees no corruption, it may take a considerable time to do its roll forward before the database can be opened. This time is dependent on two factors: how much redo has to be read and how many read/write operations will be needed on the datafiles as the redo is applied. The values of these measures serve as good indicators of the amount of redo reading work that needs to be performed as part of the recovery process, and are hence useful while determining the MTTR.  

Actual Redo blocks:

Indicates the actual number of redo blocks that are required by this Oracle instance to recover.

Number

Writes logfile size:

Indicates the number of writes driven by the smallest redo log file size for each oracle instance.

Number

This measure is used to drive the checkpoint process, if your redo log file size is under sized. Since the FAST_START_MTTR_TARGET parameter is set to limit the instance recovery time, Oracle automatically tries to checkpoint as frequently as necessary. Under such a condition, the size of the log files should be large enough to avoid additional checkpoint due to under sized log files.

Writes auto tune:

Indicates the number of writes due to auto-tune checkpointing.

 

Number

The checkpoint auto-tuning mechanism inspects statistics on machine utilization, such as the rate of disk I/O and CPU usage, and if it appears that there is spare capacity, it will use this capacity to write out additional dirty buffers from the database buffer cache, thus pushing the checkpoint position forward. The result is that even if the FAST_START_MTTR_TARGET parameter is set to a high value (the highest possible is 3600 seconds—anything above that will be rounded down), actual recovery time may well be much less.

Enabling checkpoint auto-tuning with a high target should result in your instance always having the fastest possible recovery time that is consistent with maximum performance.