Oracle RAC MTTR Test
Instance recovery, which is the process of recovering the redo thread from the failed instance, is a critical component affecting availability. When using Oracle RAC, the SMON process in one surviving instance performs instance recovery of the failed instance. The sooner this happens and lesser the I/O that is consumed during recovery, the better will be the user experience with the Oracle RAC.
Mean time to recovery (MTTR) is the average time that the Oracle server will take to recover from any failure. In order to limit recovery I/O and optimize cluster performance, you need to understand the MTTR target your system is currently achieving and what your potential MTTR target could be, given the I/O capacity. This test provides you with this understanding by reporting the target and estimated MTTR, and by monitoring the key factors affecting MTTR such as the redo log size and the number of redo blocks to be processed.
This test is disabled by default. To enable the test, go to the enable / disable tests page using the menu sequence : Agents -> Tests -> Enable/Disable, pick the Oracle RAC as desired Component type, set Performance as the Test type, choose the test from the disabled tests list, and click on the >> button to move the test to the ENABLED TESTS list. Finally, click the Update button.
Target of the test : Oracle RAC
Agent deploying the test : An internal/remote agent
Outputs of the test : One set of results for every instance in the monitored Oracle RAC.
|
Measurement | Description | Measurement Unit | Interpretation |
---|---|---|---|
Target MTTR: |
Indicates the effective mean time to recover (MTTR) this instance. |
Secs |
Usually, the value of this measure should be equal to the value of the FAST_START_MTTR_TARGET initialization parameter. FAST_START_MTTR_TARGET specifies a target for the expected mean time to recover (MTTR), that is, the time (in seconds) that it should take to start up the instance and perform cache recovery. After FAST_START_MTTR_TARGET is set, the database manages incremental checkpoint writes in an attempt to meet that target. If FAST_START_MTTR_TARGET is set to such a small value that it is impossible to do a recovery within its time frame, then the the value of this measure will be larger than FAST_START_MTTR_TARGET. If FAST_START_MTTR_TARGET is set to such a high value that even in the worst-case (the whole buffer cache is dirty) recovery would not take that long, then the value of this measure will be the same as the Estimated MTTR. If FAST_START_MTTR_TARGET is not specified, then again, the value of this measure will be the same as the value of the Estimated MTTR measure. |
Estimated MTTR: |
Indicates the current estimated mean time to recover (MTTR). |
Secs |
This measure is calculated based on the number of dirty buffers and log blocks (0 if FAST_START_MTTR_TARGET is not specified). Basically, this value tells you how long you could expect recovery of the instance to take place based on the work your system is doing at the time of testing. This measure reports the estimated mean time to recovery based on the current state of the running database. If the database has just opened, the system may contain only a few dirty buffers, so not much cache recovery would be required if the instance failed at this moment. That is why the value of this measure can, for the moment, be lower than the minimum possible Target MTTR. |
Recovery Estimated IOs: |
Indicates the estimated number of dirty buffers in the buffer cache of this instance. |
Number |
|
Target Redo blocks: |
Indicates the target number of redo blocks that must be processed while recovering this instance. |
Number |
Instance recovery is nothing more than using the contents of the online log files to rebuild the database buffer cache to the state it was in before the crash. This will replay all changes extracted from the redo logs that refer to blocks that had not been written to disk at the time of the crash. Though instance recovery guarantees no corruption, it may take a considerable time to do its roll forward before the database can be opened. This time is dependent on two factors: how much redo has to be read and how many read/write operations will be needed on the datafiles as the redo is applied. The values of these measures serve as good indicators of the amount of redo reading work that needs to be performed as part of the recovery process, and are hence useful while determining the MTTR. |
Actual Redo blocks: |
Indicates the actual number of redo blocks that are required by this Oracle instance to recover. |
Number |
|
Writes logfile size: |
Indicates the number of writes driven by the smallest redo log file size for each oracle instance. |
Number |
This measure is used to drive the checkpoint process, if your redo log file size is under sized. Since the FAST_START_MTTR_TARGET parameter is set to limit the instance recovery time, Oracle automatically tries to checkpoint as frequently as necessary. Under such a condition, the size of the log files should be large enough to avoid additional checkpoint due to under sized log files. |
Writes auto tune: |
Indicates the number of writes due to auto-tune checkpointing.
|
Number |
The checkpoint auto-tuning mechanism inspects statistics on machine utilization, such as the rate of disk I/O and CPU usage, and if it appears that there is spare capacity, it will use this capacity to write out additional dirty buffers from the database buffer cache, thus pushing the checkpoint position forward. The result is that even if the FAST_START_MTTR_TARGET parameter is set to a high value (the highest possible is 3600 seconds—anything above that will be rounded down), actual recovery time may well be much less. Enabling checkpoint auto-tuning with a high target should result in your instance always having the fastest possible recovery time that is consistent with maximum performance. |