Failover Cluster Operational Log Test

Windows Failover Clustering allows multiple nodes to work together to provide high availability for critical applications and services. The Operational log of the Failover-Clustering subsystem records detailed runtime events related to cluster configuration changes, node state transitions, quorum status, resource load behavior, and cluster-level problem conditions. Any abnormal behavior such as repeated node transitions, resource failures, loss of connectivity, quorum changes or failover instability can directly affect workload availability, reduce service resilience, and lead to unexpected downtime for users.

Continuous monitoring of the cluster’s operational log helps administrators detect underlying issues early - long before a full failover or outage occurs. This test monitors the Failover Cluster Operational log and reports the count of event entries of different severity levels - informational messages, warnings, error messages, critical messages, and verbose messages. By tracking the volume and nature of these message types, this test enables administrators to quickly identify abnormal patterns, troubleshoot instability faster, and proactively investigate events that can impact business workloads hosted on the cluster.

This test is disabled by default. To enable the test, go to the enable / disable tests page using the menu sequence : Agents -> Tests -> Enable/Disable, pick the desired Component type, set Performance as the Test type, choose the test from the DISABLED TESTS list, and click on the << button to move the test to the ENABLED TESTS list. Finally, click the Update button.

Target of the test : Unix/Windows server

Agent deploying the test : An internal agent

Outputs of the test : One set of results for the filter configured

Configurable parameters for the test
Parameter Description

Test Period

How often should the test be executed.

Host

The IP address of the host for which the test is being configured.

Port

Specify the port at which the target host listens to.

Logtype

Refers to the type of event logs to be monitored.

Policy based filter

Using this page, administrators can configure the event sources, event IDs, and event descriptions to be monitored by this test. In order to enable administrators to easily and accurately provide this specification, this page provides the following options:

  • Manually specify the event sources, IDs, and descriptions in the FILTER text area, or,

  • Select a specification from the predefined filter policies listed in the FILTER box

For explicit, manual specification of the filter conditions, select the NO option against the POLICY BASED FILTER field. This is the default selection. To choose from the list of pre-configured filter policies, or to create a new filter policy and then associate the same with the test, select the YES option against the POLICY BASED FILTER field.

Filter

If the POLICY BASED FILTER flag is set to NO, then a FILTER text area will appear, wherein you will have to specify the event sources, event IDs, and event descriptions to be monitored. This specification should be of the following format: {Displayname}:{event_sources_to_be_included}:{event_sources_to_be_excluded}:{event_IDs_to_be_included}:{event_IDs_to_be_excluded}:{event_descriptions_to_be_included}:{event_descriptions_to_be_excluded}. For example, assume that the FILTER text area takes the value, OS_events:all:Browse,Print:all:none:all:none. Here:

  • OS_events is the display name that will appear as a descriptor of the test in the monitor UI;

  • all indicates that all the event sources need to be considered while monitoring. To monitor specific event sources, provide the source names as a comma-separated list. To ensure that none of the event sources are monitored, specify none.

  • Next, to ensure that specific event sources are excluded from monitoring, provide a comma-separated list of source names. Accordingly, in our example, Browse and Print have been excluded from monitoring. Alternatively, you can use all to indicate that all the event sources have to be excluded from monitoring, or none to denote that none of the event sources need be excluded.

  • In the same manner, you can provide a comma-separated list of event IDs that require monitoring. The all in our example represents that all the event IDs need to be considered while monitoring.

  • Similarly, the none (following all in our example) is indicative of the fact that none of the event IDs need to be excluded from monitoring. On the other hand, if you want to instruct the eG Enterprise system to ignore a few event IDs during monitoring, then provide the IDs as a comma-separated list. Likewise, specifying all makes sure that all the event IDs are excluded from monitoring.

  • The all which follows implies that all events, regardless of description, need to be included for monitoring. To exclude all events, use none. On the other hand, if you provide a comma-separated list of event descriptions, then the events with the specified descriptions will alone be monitored. Event descriptions can be of any of the following forms - desc*, or desc, or *desc*,or desc*, or desc1*desc2, etc. desc here refers to any string that forms part of the description. A leading '*' signifies any number of leading characters, while a trailing '*' signifies any number of trailing characters.

  • In the same way, you can also provide a comma-separated list of event descriptions to be excluded from monitoring. Here again, the specification can be of any of the following forms: desc*, or desc, or *desc*,or desc*, or desc1*desc2, etc. desc here refers to any string that forms part of the description. A leading '*' signifies any number of leading characters, while a trailing '*' signifies any number of trailing characters. In our example however, none is specified, indicating that no event descriptions are to be excluded from monitoring. If you use all instead, it would mean that all event descriptions are to be excluded from monitoring.

By default, the filter parameter contains the value: all:all:none:all:none:all:none. Multiple filters are to be separated by semi-colons  (;).

Note:

The event sources and event IDs specified here should be exactly the same as that which appears in the Event Viewer window.  

On the other hand, if the POLICY BASED FILTER flag is set to YES, then a FILTER list box will appear, displaying the filter policies that pre-exist in the eG Enterprise system. A filter policy typically comprises of a specific set of event sources, event IDs, and event descriptions to be monitored. This specification is built into the policy in the following format:

{Policyname}:{event_sources_to_be_included}:{event_sources_to_be_excluded}:{event_IDs_to_be_included}:{event_IDs_to_be_excluded}:{event_descriptions_to_be_included}:{event_descriptions_to_be_excluded}

To monitor a specific combination of event sources, event IDs, and event descriptions, you can choose the corresponding filter policy from the FILTER list box. Multiple filter policies can be so selected. Alternatively, you can modify any of the existing policies to suit your needs, or create a new filter policy. To facilitate this, a Click here link appears just above the test configuration section, once the YES option is chosen against POLICY BASED FILTER. Clicking on the Click here link leads you to a page where you can modify the existing policies or create a new one (refer to page 1). The changed policy or the new policy can then be associated with the test by selecting the policy name from the FILTER list box in this page.

Usewmi

The eG agent can either use WMI to extract event log statistics or directly parse the event logs using event log APIs. If the USEWMI flag is YES, then WMI is used. If not, the event log APIs are used. This option is provided because on some Windows systems (especially ones with service pack 3 or lower), the use of WMI access to event logs can cause the CPU usage of the WinMgmt process to shoot up. On such systems, set the USEWMI parameter value to NO. On the other hand, when monitoring systems that are operating on any other flavor of Windows (say, Windows 2012 or above), the USEWMI flag should always be set to ‘Yes’.

DD Frequency

Refers to the frequency with which detailed diagnosis measures are to be generated for this test. The default is 1:1. This indicates that, by default, detailed measures will be generated every time this test runs, and also every time the test detects a problem. You can modify this frequency, if you so desire. Also, if you intend to disable the detailed diagnosis capability for this test, you can do so by specifying none against DD frequency.

Detailed Diagnosis

To make diagnosis more efficient and accurate, the eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option.

The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

  • The eG manager license should allow the detailed diagnosis capability
  • Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
Measurements made by the test
Measurement Description Measurement Unit Interpretation

Information messages

Indicates the number of information events that were generated during the test's last execution.

Number

A change in value of this measure may indicate infrequent but successful operations performed by one or more applications.

Warnings

Indicates the number of warnings generated during the test's last execution.

Number

A high value of this measure indicates problems that may not have an immediate impact, but may cause future problems.

Error messages

Indicates the number of error events generated during the last execution of the test.

Number

A very low value (zero) is desired for this measure, as it indicates good health.

An increasing trend or a high value indicates the existence of problems.

Critical messages

Indicates the number of critical events that were generated when the test was last executed.

Number

A critical event is one that a system/application cannot automatically recover from.

A very low value (zero) indicates that the system/application is in a healthy state and is running smoothly without any potential problems.

An increasing trend or high value indicates the existence of fatal/irrepairable problems.

The detailed diagnosis of this measure describes all the critical events captured by the configured logtype during the last measurement period.

Verbose messages

Indicates the number of verbose events that were generated when the test was last executed.

Number

Verbose logging provides more details in the log entry, which will enable you to troubleshoot issues better.

The detailed diagnosis of this measure describes all the verbose events that were captured by the configured logtype during the last measurement period.