Fault Summary Test

A fault is a mutable object that is managed by the Cisco UCS Manager. Each fault represents a failure in the Cisco UCS Manager or an alarm threshold that has been raised. The fault can change from one state or severity to another during its lifecycle. Each fault includes information about the operational state of the affected object at the time the fault was raised. If the fault is transitional and the failure is resolved, then the object transitions to a functional state.

The fault remains in the Cisco UCS Manager until it is cleared and deleted according to the settings in the fault collection policy. The fault collection policy controls the lifecycle of a fault in a Cisco UCS instance, including when faults are cleared, the flapping interval (the length of time between the fault being raised and the condition being cleared), and the retention interval (the length of time a fault is retained in the system). Faults that are not detected early may cause the following types of problems:

  • service unavailability,
  • power, thermal, and voltage problems,
  • component configuration failures,
  • serious management issues,
  • poor adapter connectivity,
  • network issues such as a link going down,
  • log capacity issues or failed server discovery.

To prevent these problems, the faults raised in the Cisco UCS Manager should be tracked at regular intervals and cleared before the operation of the Cisco UCS Manager comes to a halt. The Faults Summary test helps administrators in this regard.

This test monitors the faults raised in the Cisco UCS Manager and reports the number of faults raised for each severity. Using this test, administrators can identify the severity at which the most faults were raised and take corrective measures.
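
As a rough illustration of the per-severity counting described above, the following Python sketch tallies a list of fault records by severity. The record layout, the helper names `count_faults_by_severity` and `most_frequent_severity`, and the lowercase severity labels are assumptions made for this example; they are not the eG agent's implementation or the exact format returned by Cisco UCS Manager.

```python
from collections import Counter

# Severity labels assumed for this sketch, mirroring the measures this test reports.
SEVERITIES = ("critical", "major", "minor", "warning", "info")


def count_faults_by_severity(faults):
    """Return a dict mapping each known severity to its fault count."""
    counts = Counter(fault.get("severity", "").lower() for fault in faults)
    # Report every known severity, even when no faults were raised for it.
    return {severity: counts.get(severity, 0) for severity in SEVERITIES}


def most_frequent_severity(counts):
    """Return the severity with the highest count, or None if all counts are zero."""
    severity = max(counts, key=counts.get)
    return severity if counts[severity] > 0 else None


if __name__ == "__main__":
    sample_faults = [  # hypothetical fault records
        {"id": 1, "severity": "major"},
        {"id": 2, "severity": "warning"},
        {"id": 3, "severity": "major"},
        {"id": 4, "severity": "info"},
    ]
    summary = count_faults_by_severity(sample_faults)
    print(summary)
    print(most_frequent_severity(summary))
```

Reporting a zero for severities with no faults keeps the output shape stable from one measurement period to the next, which makes the counts easier to compare over time.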

Target of the test : A Cisco UCS Manager

Agent deploying the test : A remote agent

Outputs of the test : One set of results for the Cisco UCS Manager that is being monitored.

Configurable parameters for the test
Parameter | Description

Test Period

How often should the test be executed.

Host

The IP address of the host for which the test is being configured.

Port

The port at which the specified host listens.

UCS User and
UCS Password

Provide the credentials of a user with at least read-only privileges to the target Cisco UCS Manager.

Confirm Password

Confirm the password by retyping it here.

SSL

By default, the Cisco UCS Manager is SSL-enabled. Accordingly, the SSL flag is set to Yes by default.

Web Port

By default, in most virtualized environments, Cisco UCS Manager listens on port 80 (if not SSL-enabled) or on port 443 (if SSL-enabled) only. Accordingly, when monitoring Cisco UCS Manager, the eG agent connects by default to port 80 if the SSL flag above is set to No, and to port 443 if the SSL flag is set to Yes. To reflect this behavior, the WebPort parameter is set to default by default.

In some environments, however, the default ports 80 or 443 might not apply. In such a case, specify against the WebPort parameter the exact port at which the Cisco UCS Manager in your environment listens, so that the eG agent communicates with that port when collecting metrics from the Cisco UCS Manager.
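
The port selection rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `resolve_web_port` and `build_base_url`, the string value "default", and the sample address are hypothetical and do not represent the eG agent's actual code.

```python
def resolve_web_port(ssl_enabled, web_port="default"):
    """Pick the port to connect to, following the defaulting rule described above."""
    if str(web_port).strip().lower() == "default":
        # Port 443 when SSL is enabled, port 80 otherwise.
        return 443 if ssl_enabled else 80
    port = int(web_port)  # raises ValueError for non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def build_base_url(host, ssl_enabled, web_port="default"):
    """Build the base URL for a host (hypothetical helper for this sketch)."""
    scheme = "https" if ssl_enabled else "http"
    return f"{scheme}://{host}:{resolve_web_port(ssl_enabled, web_port)}"


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only address used as a placeholder.
    print(build_base_url("192.0.2.10", ssl_enabled=True))
    print(build_base_url("192.0.2.10", ssl_enabled=True, web_port="8443"))
```

An explicit port always overrides the SSL-based default, which matches the behavior described for environments where ports 80 and 443 do not apply.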

Show Info DD

Typically, if the Detailed Diagnosis flag is set to On for this test, eG Enterprise periodically collects the complete details of all the informational messages received by the Cisco UCS Manager and stores them in the database. This way, whenever a user clicks on the Diagnosis icon (magnifying glass icon) corresponding to the Information measure reported by this test in the monitoring console, eG Enterprise retrieves the relevant detailed diagnosis information from the database and provides it to the user. In large environments, however, the number of informational messages received by the Cisco UCS Manager can be very large. The detailed diagnosis of such messages will therefore occupy a considerable amount of database space, which will only grow with time. To minimize the strain on the eG database, the detailed diagnosis capability for information events alone is turned off by default in the eG Enterprise system. Accordingly, the Show Info DD flag is set to False by default. However, you can set this flag to True, so that detailed diagnosis is available for information events as well.

Detailed Diagnosis

To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, choose the Off option.

The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

  • The eG manager license should allow the detailed diagnosis capability.
  • Neither the normal nor the abnormal frequency configured for the detailed diagnosis measures should be 0.
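
The two conditions above can be expressed as a simple check. The sketch below assumes that the conditions mean the license must permit detailed diagnosis and that both frequencies must be non-zero; the function name `dd_toggle_available` and its parameters are hypothetical and are not part of eG Enterprise.

```python
def dd_toggle_available(license_allows_dd, normal_frequency, abnormal_frequency):
    """Return True when the On/Off detailed diagnosis option would be offered.

    Assumes both conditions must hold: the license permits detailed
    diagnosis, and neither configured frequency is 0.
    """
    return bool(license_allows_dd) and normal_frequency != 0 and abnormal_frequency != 0


if __name__ == "__main__":
    print(dd_toggle_available(True, 4, 1))
    print(dd_toggle_available(True, 0, 1))
```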

Measurements made by the test
Measurement | Description | Measurement Unit | Interpretation

Critical Faults

Indicates the number of Critical fault events that occurred during the last measurement period.

Number

Ideally, the value of this measure should be zero. A critical fault is a service-affecting condition that requires immediate corrective action. For instance, the critical severity could indicate that the managed object is out of service and its capability must be restored.

Major Faults

Indicates the number of Major fault events that occurred during the last measurement period.

Number

Ideally, the value of this measure should be zero. A major fault is a service-affecting condition that requires urgent corrective action. A significant increase in the value of this measure indicates a severe degradation in the capability of the managed object.

Minor Faults

Indicates the number of Minor fault events that occurred during the last measurement period.

Number

Ideally, the value of this measure should be zero. A minor fault condition requires corrective action to prevent a more serious fault from occurring.

Warning Faults

Indicates the number of Warning fault events that occurred during the last measurement period.

Number

Ideally, the value of this measure should be very low. A warning fault is a potential or impending service-affecting fault that currently has no significant effects in the system. Corrective actions should be taken to further diagnose, if necessary, and correct the problem to prevent it from becoming a more serious service-affecting fault.

Information

Indicates the number of information/notifications that were received during the last measurement period.

Number

An informational message is a basic notification that is insignificant on its own. It provides details on state changes and fault transitions. For more details about these messages, use the detailed diagnosis of this measure.