The Limitations of Fixed, Manual Thresholding

Thresholds are upper and lower bounds that determine whether a metric is performing as expected. Whenever the actual value of the metric falls outside the prescribed limits, the monitoring system raises an alarm.

Typically, administrators have to define the thresholds for each metric collected by a monitoring system. Since a large infrastructure has thousands of metrics, manually setting a threshold for each one is laborious. For metrics like availability and response time, administrators can set fixed thresholds based on their service level expectations and agreements. Other metrics, which are not bound by SLAs, can vary with time. For example, consider the number of users connected to a web server. The value of this metric varies with time of day (there are more connections during the day and fewer at night) and even with the day of the week. Hence, a single, fixed threshold is not feasible for such time-varying metrics.

Capabilities of the eG Alarm Manager

The eG Alarm Manager greatly simplifies monitoring and management of large infrastructures through its three main capabilities:

Intelligent Thresholds

The eG Alarm Manager embeds an intelligent thresholding engine designed to handle both metrics with fixed expected values and those that vary with time of day. For service quality metrics (e.g., availability, response time), the eG Alarm Manager allows administrators to set multiple fixed thresholds: different thresholds correspond to different values of the metric, so that a proactive alarm is generated when the metric is slightly out of conformance, and a severe alarm is generated when the problem worsens.
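The idea of multiple fixed thresholds can be pictured with a small Python sketch. The severity names, the limits, and the `classify` function are illustrative assumptions, not eG Alarm Manager settings or APIs.

```python
from typing import Optional

# Hypothetical severity levels and limits for a response-time metric.
# Each entry is (severity, upper limit in seconds), most severe first.
RESPONSE_TIME_LIMITS = [
    ("critical", 5.0),
    ("major", 3.0),
    ("minor", 1.5),
]


def classify(value: float, limits=RESPONSE_TIME_LIMITS) -> Optional[str]:
    """Return the most severe level whose limit the value exceeds, or None."""
    for severity, limit in limits:
        if value > limit:
            return severity
    return None


if __name__ == "__main__":
    for sample in (0.8, 2.0, 3.4, 7.2):
        print(sample, classify(sample))
```

With these assumed limits, a response time slightly above 1.5 seconds produces an early "minor" alarm, and the severity escalates only if the value keeps degrading.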
For other metrics, the eG Alarm Manager computes time-varying thresholds automatically. It uses proven statistical quality control techniques to analyze past values of each metric and set its upper and lower bounds from the historical data. Since the values of the metrics vary over time, the computed thresholds are also time-varying.
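The text does not say which statistical technique is used. As a rough illustration only, the Python sketch below assumes classic control limits (mean plus or minus `k` standard deviations) computed separately for each hour of the day. The function names, the hourly grouping, and the default `k = 3` are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def hourly_thresholds(history, k=3.0):
    """Compute (lower, upper) bounds per hour of day from past samples.

    history: iterable of (hour_of_day, value) pairs, hour in 0..23.
    k: width of the band in standard deviations (an assumed choice).
    """
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)

    bounds = {}
    for hour, values in by_hour.items():
        centre = mean(values)
        spread = pstdev(values)
        bounds[hour] = (centre - k * spread, centre + k * spread)
    return bounds


def violates(bounds, hour, value):
    """Return True if the value falls outside the band for that hour."""
    lower, upper = bounds[hour]
    return value < lower or value > upper


if __name__ == "__main__":
    # Made-up connection counts: quiet at 03:00, busy at 14:00.
    history = [(3, 10), (3, 12), (3, 14), (14, 200), (14, 220), (14, 240)]
    bounds = hourly_thresholds(history)
    print(violates(bounds, 3, 200))   # 200 connections at night
    print(violates(bounds, 14, 200))  # 200 connections in the afternoon
```

In this sketch the same reading of 200 connections counts as a violation at 03:00 but as normal at 14:00, which is the behavior a single fixed threshold cannot provide.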
Flexible Alarm Policies

While a threshold policy determines how the thresholds for a metric are computed, an alarm policy determines when alarms are generated to inform administrators about a problem. Depending on their criticality, different metrics may require different alarm policies. For instance, a momentary surge in CPU usage is normal on a production system. On the other hand, even a brief outage of a critical network router is a critical event that must be reported to the administrator. Alarm policies must also take into account how often a metric violates its thresholds. For example, while a momentary surge in CPU usage is not a cause for concern, a prolonged series of surges in the same metric may indicate a problem that must be corrected.
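One way to express such a policy is with two values: an observation period and a minimum number of threshold violations within that period. The Python sketch below is a minimal illustration under that assumption. The `AlarmPolicy` class and its method names are hypothetical and do not reflect the product's implementation.

```python
from collections import deque


class AlarmPolicy:
    """Alarm when at least `min_violations` occur within `period` seconds."""

    def __init__(self, period: float, min_violations: int):
        self.period = period
        self.min_violations = min_violations
        self._violations = deque()  # timestamps of recent violations

    def observe(self, timestamp: float, violated: bool) -> bool:
        """Record one measurement; return True if an alarm should be raised."""
        if violated:
            self._violations.append(timestamp)
        # Drop violations that have aged out of the observation period.
        while self._violations and timestamp - self._violations[0] > self.period:
            self._violations.popleft()
        return len(self._violations) >= self.min_violations


if __name__ == "__main__":
    # Tolerant policy for CPU usage: 3 violations within 5 minutes.
    cpu = AlarmPolicy(period=300, min_violations=3)
    for t, bad in [(0, True), (60, False), (120, True), (180, True)]:
        print(t, cpu.observe(t, bad))

    # Strict policy for a critical router: a single violation is enough.
    router = AlarmPolicy(period=60, min_violations=1)
    print(router.observe(0, True))
```

A tolerant policy like `cpu` ignores an isolated spike, while a strict policy like `router` alarms on the first violation.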
To accommodate different types of metrics, the eG Alarm Manager offers administrators complete flexibility in setting alarm policies. Administrators can set individual alarm policies for each server, for each server group, or per server type (e.g., web server, database, application server). Each alarm policy is defined by two parameters: the period over which threshold violations are observed before generating an alarm, and the minimum number of threshold violations necessary in this period to trigger an alarm.

Automatic Correlation

Often, a single problem triggers a number of side effects. At such times, administrators receive a large number of alarms and struggle to figure out where to begin problem diagnosis. In such scenarios, the eG Alarm Manager gives administrators a head start. The patented correlation engine embedded in the eG suite analyzes the measurements provided to it by eG agents in real time, assesses the inter-dependencies between infrastructure components, and automatically prioritizes alarms into different levels of criticality. This capability is ideal for multi-domain environments where finger-pointing between operations staff is prevalent and leads to longer downtimes and reduced customer satisfaction.
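The correlation engine's actual algorithm is not described here. Purely to illustrate the general idea of using inter-dependencies to rank alarms, the toy Python sketch below labels an alarmed component a side effect if anything it depends on is also in alarm, and a probable root cause otherwise. The function, the labels, and the sample dependency map are all assumptions made for this example.

```python
def prioritize(alarmed, depends_on):
    """Split alarmed components into probable root causes and side effects.

    alarmed: set of component names currently in alarm.
    depends_on: dict mapping a component to the components it relies on.
    """
    def has_alarmed_dependency(component, seen):
        for dep in depends_on.get(component, ()):
            if dep in seen:
                continue  # already visited; also guards against cycles
            seen.add(dep)
            if dep in alarmed or has_alarmed_dependency(dep, seen):
                return True
        return False

    priorities = {}
    for component in alarmed:
        if has_alarmed_dependency(component, {component}):
            priorities[component] = "side effect"
        else:
            priorities[component] = "probable root cause"
    return priorities


if __name__ == "__main__":
    # Hypothetical chain: web server -> application server -> database -> network.
    depends_on = {"web": ["app"], "app": ["db"], "db": ["network"]}
    print(prioritize({"web", "app", "db"}, depends_on))
```

In this example all three tiers are in alarm, but only the database is flagged as the probable root cause, which tells the administrator where to start.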