eG Alarm Manager
The Limitations of Fixed, Manual Thresholding
Thresholds are upper and lower bounds that determine whether a metric is performing to expectation or not. Every time the actual value of the metric falls outside the prescribed limits, the monitoring system raises an alarm.
Typically, administrators have to define the thresholds for each metric collected by a monitoring system. Since a large infrastructure has thousands of metrics, manually setting thresholds for every one of them is a laborious, cumbersome process. For metrics like availability and response time, administrators can set fixed thresholds based on their service level expectations and agreements (SLAs). Other metrics, which are not bound by SLAs, can vary over time. For example, consider the number of users connected to a web server. The value of this metric varies with time of day (there are more connections during the day and fewer during the night) and even day of the week. Hence, it is not feasible to have a single, fixed threshold for such time-varying metrics.
Capabilities of the eG Alarm Manager
The eG Alarm Manager greatly simplifies monitoring and management of large infrastructures through its three main capabilities: intelligent thresholding, flexible alarm policies, and automatic prioritization of alarms.
The eG Alarm Manager embeds an intelligent thresholding engine that has been designed to handle both metrics with fixed expected values and those that vary with time of day. For service quality metrics (e.g., availability, response time), the eG Alarm Manager allows administrators to set multiple fixed thresholds. Each threshold corresponds to a different value of the metric, so a proactive alarm can be generated when the metric is slightly out of conformance, and a severe alarm when the problem worsens.
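The idea of several fixed thresholds on one metric can be illustrated with a minimal Python sketch. The severity names and the response-time limits below are assumptions chosen for illustration; they are not taken from the eG product.

```python
from typing import Optional

# Hypothetical severity ladder for a response-time metric, in seconds.
# Levels are listed from most to least severe; names and limits are assumed.
RESPONSE_TIME_THRESHOLDS = [
    ("critical", 5.0),
    ("major", 3.0),
    ("minor", 1.5),
]


def classify(value: float, thresholds=RESPONSE_TIME_THRESHOLDS) -> Optional[str]:
    """Return the most severe level whose limit the value exceeds, or None."""
    for severity, limit in thresholds:
        if value > limit:
            return severity
    return None


if __name__ == "__main__":
    for sample in (0.8, 2.0, 3.4, 7.2):
        print(sample, classify(sample))
```

Because the ladder is checked from the most severe level downwards, a value that crosses several limits is reported once, at its highest severity.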
For other metrics, the eG Alarm Manager computes time-varying thresholds automatically. It applies tried and tested statistical quality control techniques to the past values of each metric and sets the upper and lower bounds from that historical data. Since the values of the metrics vary over time, the computed thresholds are also time-varying.
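The source does not say which statistical technique is used. As one plausible illustration only, the sketch below builds a separate band of mean plus or minus k standard deviations for each hour of the day, in the spirit of a control chart. The function names, the band width `k`, and the sample data are all assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev


def hourly_thresholds(history, k=3.0):
    """Compute (lower, upper) bounds for each hour of day.

    history: iterable of (hour_of_day, value) pairs from past measurements.
    k: width of the band in standard deviations (an assumed default).
    """
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)

    bounds = {}
    for hour, values in by_hour.items():
        if len(values) < 2:
            continue  # not enough data to estimate the spread
        mu, sigma = mean(values), stdev(values)
        # Clamp at zero because the metric is assumed to be a count.
        bounds[hour] = (max(0.0, mu - k * sigma), mu + k * sigma)
    return bounds


def violates(bounds, hour, value):
    """True when the value falls outside the band learned for that hour."""
    if hour not in bounds:
        return False
    lower, upper = bounds[hour]
    return value < lower or value > upper


if __name__ == "__main__":
    # Made-up history: few connections at 03:00, many at 14:00.
    history = [(3, 10), (3, 12), (3, 11), (3, 9),
               (14, 200), (14, 220), (14, 210), (14, 190)]
    bounds = hourly_thresholds(history)
    print(violates(bounds, 3, 150))   # far too many users for the night
    print(violates(bounds, 14, 210))  # normal for the afternoon
```

With bands like these, the same reading of 150 connections is flagged as too high at night and too low in the afternoon, which a single fixed threshold cannot express.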
The key benefits of this approach are:
- Users have the flexibility to choose between fixed and automatic thresholds for each and every metric
- Threshold configuration can be completely automated
- No need for continuous tuning of thresholds as the IT infrastructure evolves
The eG Difference
- A wide choice of threshold policies - Set fixed, automatic, time-varying, or no thresholds at all to suit existing infrastructure needs
- Flexible, granular alarm policy definitions that can be customized for the target infrastructure, ensuring fewer false alarms
- Out-of-the-box correlation capability focuses administrators' attention on the real cause of problems, so that they are not distracted by the effects of those problems
- Simple-to-configure correlation engine requires minutes to configure, not months to customize
- Personalized alerts ensure that administrators only receive alerts pertaining to infrastructure elements in their domain of responsibility
eG's automatic, time-varying thresholding approach applied to the user connections metric of a web server
Flexible Alarm Policies
While a threshold policy determines how the thresholds for a metric are computed, an alarm policy determines when alarms are generated to inform administrators about a problem. Depending on their criticality, different metrics may require different alarm policies. For instance, an instantaneous surge in CPU usage is a natural phenomenon in a production system. On the other hand, even sporadic unavailability of a critical network router is a critical event that must be reported to the administrator. Alarm policies must also take into account the frequency of threshold violations of a metric. For example, while an instantaneous surge in CPU usage is not a cause for concern, a prolonged series of surges in the same metric may indicate a problem that must be corrected.
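The difference between a one-off surge and a prolonged run of violations can be sketched as a count of violations inside a sliding time window. The class below is a hypothetical illustration, not the eG implementation. `window_seconds` and `min_violations` are assumed parameter names, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class AlarmPolicy:
    """Raise an alarm only when enough violations fall within a time window."""

    def __init__(self, window_seconds: float, min_violations: int):
        self.window_seconds = window_seconds
        self.min_violations = min_violations
        self._violations = deque()  # timestamps of recent violations

    def record(self, timestamp: float, violated: bool) -> bool:
        """Record one measurement; return True if an alarm should be raised."""
        if violated:
            self._violations.append(timestamp)
        # Drop violations that have aged out of the observation window.
        while (self._violations
               and timestamp - self._violations[0] >= self.window_seconds):
            self._violations.popleft()
        return len(self._violations) >= self.min_violations


if __name__ == "__main__":
    cpu = AlarmPolicy(window_seconds=300, min_violations=3)  # tolerate spikes
    router = AlarmPolicy(window_seconds=300, min_violations=1)  # alarm at once
    print(cpu.record(0, True), router.record(0, True))
```

A tolerant policy suits a metric such as CPU usage, while a policy with a minimum count of one reports the first violation immediately, as would be wanted for a critical router.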
To accommodate different types of metrics, the eG Alarm Manager offers administrators complete flexibility in setting alarm policies. Administrators can set individual alarm policies for each server, for each server group, or per server type (e.g., web server, database, application server). Each alarm policy is defined by two parameters: the period over which threshold violations are observed before generating an alarm, and the minimum number of threshold violations necessary in this period to trigger an alarm.
Automatic prioritization of alarms depending on their criticality
Often, a single problem triggers a number of side effects. At such times, administrators receive a large number of network alarms and struggle to figure out where to begin problem diagnosis. In such scenarios, the eG Alarm Manager gives administrators a head start. The patented event correlation engine embedded in the eG suite analyzes the measurements provided to it by eG agents in real time, assesses the inter-dependencies between infrastructure components, and automatically prioritizes alarms into different levels of criticality. This capability is ideal for multi-domain environments where finger-pointing between operations staff is prevalent, resulting in longer downtimes and reduced customer satisfaction.
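The patented correlation algorithm is not described in this document. As a loose illustration of dependency-based prioritization in general, the sketch below marks an alarm as a probable effect when something the component depends on, directly or indirectly, is also alarming. The function name, the "high" and "low" labels, and the sample topology are assumptions.

```python
def prioritize(alarming, depends_on):
    """Split alarming components into probable causes and probable effects.

    alarming: set of component names with an open alarm.
    depends_on: dict mapping a component to the components it relies on.
    Returns a dict of component -> "high" (probable cause) or "low" (effect).
    """
    def has_alarming_dependency(component, seen):
        for dep in depends_on.get(component, ()):
            if dep in seen:
                continue  # guard against cycles in the dependency map
            seen.add(dep)
            if dep in alarming or has_alarming_dependency(dep, seen):
                return True
        return False

    priorities = {}
    for component in alarming:
        is_effect = has_alarming_dependency(component, {component})
        priorities[component] = "low" if is_effect else "high"
    return priorities


if __name__ == "__main__":
    # Made-up topology: the web server relies on the application server,
    # which in turn relies on the database.
    topology = {"web_server": ["app_server"], "app_server": ["database"]}
    print(prioritize({"web_server", "app_server", "database"}, topology))
```

In this example only the database alarm is rated high, which points the administrator at the component that the other two depend on.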
eG's integrated and automatic correlation approach
Key Benefits of the eG Alarm Manager
- Easy to provision - Have the monitoring system up and running in hours, not days!
- Allow the monitoring system to automatically learn the baseline performance of the infrastructure and provide alerts when anomalies are detected
- Automatic prioritization of alarms lets administrators focus on the key problems
- Receive personalized, proactive alerts anywhere, at any time, over email, SMS, or the web