Importance of Proactive Alerting

When evaluating a monitoring product, it is essential that you fully understand its alerting capabilities. Alerting is a responsive action triggered by a change in conditions within the system being monitored. Typically, an alert is defined by a condition that triggers it and an action that specifies what should happen when that condition occurs. If your monitoring tools are intended to ensure high availability and performance, rather than merely to support retrospective, passive analysis of metrics, you need very good proactive alerting.
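
The condition-plus-action definition above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular product models alerts: the `AlertRule` class, the `high_cpu` rule, and the 90% limit are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AlertRule:
    """Hypothetical alert rule: a trigger condition plus an action."""
    name: str
    condition: Callable[[float], bool]     # True when the alert should fire
    action: Callable[[str, float], None]   # what to do when it fires

    def evaluate(self, value: float) -> bool:
        """Run the action if the condition holds; report whether it fired."""
        if self.condition(value):
            self.action(self.name, value)
            return True
        return False


fired = []  # stand-in for a real notification channel (email, ticket, ...)

cpu_rule = AlertRule(
    name="high_cpu",
    condition=lambda pct: pct > 90.0,  # illustrative static limit
    action=lambda name, pct: fired.append(f"{name}: {pct:.1f}%"),
)

cpu_rule.evaluate(42.0)  # below the limit, so the action does not run
cpu_rule.evaluate(97.5)  # above the limit, so the action runs once
```

Keeping the condition and the action as separate callables means the same trigger logic can be reused with different notification targets.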

IT managers often complain about two main types of problems with monitoring and management tools, and they are both fundamentally associated with weak and inadequate alerting features:

  • Firstly, after they install the software, they start to receive many “false” alerts. A false alert refers to a situation in which the monitoring tool indicates a problem, but the IT manager determines that there is no real problem in the network. Thousands of such alerts can distract IT administrators, preventing them from focusing on the genuine issues that affect IT service quality or from working on more interesting strategic projects. This is commonly referred to as ‘noise’. Alert storms are at the extreme of this spectrum.
  • Secondly, to avoid false alerts, IT managers must define threshold values for the different metrics collected by the monitoring tool. A threshold is a limit set in the monitoring tool for the metric, so that if a metric crosses this value, an alert is raised. In a large enterprise, a monitoring tool that provides visibility into the different network, server, and application tiers can collect millions of metrics. Having to set thresholds manually for every single metric is a very time-consuming, monotonous exercise. As a result, many enterprises end up spending a lot of time and money having consultants calculate, assess, and tune thresholds manually, or investing in bespoke scripting to try to automate parts of this tedium.

Modern Alerting Technologies

Modern monitoring platforms such as eG Enterprise simply don’t have these issues as they provide out-of-the-box alerting coupled with sophisticated machine learning AIOps engines. This functionality can calculate both static and dynamic thresholds and correlate events with alerts to minimize noise, avoiding secondary, symptomatic alerts (as opposed to root-cause alerts) and event storms.

Modern alerting technologies now automatically deploy and auto-scale with dynamic infrastructures and IaC (Infrastructure-as-Code) workflows, and they process, check, and adjust thousands or even millions of metric thresholds. Crude static fixed threshold values and percentages have been replaced by systems that understand the hour-by-hour, day-of-the-week, monthly, and seasonal usage variations of systems to enable anomaly detection without problematic alert storms and false positives.
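
To make the idea of a usage-aware threshold concrete, here is a minimal sketch that learns a separate upper limit for each hour of the day from historical samples. It is not the algorithm used by eG Enterprise or any other product: the mean-plus-k-standard-deviations formula, the function names, and the sample data are all assumptions made for illustration, and a production system would also account for day-of-week and seasonal patterns.

```python
from collections import defaultdict
from statistics import mean, pstdev


def hourly_baselines(samples, k=3.0):
    """Build an upper threshold per hour of day from historical samples.

    samples: iterable of (hour_of_day, value) pairs.
    Returns {hour: mean + k * population standard deviation}, which is
    an assumed formula chosen only for this example.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {h: mean(v) + k * pstdev(v) for h, v in by_hour.items()}


def is_anomalous(thresholds, hour, value):
    """True when value exceeds the learned threshold for that hour."""
    return hour in thresholds and value > thresholds[hour]


# Hypothetical history: quiet at 3am, busy at 2pm.
history = [(3, 4), (3, 6), (3, 5), (14, 180), (14, 220), (14, 200)]
thresholds = hourly_baselines(history)

night_alert = is_anomalous(thresholds, 3, 150)   # unusual for 3am
day_alert = is_anomalous(thresholds, 14, 150)    # normal for 2pm
```

The same value of 150 is flagged at 3am but not at 2pm, which is the behavior a single fixed threshold cannot provide.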

In the past, organizations often spent several hundred thousand dollars on a monitoring tool, only for administrators to end up exporting large volumes of historic metric data into Excel spreadsheets or “stress level” calculators to try to figure out what the thresholds should be. This type of crude static threshold methodology is rarely effective or cost-effective, and has become unnecessary with modern monitoring platforms.

With newer workspace technologies such as Amazon WorkSpaces and Azure Monitor for AVD leveraging both static and AI-driven dynamic thresholds, awareness has increased of modern alerting functionality that avoids unscalable manual configuration. Many of the customer RFIs and PoCs that evaluate eG Enterprise now explicitly assess and score thresholding and alerting capabilities. We are increasingly seeing requests for architectural details of alerting capabilities and are pleased to release a new white paper to satisfy this need.

This new whitepaper will help you understand the concepts underpinning alerting methodology to assess alerting capabilities such as window-sizing and event duration.

Understanding Metric Thresholds and Alerting Features

In this new white paper, we cover the architectural qualities associated with eG Enterprise’s thresholding methodologies to automate and optimize thresholding and alerting. This white paper will enable readers to evaluate and compare the thresholding capabilities of monitoring solutions for a range of use cases including APM (Application Performance Monitoring), BTM (Business Transaction Monitoring), Digital Workspace monitoring (Citrix / VMware) and Cloud Monitoring (Azure Monitor, Amazon CloudWatch). The capabilities discussed will cover:

  • When to choose static vs. dynamic thresholds
  • How to combine static and dynamic thresholds
  • How machine learning technologies within AIOps engines have become standard to calculate and set alerting thresholds automatically for large-scale auto-scaled infrastructure
  • Threshold priorities and multi-level alerting – automatically escalating the priority of alerts as metrics change, rather than triggering additional alerts
  • Threshold sensitivities – how thresholds can be set to be 20% of a value or even 20% of normal usage for that metric at a particular time of day
  • Threshold duration considerations – how to ensure the time duration of events is included in alerting criteria, e.g., a CPU spike to 100% for 2 seconds could be ignored unless the problem continues, but the failure of a hard disk should be reported instantaneously
  • Leveraging reports on alerting to set KPIs and identify optimizations and continual improvements
  • How to configure alerting to handle planned maintenance
  • Integrating alerts into ITSM help and service desk systems such as ServiceNow and Autotask
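
Two of the capabilities listed above, duration-aware alerting and multi-level escalation, can be sketched together. The function below is a hypothetical illustration, not eG Enterprise's implementation: the name `evaluate_stream`, the event labels, and the threshold values are assumptions. An alert opens only after a breach has persisted for a set number of consecutive samples, and an open alert is escalated in place instead of a second alert being raised.

```python
def evaluate_stream(values, minor, major, min_consecutive=3):
    """Return (index, event) pairs for a sampled metric.

    An alert opens only after `min_consecutive` samples in a row at or
    above `minor`. While it is open, crossing `major` escalates the
    existing alert rather than opening another one.
    """
    events = []
    streak = 0        # consecutive samples at or above the minor threshold
    severity = None   # None while no alert is open
    for i, v in enumerate(values):
        if v < minor:
            streak = 0
            if severity is not None:
                events.append((i, "clear"))
                severity = None
            continue
        streak += 1
        if streak < min_consecutive:
            continue  # breach too short so far; treat as a transient spike
        level = "major" if v >= major else "minor"
        if severity is None:
            events.append((i, f"open:{level}"))
            severity = level
        elif level == "major" and severity == "minor":
            events.append((i, "escalate:major"))
            severity = "major"
    return events


# Hypothetical CPU samples: one brief spike, then a sustained climb.
cpu = [50, 100, 55, 85, 88, 90, 97, 99, 60]
cpu_events = evaluate_stream(cpu, minor=80, major=95, min_consecutive=3)

# A disk failure flag should alert on the first bad sample.
disk_events = evaluate_stream([0, 1], minor=1, major=1, min_consecutive=1)
```

In this sketch the isolated spike to 100 at index 1 is ignored, the sustained rise produces a single alert that is later escalated, and setting `min_consecutive=1` gives the immediate reporting that a hard failure needs.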

Reports and Dashboards can be used to understand and quantify alerts and assess IT’s effectiveness

White paper – Understanding How Combining Static and Dynamic Thresholds Avoid False Alarms

The white paper guides readers through complex examples of thresholding best practice, for example how to avoid false alarms around Citrix user sessions.

Figure: A performance graph showing the number of user sessions and the auto-computed, dynamic-static thresholds used for alerting.

An auto-static combination threshold is applied to this metric. In the early morning hours, a static threshold is applied because the dynamic threshold is lower. The static value ensures that alerts are not generated as long as the number of sessions stays below 10. During the day (8am onwards), the automatic threshold takes over. The blue line in the figure denotes the metric’s value over time. The yellow lines represent the upper threshold values. Notice that from 4pm to 8am, the threshold is static, with the minor value at 10 sessions and the major value at 15 sessions. Since the automatically computed value is less than both thresholds, the statically set threshold values apply in this case. As is the case with maximum thresholds, if a static minimum and an automatic minimum threshold are specified, then eG Enterprise will generate alarms only when the current value falls below the lower of the two threshold settings.
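
The combination rule described above can be expressed as a short sketch: for maximum thresholds the higher of the static and automatic values is in force, and for minimum thresholds the lower of the two. The function names and the example dynamic values (4 overnight, 120 during the day) are hypothetical, and the handling of a value that sits exactly on the limit is an assumption, since the text does not specify it.

```python
def upper_alert(value, static_max, dynamic_max):
    """Alert only when value reaches the higher of the two upper limits."""
    return value >= max(static_max, dynamic_max)


def lower_alert(value, static_min, dynamic_min):
    """Alert only when value falls below the lower of the two lower limits."""
    return value < min(static_min, dynamic_min)


STATIC_MINOR = 10  # minor static threshold for sessions, from the example

# Overnight: the computed threshold (assumed to be 4) is below the static
# value, so the static limit of 10 sessions is the one in force.
overnight_quiet = upper_alert(8, STATIC_MINOR, dynamic_max=4)
overnight_busy = upper_alert(12, STATIC_MINOR, dynamic_max=4)

# Daytime: the computed threshold (assumed to be 120) is above the static
# value, so the automatic limit takes over.
daytime_normal = upper_alert(60, STATIC_MINOR, dynamic_max=120)
daytime_spike = upper_alert(130, STATIC_MINOR, dynamic_max=120)
```

With these inputs, 8 overnight sessions and 60 daytime sessions raise nothing, while 12 overnight sessions and 130 daytime sessions do, so the static floor suppresses false alarms when normal usage is very low.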

After reading our white paper, you should be in a position to understand all the common thresholding methodologies, algorithms, and features available from modern monitoring platforms. Enjoy!

Further Information: