Correlation and root-cause diagnosis have always been the holy grail of IT performance monitoring. Instead of managing a flood of alerts, correlation and root-cause diagnosis help IT administrators determine where the real cause of a problem lies and work to resolve this quickly, so as to minimize user impact and business loss.
However, none of this is as simple as it sounds! There has always been confusion around event correlation. Terms like event storms, false alerts, root-cause, correlation, analytics, and others are used by vendors with reckless abandon. The result for customers can be a lot of confusion.
As a result, I’ve always liked some of the best practice guidance around monitoring and event management. Event management’s objectives – detect events, make sense of them and determine the appropriate control action – can be a good way to understand these concepts, and breaking things down along these lines can help us understand what can be a complex subject.
Data Collection and Detecting Events
This is exactly what it sounds like. Monitors need to collect all sorts of data from devices, applications, services, users and more. This data can be collected in many ways, by many tools and/or components.
So, the first questions you’ll need to ask include “what are we trying to achieve?” and “what data do we need to collect?” Most customers ask these questions based on their individual perspective – a device, a technology silo, a supporting IT service, a customer-facing IT service or perhaps even a single user.
This perspective can result in a desire for a LOT of monitoring data, where the customer is attempting to “cover all bases” by collecting all the data they can get their hands on – “just in case.” Freeware or log monitoring can be simple, cost-effective ways of collecting large amounts of raw data.
But beware of the perception that data collection is where most of your monitoring costs are. In truth, more data is not necessarily a good thing and data collection is not where the real costs to your organization lie, in any case. To determine real causes of performance problems, you’ll have to balance your desire for fast and inexpensive data collection with your needs for making sense of events.
Making Sense of Events with Event Correlation
This is where processing and analyzing the data that you’ve collected occurs, and where you ask questions like “how frequently should we collect data?” “what format does the data need to be in?” and “what analysis do we need to perform?”
For example, do you just require information about what’s happening right now, or will you require history (e.g. to identify trends, etc.)? How granular does the collected data need to be, meaning, what are you going to do with it (e.g. identify a remediation action, etc.)? Putting some thought into your monitoring objectives is an important element of determining a data collection strategy.
Most monitoring products today provide some level of formatting and reporting of monitoring data in the form of charts and graphs. This is where terms such as “root-cause analysis” or “correlation” are often used. The key question here is who is doing the analysis and how. Relying on highly skilled experts to interpret and analyze monitoring data is where the real costs of monitoring come from, and this is where the confusion really begins: event correlation, or making sense of events.
Approaches to Event Correlation
Most customers assume that when they hear “root-cause analysis” or “correlation” there is some level of automation occurring, but relying on your IT staff to interpret log files or graphs is, clearly, manual analysis or correlation. Manual correlation is time-consuming, labor-intensive, requires expertise, and is not scalable as your infrastructure grows. Herein lies the need for monitoring tools that automate this process.
There are many common approaches to correlation:
A common and traditional approach to event correlation is rule-based, circuit-based or network-based. These forms of correlation involve defining how the events themselves should be analyzed, and a rule-base is built for each combination of events. The early days of network management made use of many of these solutions. As IT infrastructures have evolved, the amount of data collected and the effort required to build rules for every possible event combination have made this approach very cumbersome. The challenge is that you must maintain the rule-base, and with the dynamic nature of today’s environments this is becoming increasingly difficult.
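To make the maintenance burden concrete, here is a minimal sketch of rule-based correlation. The event names, the `Rule` class and the `correlate` function are hypothetical, chosen only for illustration and not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    required: frozenset  # event names that must all be active together
    diagnosis: str       # conclusion reported when the rule fires


# Hypothetical rule base: every event combination needs its own entry.
RULES = [
    Rule(frozenset({"web_slow", "db_slow"}),
         "database is slowing the web tier"),
    Rule(frozenset({"db_slow", "disk_latency_high"}),
         "storage bottleneck under the database"),
]


def correlate(active_events, rules=RULES):
    """Return the diagnosis of every rule whose required events are all active."""
    active = set(active_events)
    return [rule.diagnosis for rule in rules if rule.required <= active]


# A combination that a rule covers produces a diagnosis.
print(correlate(["web_slow", "db_slow"]))
# A combination nobody wrote a rule for produces nothing.
print(correlate(["web_slow", "cache_evictions_high"]))
```

Every new technology or dependency in the environment means more entries in `RULES`, which is exactly the upkeep that makes this approach hard to sustain.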
Another approach is to learn from past events. Terms like “machine learning” and “analytics” have been used for this approach. What’s common is learning behavior from past events, and if these patterns re-occur you can quickly isolate where the issue was the last time it was observed.
These approaches are independent of the technology domain, so no domain knowledge is needed. This may limit the level of detail that can be obtained, and if the patterns have not been learned from experience, then no correlation will take place. The drawback, of course, is that when problems occur in the software layers, many of the event patterns are new. Furthermore, the dynamic nature of today’s environments makes it less likely that these problem patterns will reoccur.
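The limitation of learning from history can be illustrated with a deliberately simplistic stand-in for a learning system: a lookup table of event combinations seen before. Real analytics products use far richer models; the `PatternMemory` class and the event names below are assumptions made up for this sketch.

```python
class PatternMemory:
    """Remembers which cause was found for each set of co-occurring events."""

    def __init__(self):
        self._causes = {}

    def learn(self, events, cause):
        # Order of events does not matter, so store them as a frozenset.
        self._causes[frozenset(events)] = cause

    def diagnose(self, events):
        # Exact match only: a combination never seen before yields None.
        return self._causes.get(frozenset(events))


memory = PatternMemory()
memory.learn(["login_slow", "ldap_timeout"], "directory server overloaded")

# A recurring pattern is recognized immediately.
print(memory.diagnose(["ldap_timeout", "login_slow"]))
# A new pattern cannot be correlated at all.
print(memory.diagnose(["login_slow", "jvm_gc_pause"]))
```

However sophisticated the matching, the underlying constraint is the same: a pattern that has not been observed before gives the tool nothing to go on.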
A third approach is sometimes described with terms like “embedded correlation.” This approach does not use rules per se, but organizes the measurement data using layered and topology-based dependencies in a time-based model. This enables the monitored data to be analyzed based on dependencies and timing of the events, so the accuracy of the correlation improves as new data is obtained.
The advantage of this approach is that users can get very specific, granular, actionable information from the correlated data without having to maintain rule bases or rely on history. And since virtual and cloud infrastructures are dynamic, the dependencies (e.g. between VMs and physical machines) are dynamic too. So the auto-correlating tool must be able to detect these dynamic dependencies and use them for actionable root-cause diagnosis.
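As a rough sketch of the dependency-based idea, assume the tool holds a map of which component depends on which, and treats an alerting component as a probable root cause only if nothing it depends on is also alerting. The topology, the component names and the `root_causes` function are hypothetical, and the timing dimension is left out to keep the example short.

```python
def _reachable(depends_on, start):
    """Return every component that `start` depends on, directly or indirectly."""
    seen, stack = set(), list(depends_on.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, ()))
    return seen


def root_causes(depends_on, alerting):
    """Keep only alerting components with no alerting dependency beneath them."""
    alerting = set(alerting)
    return {c for c in alerting if not (_reachable(depends_on, c) & alerting)}


# Hypothetical layered topology: web -> app -> db -> vm1 -> host1.
topology = {"web": ["app"], "app": ["db"], "db": ["vm1"], "vm1": ["host1"]}
alerts = {"web", "app", "host1"}

# With vm1 on host1, the web and app alerts are explained by the host.
print(root_causes(topology, alerts))

# After vm1 migrates to host2, the same alerts no longer share one cause.
topology["vm1"] = ["host2"]
print(root_causes(topology, alerts))
```

The second call shows why stale dependency data is a problem: once the VM has moved, the host alert and the application alert have to be treated as separate issues.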
The Devil’s Often in the Details
How far should root-cause diagnosis go? This depends on the individual seeking to use the monitoring tool. For some, knowing that the cause of slowness is the high CPU usage of a Java application may be sufficient; they can simply pass the problem on to a Java expert to investigate. On the other hand, the Java expert may want to know which thread and which line of code within the application is causing the issue. This level of diagnosis is desired in real-time, but often, the experts may not be at hand when a problem surfaces. Therefore, having the ability for the monitoring tool to go back in time and present the same level of detail for root-cause diagnosis is equally important.
The level of detail can be the difference between an actionable event and one that requires a skilled IT person to further investigate.
Determine the Appropriate Control Action (Automated IT Operations)
This is where some organizations are focusing, sometimes with limited-to-no evaluation of the monitoring environment. Many IT operations tasks can be automated with limited concern for monitoring, such as automating the provisioning process or request fulfillment.
But if your goal is to automate remediation actions when issues arise, this will involve monitoring. The event management process tends to trigger many processes such as Incident, Problem, Change, Capacity and others. But before you automate remediation tasks, you’ll need to have a high degree of certainty that you’ve correctly identified the root-cause.
The level of detailed diagnostics is relevant here, since without specific detail you may only be able to automate very simple remediation actions (re-start a server, etc.).
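One way to picture this gating is a small policy lookup that only automates when the diagnosis is both specific enough to have a policy and certain enough to trust. The policy names, the confidence score and the 0.9 threshold below are assumptions for illustration, not values from any real tool.

```python
# Hypothetical remediation policies keyed by diagnosis.
POLICIES = {
    "service_hung": "restart_service",
    "disk_full": "purge_temp_files",
}


def choose_action(diagnosis, confidence, policies=POLICIES, threshold=0.9):
    """Pick an automated action, or fall back to a human-handled incident."""
    action = policies.get(diagnosis)
    if action is None or confidence < threshold:
        # Unknown cause or low certainty: do not automate.
        return "open_incident"
    return action


print(choose_action("service_hung", 0.97))   # specific and certain
print(choose_action("service_hung", 0.60))   # specific but uncertain
print(choose_action("java_thread_deadlock", 0.99))  # certain, but no policy exists
```

The fallback matters as much as the automation: an uncertain or unrecognized diagnosis is handed to a person rather than acted on blindly.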
As you begin to populate operational remediation policies (also sometimes called rules), you will need to ensure that you can effectively maintain these policies. Therefore, rule-based correlation approaches can come with risk. Failure to maintain the correlation rules can render the policy rules obsolete. Solutions that can correlate to a high degree of accuracy, as well as eliminate or simplify correlation maintenance, can be an advantage here.
Automated remediation can be a more significant driver of cost savings than simple data collection, but requires us to make sense of events before effectively achieving this goal.
The Future of Root-Cause Analysis & Event Correlation
There’s no question that with the emergence of new technologies such as containers, microservices, IoT and big data, the monitoring world will need to keep pace with growing complexity.
Advances in artificial intelligence and analytics will surely drive continued improvements in monitoring, and we hear a lot about advancements in these areas. But remember, we’ve been down this road before. If you do not understand how the monitor will work, or if it seems too good to be true, be sure to test it in your environment.
As the business relies more and more on IT services, correlation intelligence that can pinpoint the cause of an issue is likely to grow in importance over time.
So, if a solution touts the benefits of root-cause analysis and also provides you with a “war room” at the same time, or promises autonomic IT operations without explaining how it will get to an actionable diagnosis, don’t forget…
…the devil’s in the details.