Establishing an AIOps Strategy in 3 Practical Steps

What is AIOps

In an earlier blog, I provided an introduction to AIOps. AIOps is the application of Artificial Intelligence to IT Operations. Many people misunderstand AIOps as replacing or mimicking human intelligence. This is not what AIOps is about. Rather, AIOps seeks to apply algorithms to solve specific problems, often much faster, much more accurately, and at a much larger scale than a human ever could.

As applications have become more distributed and complex, and as the infrastructure those applications run on has become more distributed and complex (often spanning data centers, public cloud and edge computing), it has become impractical to keep applications performing reliably and efficiently at scale without AIOps. Enterprises that adopt AIOps are finding that their employees are more productive and spend more time on innovation when AIOps frees them from troubleshooting and other fire-fighting activities.

This blog covers how to establish your AIOps strategy in just three steps.

1. Understand your Data Ingestion Requirements and Data Sources

Any AIOps strategy depends upon the quality and quantity of the data available to learn from. No matter how good the machine learning and analysis capabilities are, the right data must be collected, at the right level of detail, before the tool can reach sound conclusions that can then be used to make the right decisions and take the right actions. Otherwise, it will be a case of “garbage in, garbage out”.

Figure: How a monitoring agent collects performance metrics

There is no single source of data about the performance of IT applications and infrastructure. Network metrics can be obtained using SNMP. System metrics require agents on the systems themselves and the use of WMI, perfmon, OS commands and the like. Virtualization monitoring uses the APIs supported by the respective vendors. Log files provide insights into application performance. AIOps tools need valid and relevant metrics for every layer and every tier of the infrastructure. Pay attention to the level of detail collected, and remember that not every agent is the same! Collecting the right metrics is almost as important as analyzing them correctly.

Data polling/sampling rates need to be set at appropriate levels: frequent enough to detect significant events, but not so frequent that excessive sampling adds to bandwidth costs, inflates data set sizes and increases the processing needed without any benefit.
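
To make this trade-off concrete, here is a minimal sketch that estimates the daily data volume at different polling intervals and checks whether a short-lived event is guaranteed to be sampled. The metric count, the bytes per sample and the 90-second spike are illustrative assumptions, not figures from any particular monitoring tool.

```python
def samples_per_day(interval_seconds: int) -> int:
    """Number of samples one metric produces per day at a fixed polling interval."""
    return 86_400 // interval_seconds


def daily_volume_bytes(metric_count: int, interval_seconds: int, bytes_per_sample: int) -> int:
    """Raw (uncompressed) data volume generated per day."""
    return metric_count * samples_per_day(interval_seconds) * bytes_per_sample


def can_detect(event_duration_seconds: int, interval_seconds: int) -> bool:
    """An event is guaranteed to be sampled at least once only if it
    lasts at least one full polling interval."""
    return event_duration_seconds >= interval_seconds


if __name__ == "__main__":
    METRICS = 10_000        # assumed number of metric series
    BYTES_PER_SAMPLE = 16   # assumed size of one timestamp + value pair
    SPIKE_SECONDS = 90      # assumed duration of the event we care about

    for interval in (10, 60, 300):
        volume_mb = daily_volume_bytes(METRICS, interval, BYTES_PER_SAMPLE) / 1_000_000
        print(f"{interval:>3}s polling: {volume_mb:,.1f} MB/day, "
              f"guaranteed to see the spike: {can_detect(SPIKE_SECONDS, interval)}")
```

Under these assumptions, moving from 60-second to 10-second polling multiplies the data volume by six without improving detection of a 90-second event, while 300-second polling could miss it entirely.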

Licensing may also play a part in your choice of toolset. Many cloud-native monitoring systems are licensed by the volume of data processed or by the storage and bandwidth that the collected metrics consume. Licensing by sessions, hosts and infrastructure components is much simpler and removes the need to cap data ingestion or to closely watch variable monitoring costs with cost-estimating calculators or similar.

2. Define the Key Business Benefits of AIOps and How to Quantify Success

Many organizations find it helpful to determine how they will measure success within their AIOps strategy. Here we have listed a few questions and metrics that you may find useful to help define how you quantify the ROI and benefits of your strategy.

Key benefits and how to measure success:
Consolidate and ingest data sets from multiple data sources and analyze the overall system
  • What % of data is included within the overall platform?
  • How many monitoring tools and user interface windows do staff have to work with?
  • What data or processes are outside of the main systems?
Improved service delivery via quicker and more consistent incident resolution
  • Track metrics such as: Mean Time to Detect (MTTD), Mean Time to Investigate (MTTI) and Mean Time to Resolution (MTTR)
  • Set targets for consistency and the removal of unevenness (“Mura”) within processes; occasional extreme resolution times are often more troublesome than resolution times that are moderately above normal.
  • Track cost savings associated with reduced incidents and MTTR
Minimize and avoid downtime to maximize end-user/customer satisfaction and confidence
  • Monitor SLA (Service Level Agreement) targets and actual levels achieved
  • Track where proactive measures prevented users from experiencing problems vs. issues resolved only after users became aware of them
Optimize hardware usage and software licenses
  • Record cost savings from avoiding unused licenses (“shelf-ware”)
  • Maintain KPIs for % capacity and consolidation
Free staff from mundane tasks and give them good tools so that they can concentrate on more interesting work and be more productive. This can lead to business benefits from higher morale and enthusiasm and lower staff turnover.
  • Baseline and monitor staff productivity
  • Staff satisfaction surveys
  • Monitor staff turnover and reasons for staff churn
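
As an illustration of the incident metrics listed above, here is a minimal sketch that derives MTTD, MTTI and MTTR from incident timestamps, along with a spread figure as a rough proxy for unevenness. Definitions of these metrics vary between organizations; this sketch assumes MTTD runs from incident start to detection, MTTI from detection to diagnosis, and MTTR from incident start to resolution. The `Incident` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, pstdev


@dataclass
class Incident:
    started: datetime    # when the problem actually began
    detected: datetime   # when monitoring (or a user) raised it
    diagnosed: datetime  # when the cause was identified
    resolved: datetime   # when service was restored


def _minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60


def incident_kpis(incidents: list[Incident]) -> dict[str, float]:
    """Summarize a non-empty list of incidents (all values in minutes)."""
    resolution = [_minutes(i.resolved - i.started) for i in incidents]
    return {
        "mttd": mean(_minutes(i.detected - i.started) for i in incidents),
        "mtti": mean(_minutes(i.diagnosed - i.detected) for i in incidents),
        "mttr": mean(resolution),
        # Spread and worst case expose unevenness that the mean hides.
        "resolution_spread": pstdev(resolution),
        "worst_resolution": max(resolution),
    }


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 9, 0)  # made-up sample data
    m = lambda n: timedelta(minutes=n)
    sample = [
        Incident(t0, t0 + m(5), t0 + m(20), t0 + m(30)),
        Incident(t0, t0 + m(15), t0 + m(45), t0 + m(90)),
    ]
    for name, value in incident_kpis(sample).items():
        print(f"{name}: {value:.1f} min")
```

Tracking the spread and the worst case alongside the means is one way to set targets for consistency rather than for averages alone.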

3. Evaluate the Key Capabilities of AIOps Solutions that You will Deploy

Any AIOps solution should offer a number of key capabilities. Determine which of these key capabilities is most important for you. This choice will depend on the key business benefits you have decided to focus on.

Figure: Key elements of an AIOps solution

  • If you have a large number of tools already in place and your priority is to get the most out of your current investments, then focus on a “manager of managers” solution: one that will ingest data from multiple disparate tools and try to make sense of these diverse data sets. On the other hand, if you are seeking better monitoring in a specific domain (e.g., Java web applications), you should evaluate the breadth and depth of metrics that the target solutions provide in this domain. For example, can they monitor the JVM in depth (memory, threads, blocking, etc.)? Can they provide code-level visibility?

  • If you are looking to make your IT operations proactive, you should focus on auto-baselining. If the AIOps tool can auto-compute baselines, you won’t have to spend time configuring every threshold setting, and you can let the tool monitor and alert when abnormal situations occur.
    Figure: How auto-baselining works. The yellow lines represent the automatically determined baselines.
  • If you are tired of seeing hundreds of alerts each day and would like to focus on just the alerts that matter, then event correlation and root-cause diagnosis capabilities matter to you. The question to ask is whether the monitoring tool can automatically prioritize between different alerts and highlight just the problems that your IT team needs to focus on.
  • Finally, if you find that the same issues happen again and again and you’d like to reduce the manual effort that your IT team has to perform each time, you will be interested in automation capabilities of the AIOps toolset to enable auto-correction and self-healing.
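
To give a feel for what auto-baselining involves, here is a minimal sketch, not the method of any particular product. It assumes a simple approach: learn a separate normal band for each hour of the day from historical values (mean plus or minus `k` standard deviations) and flag values that fall outside the band for that hour. The function names and the choice of `k = 3` are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def learn_baselines(history, k=3.0):
    """Learn a (low, high) band per hour of day.

    history: iterable of (hour_of_day, value) pairs from past measurements.
    k: how many standard deviations from the mean still count as normal.
    """
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)

    baselines = {}
    for hour, values in by_hour.items():
        if len(values) < 2:
            continue  # not enough data to estimate the spread
        mu, sigma = mean(values), stdev(values)
        baselines[hour] = (mu - k * sigma, mu + k * sigma)
    return baselines


def is_abnormal(baselines, hour, value):
    """True if the value falls outside the learned band for that hour."""
    band = baselines.get(hour)
    if band is None:
        return False  # no baseline learned yet, so do not alert
    low, high = band
    return value < low or value > high


if __name__ == "__main__":
    # Made-up history: about 50 concurrent sessions at 09:00, about 10 at 02:00.
    history = [(9, v) for v in (48, 50, 52, 50)] + [(2, v) for v in (9, 10, 11, 10)]
    bands = learn_baselines(history)
    print(is_abnormal(bands, 9, 50))  # typical for 09:00
    print(is_abnormal(bands, 2, 50))  # same value, but unusual for 02:00
```

The same value can be normal at one time of day and abnormal at another, which is why a single static threshold per metric tends to produce either missed problems or false alerts.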


There’s been a lot of research and discussion around AIOps technologies. In this blog, I’ve discussed three practical steps in any organization’s AIOps strategy. In the next blog, I will discuss how the eG Enterprise monitoring, diagnosis and analytics solution fits into an AIOps strategy.