Virtualization technologies have changed the ground rules for monitoring and managing your IT services.
Most of us in the IT operations world tend to focus on the nuts and bolts of the infrastructure. Our key concerns are: How hot are my Linux servers? How many IOPS are the disks handling? Is Active Directory working? Is DNS working?
On the other hand, end users are focused on the business services that they are accessing. After all, that's what they see and care about to get their jobs done. So a user complaint always relates to the service - bill payment is not working, my CRM service is slow, my online reservation crashed, etc.
Clearly, there is a disconnect between the end users with their focus on the business and the IT operations staff with their IT focus. This disconnect threatens the success of any IT infrastructure transformation initiative to deliver on the user experience and ROI promise. The disconnect is partly caused by the way IT operations teams are organized and managed. A single business service involves multiple infrastructure silos - the firewall tier, the web server tier, the database tier, the server tier, the application tier, and so on. Virtualization is yet another tier added to the mix.
In most organizations, these tiers are staffed by different administrators. These administrators use different, independent tools for administering and monitoring their silos. When a user calls and complains about a service slowdown, very often IT, using a variety of separate silo tools, cannot quickly diagnose the problem.
The IT service manager then often has to guess where the problem originates - is it the network, database, application, storage, or the virtualization tiers? Finding the root-cause of the problem is the first step to remediating the problem in order to bring the service back to normal. Often, troubleshooting is the most time-consuming and expensive step in the incident management process.
Finding the root-cause of a problem, especially in a virtual infrastructure, is easier said than done. Here's why. Take the example of a business service delivered over a typical multi-tier infrastructure: the user connects through a firewall to a web server. The user request is forwarded to a middleware application tier, which then uses a database server to service the request. Suppose the database tier has become 50% slower than normal. Since the application server depends on the database, a slowdown of the database impacts the application server. In turn, the application server problem impacts the web server and, ultimately, the user experience.
Evidently, a single problem in one tier can impact all the other tiers and the IT service manager then has to determine where the problem originated. If all the applications were hosted on physical servers, using the inter-dependency information, we could have fairly quickly concluded that the database tier is the root-cause of the problem.
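The dependency-based reasoning for the physical case can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular monitoring product works: the tier names, the `DEPENDS_ON` map, and the `find_root_causes` function are all hypothetical, and the sketch assumes each tier has already been flagged as healthy or degraded.

```python
# Hypothetical dependency map: each tier lists the tiers it depends on.
DEPENDS_ON = {
    "web": ["app"],
    "app": ["database"],
    "database": [],
}


def find_root_causes(degraded, depends_on):
    """Return degraded tiers whose dependencies are all healthy.

    A degraded tier that depends on another degraded tier is treated as a
    victim; a degraded tier with only healthy dependencies is a likely
    root cause.
    """
    return sorted(
        tier
        for tier in degraded
        if not any(dep in degraded for dep in depends_on.get(tier, []))
    )


# All three tiers look slow, but only the database has no degraded dependency.
print(find_root_causes({"web", "app", "database"}, DEPENDS_ON))
```

With all three tiers reported as slow, the sketch singles out the database, because it is the only degraded tier that does not itself depend on another degraded tier.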
Let's now look at how the use of virtualization makes the diagnosis problem more challenging. Suppose the virtualized Oracle database server from the previous example is hosted on the same physical machine as a virtualized Citrix server and a media server. If there is a sudden surge of requests to stream videos from the media server, the resulting load on the physical server's disk can choke the machine. This in turn will impact the performance of the virtualized database server sharing the same physical resource, and result in poor performance for database queries.
Here is an example in which an application (namely, the media server) that has no relation to the business service we are looking at is impacting the service. Virtualization has introduced a new type of dependency - since all the VMs running on a physical server share the resources of the server, a single malfunctioning VM can impact the performance experienced by all the other VMs.
In our example, a performance management system has to be virtualization-aware to arrive at the right conclusion and fix the root cause. In this case, the database server is not the cause of the problem; the media server is.
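One way to picture what "virtualization-aware" means here is a check that looks beyond the slow VM to the other VMs on the same physical host. The sketch below is a simplified illustration: the host names, the `VM_HOST` and `DISK_IO_SHARE` tables, the 50% threshold, and the `noisy_neighbors` function are all assumptions made up for this example.

```python
# Hypothetical placement of VMs on physical hosts.
VM_HOST = {
    "database": "host-1",
    "citrix": "host-1",
    "media": "host-1",
    "web": "host-2",
}

# Hypothetical fraction of its host's disk I/O that each VM is consuming.
DISK_IO_SHARE = {"database": 0.10, "citrix": 0.05, "media": 0.80, "web": 0.20}


def noisy_neighbors(victim, vm_host, io_share, threshold=0.5):
    """Return VMs on the victim's host that use more than `threshold` of its disk I/O."""
    host = vm_host[victim]
    return sorted(
        vm
        for vm, vm_on in vm_host.items()
        if vm_on == host and vm != victim and io_share.get(vm, 0.0) > threshold
    )


# The database VM is slow; check who else is on its physical host.
print(noisy_neighbors("database", VM_HOST, DISK_IO_SHARE))
```

Here the slow database VM leads back to the media server, a VM that does not appear anywhere in the business service's own dependency chain.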
We haven't even started talking about live migration, by which VMs can be dynamically moved from one physical machine to another. Traditionally, management systems have been designed to monitor infrastructures where each application stays on the same machine. With live migration, an application can be moved from one physical machine to another in real time. Management systems now have to account for this dynamic behavior as well.
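To account for live migration, a management system needs a VM-to-host map that changes over time rather than a fixed inventory. The following sketch shows the idea with a hypothetical `PlacementTracker` class; the class, its method names, and the host names are illustrative assumptions, and a real system would receive migration events from the virtualization platform.

```python
class PlacementTracker:
    """Keeps a current VM-to-host map, updated from migration events."""

    def __init__(self, initial_placement):
        self._host_of = dict(initial_placement)

    def on_migration(self, vm, new_host):
        """Record that `vm` now runs on `new_host`; return the previous host."""
        old_host = self._host_of.get(vm)
        self._host_of[vm] = new_host
        return old_host

    def co_located(self, vm):
        """Return the other VMs currently sharing a physical host with `vm`."""
        host = self._host_of[vm]
        return sorted(v for v, h in self._host_of.items() if h == host and v != vm)


tracker = PlacementTracker({"database": "host-1", "media": "host-1", "web": "host-2"})
print(tracker.co_located("database"))  # neighbors before the migration
tracker.on_migration("media", "host-2")
print(tracker.co_located("database"))  # neighbors after the migration
```

Once the media server moves to another host, it no longer counts as a neighbor of the database VM, so any diagnosis based on shared physical resources has to use the updated map.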
Traditional static approaches to performance monitoring and management are blind to this new reality. While management of business services that spanned multiple physical tiers was already a challenge, the introduction of virtualization has made the problem much, much harder.
In addition to monitoring every layer of every tier and automating root-cause diagnosis, performance management automation can also help you determine the baselines of your infrastructure. Whenever actual usage deviates from the norm, you receive pre-emptive alerts that help you identify and resolve issues before users notice them.
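A baseline check of this kind can be as simple as comparing the current reading with the mean and standard deviation of recent history. The sketch below assumes that approach purely for illustration; the `deviates_from_baseline` function, the sample CPU readings, and the three-standard-deviation threshold are hypothetical, and real products may use more sophisticated, time-of-day-aware baselines.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, current, num_stdevs=3.0):
    """Return True if `current` is more than `num_stdevs` deviations from the mean.

    `history` must contain at least two past readings of the same metric.
    """
    baseline = mean(history)
    spread = stdev(history)
    return abs(current - baseline) > num_stdevs * spread


# Hypothetical CPU utilization readings (percent) from a normal period.
cpu_history = [40, 42, 41, 39, 43, 40, 41]

print(deviates_from_baseline(cpu_history, 42))  # within the normal range
print(deviates_from_baseline(cpu_history, 75))  # far above the normal range
```

A reading of 42% falls within the usual range and raises no alert, while 75% is well outside it and would be flagged before it necessarily shows up as a user complaint.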
Another value of an automated management solution is that it allows you to right-size your infrastructure so you can invest wisely. Its analytics and reports help you pinpoint exactly where the resource bottlenecks are and how you can alleviate them.
A virtualization-aware performance management solution can provide several benefits. You can deliver a great experience for users and enhance the uptime of key services, without compromising on ROI. Above all, you can accelerate the adoption and success of new transformational IT technologies such as virtual servers, virtual desktops, and cloud computing.