In this blog we’re going to look at the top 10 VMware performance metrics that you should be monitoring and explain the impact they have on the speed and responsiveness of your virtualized environments.
Did you know? VMware vSphere is by far the leading server virtualization technology, holding over 75% of the hypervisor market worldwide. That puts it far ahead of Microsoft and Citrix.
Virtualization technology is being widely adopted thanks to the flexibility, agility, reliability and ease of administration it offers. At the same time, any IT technology – hardware or software – is only as good as its maintenance and upkeep, and VMware vSphere virtualization is no different. With physical machines, failure or poor performance of a machine affects the applications running on that machine. With virtualization, multiple virtual machines (VMs) run on the same physical host and a slowdown of the host will affect applications running on all of the VMs. Hence, performance monitoring is even more important in a VMware vSphere virtualized infrastructure than it is in a physical infrastructure.
Performance Monitoring in a VMware Virtual Infrastructure is Even More Important than in a Physical Infrastructure
The performance of applications running on VMs in a VMware vSphere infrastructure depends on many factors:
- Physical resources from the underlying hosts are shared by VMs. If a few VMs consume an excessive amount of resources (CPU, memory, disk), the other VMs may not have access to resources when they need them. This, in turn, affects application performance on the other VMs.
- Administrators can cap the resources available to VMs. If the caps are not correctly set, this can choke the performance of applications on these VMs.
- Administrators often over-commit resources on the physical hosts, since not all VMs running on these hosts need the resources at the same time. While over-commitment ensures better utilization of hardware resources, administrators need to monitor actual utilization levels on the hosts to identify and correct situations where a physical host is resource-starved and, as a result, the performance of the VMs running on it is affected.
Over-allocation of resources to a VM is not the answer either. Firstly, over-allocation results in under-utilization of the underlying hardware, thereby yielding poor return on investment. Secondly, allocating too much CPU to a VM can cause it to stall waiting for sufficient CPU resources to be available, thereby affecting performance.
So, how does one determine what would be the right amount of resources to allocate to a VM? The answer to that question lies in tracking the resource usage of VMs over time, determining the norms of usage and then right-sizing the VMs accordingly.
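The right-sizing approach described above can be sketched in a few lines. This is only an illustration, assuming CPU usage samples have already been collected as a percentage of the VM's current allocation. The function name, the 95th-percentile norm, and the 20% headroom are illustrative assumptions, not VMware recommendations.

```python
import math

def recommend_vcpus(usage_pct_samples, current_vcpus, percentile=95, headroom=0.20):
    """Suggest a vCPU count from historical CPU usage (percent of current allocation).

    All thresholds are hypothetical defaults chosen for illustration.
    """
    if not usage_pct_samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(usage_pct_samples)
    # Nearest-rank percentile, computed with integer ceiling division.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    norm_pct = ordered[rank - 1]
    # vCPUs needed at the usage norm, plus headroom, rounded up to a whole vCPU.
    needed = current_vcpus * (norm_pct / 100) * (1 + headroom)
    return max(1, math.ceil(needed))
```

For instance, a VM with 8 vCPUs that sits at 10% usage for 95% of the time would be recommended a much smaller allocation, while a VM that is pegged at 100% would be recommended more vCPUs than it has today.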
But how does one track the resource usage metrics for VMs and which ones are important? VMware vSphere comprises many different resource components. Knowing what these components are and how each component influences resource management decisions is key to efficiently managing VM performance. In this blog, we will discuss the top 10 VMware performance metrics that every VMware vSphere administrator must continuously track.
Top 10 VMware Performance Metrics For VMware Admins
#1 Memory Ballooning in VMware Hypervisor
Memory ballooning is a memory reclamation technique used by the VMware hypervisor to allow the physical host system to reclaim unused memory from VMs, which means VMs that are experiencing a memory shortage can use the reclaimed memory.
Typically, the VMware vSphere hypervisor assigns a portion of the physical host’s memory to each VM. The guest operating system, which runs inside a VM, is unaware of the total memory available to the physical host. Memory ballooning makes the guest operating system aware of the host’s memory shortage. Whenever the physical host is faced with contention for memory, the balloon driver installed in the guest operating system determines whether unused memory can be reclaimed from any VM. The driver then inflates/balloons the memory resource on a VM that is under-utilizing its memory, and then, prompts the hypervisor to reclaim this unused memory from that VM. The hypervisor then makes this excess memory available to any memory-starved VM on the host.
Ballooning enables the efficient utilization of physical memory but at the cost of VM performance. This is because excessive memory ballooning can leave the guest operating system short of memory and force it to page to and read from disk, and high disk I/O can degrade VM performance. To prevent excessive memory ballooning, administrators should continuously track how much memory the hypervisor is reclaiming from the VMs and ensure that it does not grow too close to the ballooning target that has been set. Monitoring the VMs and the guest operating system alone will offer little help in this regard. One must monitor ballooning at the hypervisor level to proactively detect and control excesses.
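As a rough sketch of the check described above, the snippet below compares each VM’s ballooned memory with its balloon target. It assumes both values have already been read from the hypervisor (in vSphere, the balloon and balloon target memory counters). The dictionary keys and the 90% warning ratio are hypothetical choices for illustration.

```python
def balloon_alerts(vms, warn_ratio=0.9):
    """Return names of VMs whose ballooned memory is at or above warn_ratio of the target.

    `vms` is a list of dicts with hypothetical keys:
    "name", "ballooned_kb", and "balloon_target_kb".
    """
    alerts = []
    for vm in vms:
        target = vm["balloon_target_kb"]
        if target <= 0:
            continue  # no ballooning has been requested for this VM
        if vm["ballooned_kb"] / target >= warn_ratio:
            alerts.append(vm["name"])
    return alerts
```

The ratio-based check is deliberately simple; a production monitor would more likely alert on a sustained trend rather than a single sample.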
#2 Memory Swapping
Memory swapping happens when the memory state of the VMware vSphere server is ‘hard’ or ‘low.’ VMware vSphere memory switches to one of these states when reclamation techniques such as ballooning, page sharing, and compression have been unable to keep pace with the rate at which VMs allocate memory. At this juncture, vSphere resorts to memory swapping.
Swapping happens at both the guest OS level and the hypervisor level.
- In hypervisor-level swapping, memory pages on the VMs are swapped out to a swap space on the hypervisor. Each VM is linked to its own swap space. When the guest operating system accesses a memory page from the swap space, vSphere handles the access by swapping in that page from the swap space. vCPU waits can increase during swap-in operations, causing a negative impact on VM performance. Moreover, insufficient swap space can also degrade VM performance.
- In guest OS-level swapping, every time the CPU accesses a virtual memory page on the guest OS, that memory page is swapped into physical memory. This way, virtual memory pages that are frequently accessed become available in physical memory, so that they can be served quickly. Memory pages that are seldom used are swapped out to storage. Swapping therefore carries the risk of high disk I/O and slow computation, owing to the frequent reads and writes and the high rate of swapping between physical memory and storage.
Monitoring solutions that focus on VM performance alone will be able to capture VM slowness but will not be able to diagnose its root cause. An ideal VMware monitoring solution is one that can track swap-in and swap-out rates at the hypervisor level and at the guest operating system level, auto-correlate these VMware performance metrics, and accurately pinpoint what is ailing VM performance. It’s also important to track the memory configuration and reservation per VM, as that will give administrators a fair sense of how much swap space is at a VM’s disposal.
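The correlation described above can be illustrated with a small classifier. This sketch assumes swap-in rates for the same VM and time window have been collected separately from the hypervisor and from inside the guest. The function name, the parameter names, and the zero default threshold are assumptions made for illustration.

```python
def diagnose_swapping(host_swap_in_kbps, guest_swap_in_kbps, threshold_kbps=0.0):
    """Classify where swap-in activity is occurring for one VM.

    Both rates are in KB/s for the same time window; the threshold is hypothetical.
    """
    host_active = host_swap_in_kbps > threshold_kbps
    guest_active = guest_swap_in_kbps > threshold_kbps
    if host_active and guest_active:
        return "hypervisor and guest swapping"
    if host_active:
        return "hypervisor swapping"
    if guest_active:
        return "guest OS swapping"
    return "no swapping"
```

Read loosely, hypervisor swapping points to memory contention on the host, whereas swapping that shows up only inside the guest suggests the VM itself is short of configured memory.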
#3 VM CPU Wait and VM CPU Ready
A VM’s virtual CPU (vCPU) can be in one of four basic states: run, wait, co-stop and ready.
From a performance monitoring standpoint, it is imperative that administrators know when and for how long a VM has been in the vCPU wait and ready states.
vCPU Wait Time
A VM waiting for a task to complete may not require its vCPU immediately. The time for which the VM kept its vCPU waiting for this purpose is the vCPU wait time. Typically, a VM could wait because it has nothing to do until an event occurs, such as the arrival of a network packet or the expiry of a timer. This is called an idle wait. Highs and lows in the idle wait time are insignificant as they do not imply a problem condition. On the other hand, if the VM is waiting for a read/write on the storage to complete and cannot do anything else until it completes, it is called an I/O wait. Unlike idle waits, I/O waits have a performance impact: the longer the I/O wait, the slower the VM’s operations. I/O waits are also indicative of unavailable, overloaded, or high-latency storage. Hence, it is important that administrators track vCPU I/O wait time per VM.
vCPU Ready Time
vCPU ready time is the percentage of time a VM was ready to run but could not get a physical CPU to run on. One of the common causes of high vCPU ready time is over-subscription. If a VM is allocated more vCPUs than there are physical CPUs (pCPUs) available to run them, then during times of heavy load, when ideally all vCPUs would run full time, many vCPUs may not run for want of pCPUs. As a result, the VM and the applications running on it will run short of processing power, which, in turn, will degrade VM performance. Therefore, it is important to track the vCPU ready time of each VM. A value over 5% for a VM indicates that the VM is being slowed down. You can correlate this metric with the host’s CPU usage to figure out whether there was contention for physical CPU resources around the same time the vCPU ready time spiked. If so, you can conclude that the VM is over-subscribing to the host’s CPU resources. For corroboration, you can also monitor the number of pCPUs available to the host and the count of vCPUs allocated to each VM. This will point you to oversized VMs and prompt you to resize such VMs, so that vCPU ready time can be minimized. The recommended vCPU to pCPU ratio is between 1:1 and 3:1.
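The 5% threshold and the vCPU to pCPU ratio above are easy to check once the raw numbers are in hand. The sketch below assumes ready time is reported the way vCenter performance charts report it, as a summation in milliseconds per sampling interval (20 seconds for real-time charts), and converts that to a percentage. The function names and dictionary-free inputs are illustrative assumptions.

```python
def cpu_ready_percent(ready_ms, interval_s=20, num_vcpus=1):
    """Convert a CPU ready summation (ms) into a percentage of the sampling interval.

    Dividing by num_vcpus gives the average per vCPU when the summation
    covers all vCPUs of the VM.
    """
    if interval_s <= 0 or num_vcpus <= 0:
        raise ValueError("interval_s and num_vcpus must be positive")
    return ready_ms * 100 / (interval_s * 1000 * num_vcpus)

def vcpu_to_pcpu_ratio(vcpus_per_vm, pcpus):
    """Ratio of all vCPUs allocated on a host to the host's physical CPUs."""
    if pcpus <= 0:
        raise ValueError("pcpus must be positive")
    return sum(vcpus_per_vm) / pcpus

def is_ready_time_high(ready_ms, interval_s=20, num_vcpus=1, limit_pct=5.0):
    """True when ready time exceeds the 5% guideline mentioned in the text."""
    return cpu_ready_percent(ready_ms, interval_s, num_vcpus) > limit_pct
```

For example, a ready summation of 1,000 ms in a 20-second interval works out to 5%, which sits right at the guideline, and a host with 16 pCPUs running VMs that total 48 vCPUs is at the top of the 1:1 to 3:1 range.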
#4 Large and Old VM Snapshots
A snapshot captures the entire state of the virtual machine at the time the snapshot is taken. It includes the contents of the virtual machine’s memory, virtual machine settings, and the state of all the virtual disks of the virtual machine.
After a snapshot is taken, any change that needs to be made to the original virtual disk (VMDK) is first written to a growing snapshot file. Depending upon the level of activity on the VM, over time, this snapshot file can even grow to the size of the original virtual disk file. Where there are multiple snapshot files, their combined disk space usage can even exceed the size of the original virtual disk file. If enough disk space is not provisioned to a VM, then large snapshots can cause the snapshot storage location to run out of space, thereby adversely impacting VM performance. What is worse is that one or more hyper-active VMs using the same datastore can even spawn snapshot files that grow to consume the entire datastore space! This can severely hit the performance of all other VMs using that datastore. Therefore, administrators should keep an eye out for snapshot files that are abnormally large, check their contents to see if the changes they hold have been committed to disk already, and remove a snapshot file without any uncommitted changes, as such a file is no longer useful. This will help conserve storage space and ensure peak performance of VMs.
VMware also recommends that a snapshot file not be used for more than 72 hours. Besides unnecessarily hogging storage space, old snapshot files can also cause issues in version control for applications and VMs. To ensure that such snapshots do not affect VM performance, it’s best to continuously track the age of snapshot files, isolate the old/obsolete ones, and remove them.
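A snapshot review along these lines might look like the following sketch. It assumes a snapshot inventory (name, creation time, snapshot file size, and base disk size) has already been gathered. The dictionary keys and the 50% size ratio are hypothetical, while the 72-hour limit comes from the recommendation above.

```python
from datetime import datetime, timedelta

def review_snapshots(snapshots, now, max_age=timedelta(hours=72), max_size_ratio=0.5):
    """Flag snapshots that are older than max_age or large relative to the base disk.

    Each snapshot is a dict with hypothetical keys:
    "name", "created" (datetime), "size_gb", and "base_disk_gb".
    """
    findings = []
    for snap in snapshots:
        reasons = []
        if now - snap["created"] > max_age:
            reasons.append("old")
        if snap["size_gb"] > max_size_ratio * snap["base_disk_gb"]:
            reasons.append("large")
        if reasons:
            findings.append((snap["name"], reasons))
    return findings
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the check repeatable against historical inventory data.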
#5 Idle/Orphaned VMs
Idle/zombie VMs are those VMs that remain running and continue consuming valuable CPU, memory, and storage resources, even though they are no longer used. For example, let’s say a VM is assigned to an employee who later resigns. If that VM is neither decommissioned nor reassigned to another user, it becomes an idle VM.
Orphaned VMs are those that exist as data in the VMware vCenter server database but have either been deleted or are no longer registered with the host. Sometimes, a single VMDK disk or individual files can be orphaned. Some of the common causes for this unwanted scenario are:
- A host failover or DRS migration that failed.
- Removal of a VM from inventory when connected directly to the VMware vSphere server instead of through VMware vCenter.
- Restoration of a vCenter server or its database from a backup or a snapshot.
Both idle VMs and orphaned VMs unnecessarily drain physical resources, causing the performance of active VMs to suffer. Moreover, the proliferation of such VMs results in virtualization sprawl, or VM sprawl, a condition where the count of VMs reaches unmanageable proportions. Monitoring the count and status of VMs on a host will help administrators isolate and reclaim unused resources and enable them to effectively manage VM performance.
#6 VM Disk Read/Write IOPS and Throughput
The most common yet accurate indicators of virtual disk health are disk throughput and disk IOPS. The level of throughput a virtual disk can deliver and the number of read/write operations it can support in one second determine how quickly the virtual disk can process commands or I/O requests. If a virtual disk is not sized with adequate throughput or I/O processing power, the VM using that virtual disk and the applications operating on that VM will experience significant slowness. Moreover, if a VM/application generates more I/O than its virtual disk is configured to support, it increases the pressure on the vCPU and virtual memory of that VM. This in turn can cause the VM to consume more physical CPU and memory, thereby causing other VMs to contend for limited physical resources and degrading their performance as well. To preempt such problems, administrators should closely monitor throughput and IOPS on each virtual disk, time-correlate these values with CPU and memory usage, and proactively determine if the storage must be resized.
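Where a monitoring source exposes cumulative counters rather than ready-made rates, IOPS and throughput can be derived from two successive samples. The sketch below assumes each sample carries a timestamp in seconds plus cumulative read, write, and byte counts; the key names are hypothetical.

```python
def disk_rates(prev, curr):
    """Derive read/write IOPS and throughput (KB/s) from two cumulative samples.

    Each sample is a dict with hypothetical keys:
    "timestamp_s", "reads", "writes", and "bytes".
    """
    elapsed = curr["timestamp_s"] - prev["timestamp_s"]
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return {
        "read_iops": (curr["reads"] - prev["reads"]) / elapsed,
        "write_iops": (curr["writes"] - prev["writes"]) / elapsed,
        "throughput_kbps": (curr["bytes"] - prev["bytes"]) / elapsed / 1024,
    }
```

Rates computed this way can then be lined up against CPU and memory usage for the same interval, which is the time-correlation step described above.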
#7 Datastore Capacity Usage and Availability
VMware vSphere uses datastores to store all files associated with its virtual machines. A datastore is a logical storage unit that can use disk space on one physical device, one disk partition, or span several physical devices.
Without a datastore, VMware vSphere cannot provision VMs. If a datastore becomes unavailable suddenly, then users will be denied access to all VMs/applications using that datastore. To assure users of uninterrupted access to their VMs/applications, administrators should keep tabs on datastore status, promptly detect its unavailability, quickly isolate its root-cause, and fix it.
Excessive usage of the disk space in datastores can also result in significant degradation in VM performance. If more than 75% of a datastore’s disk space is utilized, it signals a potential ‘fight-for-space’ among the VMs sharing that datastore. In such situations, administrators should quickly identify the VM that is hungry for space and understand why it is consuming so much of it. Otherwise, other VMs using the same datastore can suffer serious performance setbacks.
Issues related to datastore availability and space usage become more pronounced where the datastore is configured on external storage such as SAN/NAS arrays. In this case, a misconfiguration, a snag in the internal operations of the external storage device, or a loss of communication with it can also impact datastore health. Administrators should therefore be able to monitor the individual storage arrays along with the VMs and datastores, intelligently correlate issues across the virtualization and storage tiers, and accurately isolate where the bottleneck lies.
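The 75% check and the hunt for the space-hungry VM can be sketched as follows, assuming capacity, free space, and per-VM space usage have already been collected for each datastore. The dictionary keys are hypothetical names used only for this example.

```python
def datastore_usage_pct(capacity_gb, free_gb):
    """Percentage of a datastore's capacity that is currently used."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return (capacity_gb - free_gb) * 100 / capacity_gb

def crowded_datastores(datastores, threshold_pct=75.0):
    """Return (name, used %, biggest consumer VM) for datastores above the threshold.

    Each datastore is a dict with hypothetical keys:
    "name", "capacity_gb", "free_gb", and "vm_usage_gb" (VM name -> GB used).
    """
    flagged = []
    for ds in datastores:
        used = datastore_usage_pct(ds["capacity_gb"], ds["free_gb"])
        if used > threshold_pct:
            usage = ds["vm_usage_gb"]
            # The VM holding the most space is the first one to investigate.
            top_vm = max(usage, key=usage.get, default=None)
            flagged.append((ds["name"], round(used, 1), top_vm))
    return flagged
```

The biggest consumer is only a starting point for investigation; a VM may be large for perfectly good reasons, so growth over time is usually the more telling signal.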
#8 VM Network Connectivity
When a user complains that a VM is inaccessible or slow, the reason may not always be that the VM has been powered off or is experiencing an internal resource contention. Often, such issues can be attributed to a momentary or prolonged break in the network connection, or to a high-latency network connection to the VM. Therefore, monitoring the internal health of the VM alone will not suffice. It is also important that administrators monitor the connectivity to each VM from an external perspective. This perspective is even more useful when the users of your virtualized environment come from different geographies. External connectivity monitoring in such environments will point administrators to specific geographies that are facing persistent connectivity issues. Tracking the status and performance of virtual switches and virtual ports also helps troubleshoot connectivity issues effectively.
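One simple form of external connectivity check is to time a TCP connection to a port the VM is expected to serve. The sketch below uses only the Python standard library and is meant to be run from the locations where users sit. The function name and the choice of a TCP connect as the probe are assumptions for illustration, not a prescribed method.

```python
import socket
import time

def tcp_connect_latency_ms(host, port, timeout=3.0):
    """Return the TCP connect time in milliseconds, or None if the VM is unreachable."""
    start = time.perf_counter()
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return None
```

Running the same probe from several sites and comparing the results is one way to tell a VM-side problem (every site fails) from a network-side one (only some sites fail or slow down).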
#9 Hardware Health
Failure of hardware deals a fatal blow to the health of a vSphere host and the VMs. Processors that are down, fans that have stopped running, sudden and significant spikes in temperature/voltage of hardware, memory partitions that have failed etc., can instantly and grievously injure a physical host, bringing down both the host and the VMs on it. Prompt detection of and speedy recovery from hardware errors are hence crucial.
Like host hardware, the VM’s hardware status should also be tracked, as hardware failures experienced by a VM can adversely impact VM availability and performance.
#10 VM Resource Usage (Inside and Outside View)
Monitoring the resource usage of VMs from the hypervisor, i.e., from ‘outside’ the VMs, will point you to the resource-starved VMs on a host. However, to know why a VM is consuming resources excessively, it is important that administrators measure the performance of a VM from inside the VM, i.e., by tracking how a VM utilizes the CPU, memory, network, and disk resources allocated to it. This will point administrators to the root cause of resource contentions at the VM level.
The most common approach to monitoring VM resource usage is installing a monitor/agent on each VM. This approach is not recommended as it is time-consuming and escalates costs. Ideally, a monitoring solution should be able to provide deep-dive insights into the internal operations and resource usage of individual VMs without requiring a monitor/agent per VM.
These VMware performance metrics are just the tip of the VMware monitoring iceberg! For best VM performance, administrators may also want to track the status and version of the VMware Tools installed on each VM. The usage of GPUs and vGPUs should also be monitored so that both the physical host and VMs are sized with the right number of GPU resources. Monitoring the uptime of the VMs and host can help capture unscheduled reboots. TCP connections to VMs should also be tracked so that connection drops and retransmissions can be instantly detected and investigated. And the list goes on! It is important that administrators continuously collect these VMware performance metrics and analyze them, as such analysis could shed light on potential performance issues and could enable administrators to resolve the issues before they become business-impacting.