IT problems happen even in the best-architected infrastructure, whether through configuration changes, failures, or upgrades. How quickly and effectively you can detect and resolve such problems dictates how efficient your IT operation is. Today, I’ll cover how eG Enterprise helped us troubleshoot a hardware failure (a storage battery failure) that caused a cascade of failures in a VMware ESXi infrastructure.

On Dec 5th 2025, an environment monitored by eG Enterprise started experiencing performance issues, and alerts were triggered across multiple components.

The Virtual Admin View

The VMware ESXi environment was configured to support hundreds of VMs hosting a variety of business-critical workloads including Tomcat application servers and databases.

eG Enterprise provides in-depth monitoring for VMware ESXi servers. Using VMware APIs, monitoring is performed in an agentless manner. The metrics collected cover all key components of a VMware server, including the hardware, storage LUNs, datastores, hypervisor memory and CPU usage, physical interfaces, virtual switches, and virtual machines and their resource usage levels. For more details of eG Enterprise’s virtualization monitoring capabilities and its unique inside/outside monitoring technology for virtual machines, see: VMware Monitoring Tools | eG Innovations.

On Dec 5th 2025, alerts were triggered because many of the ESXi server’s datastores were experiencing a sudden increase in latency. The figure below shows that write latencies rose to several times their earlier level.

Screenshot of write latencies dramatically increasing on the LUNs on a VMware ESXi server

Figure 1: A dramatic increase in write latency across all datastores of a VMware ESXi server.
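To make the idea of a "sudden increase" concrete, here is a minimal sketch of one way to flag latency samples that jump well above a recent baseline. It is only an illustration, not how eG Enterprise computes its alerts: the function name, the window size, the multiplier, and the sample values are all assumptions made for this example.

```python
from statistics import median

def find_latency_spikes(samples_ms, baseline_window=6, factor=3.0):
    """Return indexes of samples that exceed `factor` times the median
    of the `baseline_window` samples immediately before them."""
    spikes = []
    for i in range(baseline_window, len(samples_ms)):
        baseline = median(samples_ms[i - baseline_window:i])
        if baseline > 0 and samples_ms[i] > factor * baseline:
            spikes.append(i)
    return spikes

# Hypothetical write latency samples in milliseconds (not real data).
write_latency_ms = [2.1, 1.9, 2.3, 2.0, 2.2, 2.1, 2.0, 14.8, 16.2, 15.5]
print(find_latency_spikes(write_latency_ms))  # [7, 8, 9]
```

Using the median rather than the mean keeps the baseline stable for the first few samples after a jump, so a sustained rise like the one in Figure 1 continues to be flagged instead of being absorbed into the baseline immediately.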

Monitoring of the VMs also revealed slow disk read times and this was causing business-critical applications to be slower than normal. At this point, the virtual administrator could see that a problem existed but had insufficient information to identify _why_ the issue had arisen.

The Storage Admin View

eG Enterprise’s unified monitoring capabilities allow virtual infrastructure, storage, network, and application components to be monitored from the same tool.

In an ESXi infrastructure, a SAN (Storage Area Network) is a dedicated, high-performance storage network that provides shared block-level storage to ESXi hosts. A SAN is where your virtual machine disks live, separate from the ESXi servers themselves.

The SAN was connected to multiple ESXi servers within the cluster. A LUN (Logical Unit Number) in SAN storage is a logical slice of storage carved out of a SAN and presented to a server or hypervisor as a disk. Several LUNs had been created in the SAN storage and presented to the ESXi cluster as datastores. All virtual machine files and virtual disks in the ESXi cluster resided on these SAN‑provisioned datastores.
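The dependency chain described above (array, LUN, datastore, VM) can be sketched as a small data structure. This is a hypothetical illustration of why a fault on one array can touch every VM in the cluster. The class, function, and all LUN, datastore, and VM names below are invented for the example, and the local datastore is included only as a contrast.

```python
from dataclasses import dataclass, field

@dataclass
class Datastore:
    name: str
    lun: str                      # LUN that backs this datastore
    vms: list = field(default_factory=list)

def impacted_by_array(array_luns, datastores):
    """Return (datastore names, VM names) that depend on any LUN of the array."""
    luns = set(array_luns)
    hit = [ds for ds in datastores if ds.lun in luns]
    ds_names = sorted(ds.name for ds in hit)
    vm_names = sorted({vm for ds in hit for vm in ds.vms})
    return ds_names, vm_names

# Hypothetical inventory (names are placeholders, not from the real environment).
datastores = [
    Datastore("datastore-01", "LUN-0", ["tomcat-01", "tomcat-02"]),
    Datastore("datastore-02", "LUN-1", ["db-01"]),
    Datastore("local-ds", "LOCAL-0", ["test-vm"]),
]

ds_names, vm_names = impacted_by_array(["LUN-0", "LUN-1"], datastores)
print(ds_names)  # ['datastore-01', 'datastore-02']
print(vm_names)  # ['db-01', 'tomcat-01', 'tomcat-02']
```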

eG Enterprise monitors storage devices using CLI, APIs, SNMP and other mechanisms. Monitoring of storage LUN capacity (see Figure 2) did not reveal a problem.

Screenshot showing the LUNs provisioned on VNXe storage within the eG Enterprise console

Figure 2: eG Enterprise was in place continuously monitoring the LUNs allocated on the storage.

At the same time as the high write latencies were seen on the ESXi datastores, alerts were also being triggered from the storage components. An alert was raised about a specific failure of the batteries on the VNXe storage (see Figure 3).

Screenshot of the alerts raised pinpointing a storage battery failure within eG Enterprise

Figure 3: The VNXe Storage Hardware layer had a battery failure alert.

Understanding the Root Cause

A graph of the health state of the VNXe battery shows that the health state degraded exactly when high latency was seen for writes (see Figure 4). A failed storage battery results in the cache being disabled. When the cache is unavailable, I/O performance across all LUNs is impacted, which leads to degraded VM performance and latency issues across the ESXi hosts. This explains the effects that were seen across the ESXi infrastructure.

Screenshot of graph showing the sudden failure of a VNXe storage battery with a timeline that pinpoints the occurrence

Figure 4: The timing and sudden nature of the battery failure were captured.
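The reasoning here is a timeline correlation: the battery health state changed at the same moment that write latency rose. A minimal sketch of that check follows, assuming two hypothetical time series of (timestamp, value) pairs. The state labels, threshold, tolerance, and timestamps are illustrative assumptions and not values taken from the incident.

```python
from datetime import datetime, timedelta

def first_time(series, predicate):
    """Return the timestamp of the first (timestamp, value) pair matching predicate."""
    for ts, value in series:
        if predicate(value):
            return ts
    return None

def coincides(battery_states, latency_ms, threshold_ms, tolerance):
    """True if the first non-OK battery state and the first latency sample
    above threshold_ms occur within `tolerance` of each other."""
    failed_at = first_time(battery_states, lambda s: s != "OK")
    slow_at = first_time(latency_ms, lambda v: v > threshold_ms)
    if failed_at is None or slow_at is None:
        return False
    return abs(slow_at - failed_at) <= tolerance

# Hypothetical samples at 5-minute intervals (times and values are made up).
t0 = datetime(2025, 12, 5, 10, 0)
step = timedelta(minutes=5)
battery_states = [(t0, "OK"), (t0 + step, "OK"), (t0 + 2 * step, "Degraded")]
latency_ms = [(t0, 2.0), (t0 + step, 2.2), (t0 + 2 * step, 15.1), (t0 + 3 * step, 16.0)]

print(coincides(battery_states, latency_ms, threshold_ms=10.0, tolerance=step))  # True
```

A match in time does not prove causation on its own. In this case the known behavior (a failed battery disables the cache) supplies the causal link, and the timeline confirms it.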

Visibility for the Whole Organization

This case highlights the need for IT teams to have visibility beyond point solutions. The degradation in VM performance showed up in the vCenter console, so an ESXi administrator relying on vCenter alone would have seen the symptoms. However, without visibility into the storage layers, the true root cause, the failed storage battery, would not have been identified.

By giving both the virtualization admin and the storage admin a unified view of all components and tiers in a single-pane-of-glass console, eG Enterprise enabled fast and accurate diagnosis of the problem. If the admins had had fragmented views, the virtual administrator would not have had visibility into the root cause, the storage battery failure, which was the key information as to _why_ the VMs were experiencing performance issues and latency.


About the Author

Karthik Ganesan is a Systems Manager at eG Innovations, where he has worked out of our R&D office in Chennai for over 10 years. Karthik started his career as a hands-on network engineer and has particular empathy for those involved in frontline customer support.