What is a SAN?
A Storage Area Network (SAN) is a specialized, high-speed network that provides block-level network access to storage. SANs are typically composed of hosts, switches, storage elements, and storage devices that are interconnected using a variety of technologies, topologies, and protocols. Each computer on the network can access storage on the SAN as if it were a local disk connected directly to the computer.
SANs are often used to:
- Improve application availability (e.g., multiple data paths)
- Enhance application performance (e.g., off-load storage functions, segregate networks, etc.)
- Increase storage utilization and effectiveness (e.g., consolidate storage resources, provide tiered storage, etc.)
- Improve data protection and security
SANs also typically play an important role in an organization’s Business Continuity Management (BCM) activities.
Why Monitor SAN Performance?
SANs are the backbone of on-premises storage. They are also heavily used by major cloud providers. As websites, digital apps, and file servers rely heavily on storage devices, good SAN performance is crucial for good IT application performance. Furthermore, as storage is often shared across systems and applications, failure or slowdown of the storage tier affects all the applications that depend on it.
Monitoring of SAN performance is required to:
- Achieve the highest possible application or data uptime.
- Ensure that storage errors, failures, or bottlenecks, such as excessive I/O activity on a specific logical unit (LUN), do not lead to downtime or slow performance.
- Assist with bottleneck detection and planning to ensure the best utilization of server, storage, and network resources.
- Identify potential areas of congestion and latency before they affect application performance.
Where to Monitor SAN Performance From?
The level at which SAN performance should be monitored is an important question. For example, in a virtualized environment such as one deploying VMware vSphere, storage performance can be monitored at several different levels.
- At the VM guest level: Using WMI and other OS commands, you can monitor the activity of the different disk drives on a system. The percentage of time a disk is busy, the disk queue length, and the disk read and write latencies are the key metrics to track. If the disk queue length increases and the percentage of time that a logical disk drive is busy is close to 100%, it is an indicator that application performance within the VM may be affected.
- At the physical host level: The virtual disks inside VMs are provisioned from datastores, which are maintained by the virtualized infrastructure. A datastore is like a storage appliance that serves up storage space for virtual disks inside the VMs and stores the VM definitions themselves. Monitoring performance metrics at the datastore level can indicate if the datastore is seeing excessive IOPS or its read/write latency has increased.
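The guest-level rule of thumb above (a disk that stays close to 100% busy while its queue keeps growing) can be expressed as a small check. This is only a minimal sketch: the `DiskSample` structure, the `disk_looks_saturated` function, and the 95% threshold are assumptions made for illustration, and the samples would come from whatever collector you already use (WMI, OS commands, etc.).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiskSample:
    """One polling sample for a logical disk (hypothetical structure)."""
    busy_percent: float   # percentage of time the disk was busy
    queue_length: float   # average number of queued I/O requests


def disk_looks_saturated(samples: List[DiskSample],
                         busy_threshold: float = 95.0) -> bool:
    """Return True when the disk stays near 100% busy while its queue grows.

    The 95% default is an assumed threshold, not a vendor recommendation.
    """
    if len(samples) < 2:
        return False  # not enough data to judge a trend
    consistently_busy = all(s.busy_percent >= busy_threshold for s in samples)
    queue_growing = samples[-1].queue_length > samples[0].queue_length
    return consistently_busy and queue_growing


# Three consecutive samples from one logical disk (made-up values).
samples = [DiskSample(96.0, 2.0), DiskSample(98.5, 5.0), DiskSample(99.0, 9.0)]
print(disk_looks_saturated(samples))
```

Requiring both conditions avoids flagging a disk that is merely busy for a moment or one whose queue grows briefly during a burst.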
Datastores are mapped to storage LUNs. The virtual platform also provides performance indications at the LUN level. The performance of a LUN as measured by a VMware vSphere server represents the workload and latency from that server. LUNs can be shared across VMware servers. Hence, you need to look at LUN metrics across all the vSphere servers to see the total workload of the LUN.
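Because each vSphere server only reports its own share of a LUN's workload, the per-host numbers have to be added up to see the total. A minimal sketch of that aggregation follows; the tuple layout, host names, and LUN identifiers are hypothetical and stand in for whatever your collector returns.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (host name, LUN identifier, IOPS observed by that host): assumed sample shape
HostLunSample = Tuple[str, str, float]


def total_iops_per_lun(samples: Iterable[HostLunSample]) -> Dict[str, float]:
    """Sum the IOPS that every host reports for each shared LUN."""
    totals: Dict[str, float] = defaultdict(float)
    for _host, lun, iops in samples:
        totals[lun] += iops
    return dict(totals)


# Made-up readings: lun-7 is shared by two hosts, lun-9 is used by one.
samples = [
    ("esx-01", "lun-7", 1200.0),
    ("esx-02", "lun-7", 800.0),
    ("esx-01", "lun-9", 300.0),
]
print(total_iops_per_lun(samples))
```

Here neither host alone shows more than 1,200 IOPS on `lun-7`, but the combined view shows 2,000.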
Note that what you are seeing in Figure 4 is only the workload from the vSphere environment to that LUN. If other types of servers are directly connected to the LUN, metrics from the vSphere environment do not reflect the total workload of the LUN. For example, a physical server used for backups may be directly connected to the LUN. In this case, when the LUN slows down, the VMware metrics can highlight the slowness, but the cause of the issue (excessive demand from the backup server) may be missed. This is why monitoring storage performance only from the client side (e.g., the servers that connect to the SAN) is not sufficient.
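One way to spot this kind of hidden demand is to compare what the storage array reports for a LUN with the sum of what the monitored hosts report. The sketch below assumes both numbers are available as IOPS for the same interval; the function name `unaccounted_iops` and all values are illustrative only.

```python
from typing import Dict


def unaccounted_iops(array_iops: float, host_iops: Dict[str, float]) -> float:
    """IOPS the array sees on a LUN that no monitored host reported."""
    gap = array_iops - sum(host_iops.values())
    # Small negative gaps can come from sampling skew, so clamp at zero.
    return max(gap, 0.0)


# Made-up figures: the array sees 3,500 IOPS, the vSphere hosts report 2,000.
vsphere_view = {"esx-01": 1200.0, "esx-02": 800.0}
print(unaccounted_iops(array_iops=3500.0, host_iops=vsphere_view))
```

A large gap suggests that a client outside the monitored environment, such as the backup server in the example above, is generating part of the load.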
At the same time, the performance metrics at the LUN, datastore, and VM levels can help isolate the problem to a storage issue, but they may not provide sufficient detail to help diagnose the problem. You can’t tell what caused a slowdown: is it due to one of the disks in a RAID group failing, or due to a hardware issue on the storage device? To get this level of detail, you will need in-depth monitoring of the storage tier.
Figure 5: How communication from a VM to a storage device happens
From the storage tier: Figure 5 depicts how the storage tier supports other tiers – in this example, the virtualization tier. When a guest OS reads/writes to a disk, device drivers in the VM OS communicate with the virtual disk controllers on the VMware server. The virtual disk controller forwards the command to the VMkernel. The VMkernel maps the requests to blocks on the appropriate physical device. The host HBA transmits the request via the storage fabric to one of the SAN switches, which routes it to the corresponding storage device.
To monitor the storage tier, it is essential to monitor the SAN switch, the storage processors, the storage array, and the physical disks. Protocols such as SMI-S and SNMP, or command line interfaces supported by the different storage device vendors can be used for storage monitoring.
Monitoring of storage performance from the client-end (e.g., from VMware vSphere servers) is NOT equivalent to monitoring of the storage tier in-depth.
Monitoring of the storage performance from the client end and from the storage tier can be complementary.
The table below summarizes how these views complement each other.
| Where to monitor storage performance from | What it reveals |
|---|---|
| Monitoring from the VM | Which application/process is causing many I/O operations? Which file(s) are being accessed? Is it due to read/write operations? |
| Monitoring from the host | Which VM is causing many I/O operations? Which LUN is seeing slowness? Is a specific datastore not performing as well as the others? |
| Monitoring from the storage tier | Is a specific LUN on the storage tier causing slowness? Have any of the disks in a RAID group failed? |
Requirements for SAN Performance Monitoring
- Multi-vendor support: Storage architectures from different vendors may not be the same. Some use in-memory caching, and some use intelligent mechanisms to lay out blocks on disk. The protocols supported for monitoring and the metrics that matter will also differ from one vendor to another. If multiple monitoring tools must be used, one for each vendor, the monitoring process becomes time-consuming and laborious. So, at a minimum, a monitoring tool for storage tiers must cover all the different vendor technologies that are being deployed by an organization.
- Monitoring of all storage components: To provide a unified view of the storage tier, it is important to monitor the storage devices as well as the SAN switches. Any errors seen by the SAN switches need to be flagged proactively, so they can be addressed before they cause user-noticeable issues.
- Addressing workload constraints: Workload constraints can exist at each processing stage – from the host bus adapter (HBA) via the SAN switch fabric to the storage system front-end ports and then through the CPU and cache to the back-end media. A bottleneck can be created by an overworked component anywhere along the channel and can affect overall storage performance.
- Detecting workload changes: Unexpected changes in workload can affect storage performance. SAN monitoring tools must be capable of detecting such changes and alerting administrators proactively. Auto-baselining of current workloads and real-time comparison with past patterns can help identify such changes.
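To make the idea of auto-baselining concrete, here is a minimal sketch that learns a per-hour norm from past readings and flags a new reading that strays too far from it. It is a generic mean-and-standard-deviation approach chosen for illustration, not a description of how any particular monitoring product computes its baselines; the function names, the three-sigma tolerance, and the sample values are all assumptions.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

Baseline = Dict[int, Tuple[float, float]]  # hour of day -> (mean, std deviation)


def build_baseline(history: Dict[int, List[float]]) -> Baseline:
    """Learn the normal value and spread for each hour of day from past data."""
    return {hour: (mean(values), stdev(values))
            for hour, values in history.items()
            if len(values) >= 2}  # stdev needs at least two data points


def deviates(baseline: Baseline, hour: int, value: float,
             tolerance: float = 3.0) -> bool:
    """Return True when a reading is more than `tolerance` deviations off the norm."""
    if hour not in baseline:
        return False  # no history for this hour, so nothing to compare against
    avg, spread = baseline[hour]
    return abs(value - avg) > tolerance * spread


# Made-up IOPS history for one LUN: busy at 09:00, quiet at 02:00.
history = {9: [1000.0, 1100.0, 950.0, 1050.0], 2: [200.0, 180.0, 220.0, 210.0]}
baseline = build_baseline(history)
print(deviates(baseline, 9, 1080.0))  # within the usual morning range
print(deviates(baseline, 2, 900.0))   # far above the usual overnight load
```

Keeping a separate norm for each hour means that 900 IOPS is flagged at 02:00 even though it would be unremarkable at 09:00, which a single static threshold could not capture.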
eG Enterprise for Storage Performance Monitoring
eG Enterprise addresses all the key requirements for storage performance monitoring:
eG Enterprise has out-of-the-box support for 20+ storage device types, including Dell EqualLogic, IBM DS RAID, HP EVA, HP 3PAR, EMC VNX and CLARiiON, Hitachi AMS, NetApp USD, NetApp Cluster USP, etc. Unlike network monitoring, where SNMP is a standard, storage monitoring has no single standard. So, eG Enterprise uses various mechanisms, such as SNMP, SMI-S, and command line interfaces (CLI), to monitor different storage technologies.
With eG Enterprise, IT admins have a single unified console from where they can monitor, diagnose, and report on heterogeneous device types. A layered stack model paradigm makes it easy for administrators to monitor different storage device types.
Questions that can be answered using eG Enterprise include:
- Is there an issue with the storage hardware: enclosures, power supplies, fans, etc.?
- How many physical disks have been configured and are they working well?
- How many disk pools are there, what is the free capacity of each pool and are any of the pools heavily fragmented?
- Are any of the LUNs overloaded? What is the IOPS volume on each LUN and what is the queue length for each LUN?
- How busy are the storage processors? What is the throughput from each storage processor?
Using SNMP, eG Enterprise also monitors the SAN switches. Periodic polling is used to check on the health and usage of these switches. eG Enterprise also receives SNMP traps from storage devices and can immediately alert administrators to any issues.
From the same web console, administrators can also monitor storage performance as seen by the host at the LUN level, storage adapter level, and datastore level. Using eG Enterprise’s unique ‘inside and outside monitoring of VMs’, administrators can also track storage usage within VMs to identify any applications that may be causing excessive I/O activity.
From the eG Enterprise web console, IT admins have access to a host of capabilities to analyze the collected metrics:
- Auto-baselining provides insights into norms of metrics for the different storage tiers.
- Customized dashboards can be created by using the “My Dashboard” capability of eG Enterprise.
- Storage tiers can be mapped in application topology maps, allowing IT admins to look at the end-to-end performance of all tiers that affect an application’s performance.
- Different types of reports are also available for analyzing past history as well as forecasting future performance trends.
There’s a lot of confusion around what storage monitoring really is about. In this blog, we have highlighted why storage monitoring is important and how different perspectives of storage performance can be obtained – from the storage devices themselves, and from the VMs and hosts that use the storage devices.