Storage LUNs – ESX Test

LUN stands for Logical Unit Number. A LUN can refer to an entire physical disk, or to a subset of a larger physical disk or disk volume. The underlying physical disk or disk volume could be a single disk drive, a partition (subset) of a single disk drive, or a disk volume from a RAID controller that aggregates multiple disk drives for larger capacity and redundancy.

The Storage LUNs – ESX test reports critical usage statistics pertaining to every LUN on the ESX server host. In VMware environments, a LUN is typically referred to by its HBA path, where HBA stands for Host Bus Adapter. The HBA path is expressed using the notation vmhba<HBA ID>:<Target ID>:<LUN ID>. For example, in the HBA path vmhba0:0:0, the first 0 is the HBA ID, the second 0 is the Target ID, and the final 0 is the LUN ID.
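
As an illustration of the notation, the following minimal Python sketch splits an HBA path such as vmhba0:0:0 into its three IDs. The names HbaPath and parse_hba_path are hypothetical and are not part of eG Enterprise or any VMware API; the sketch assumes only the three-part notation described above.

```python
from typing import NamedTuple


class HbaPath(NamedTuple):
    """The three numeric components of an HBA path (hypothetical helper type)."""
    hba_id: int
    target_id: int
    lun_id: int


def parse_hba_path(path: str) -> HbaPath:
    """Split a path such as 'vmhba0:0:0' into HBA ID, Target ID, and LUN ID."""
    prefix = "vmhba"
    if not path.startswith(prefix):
        raise ValueError(f"not an HBA path: {path!r}")
    # Everything after the 'vmhba' prefix is expected to be 'HBA:Target:LUN'.
    parts = path[len(prefix):].split(":")
    if len(parts) != 3:
        raise ValueError(f"expected vmhba<HBA ID>:<Target ID>:<LUN ID>, got {path!r}")
    hba_id, target_id, lun_id = (int(part) for part in parts)
    return HbaPath(hba_id, target_id, lun_id)
```

Paths that use any other layout are rejected with a ValueError rather than guessed at.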

Target of the test : An ESX server host

Agent deploying the test : An internal/remote agent

Outputs of the test : One set of results for each LUN on the ESX server host that is monitored

Configurable parameters for the test
  1. Test period - How often the test should be executed.
  2. Host - The host for which the test is to be configured.
  3. port - The port at which the specified host listens. By default, this is NULL.
  4. esx user and esx password - To enable the test to extract the desired metrics from a target ESX server, you need to configure the test with an ESX USER and ESX PASSWORD. The credentials to be passed here depend upon the mechanism used by the eG agent for collecting performance statistics from the ESX server and its VMs. These monitoring methodologies and their corresponding configuration requirements are discussed below:

    • Monitoring using the web services interface of the ESX server: Starting with ESX server 3.0, a VMware ESX server offers a web service interface using which the eG agent collects metrics from the ESX server. The VMware VI SDK is used by the agent to implement the web services interface. To use this interface for monitoring, this test should be configured with an ESX USER who has “Read-only” privileges to the target ESX server. By default, the root user is authorized to execute the test. However, it is preferable that you create a new user on the target ESX host and assign the “Read-only” role to him/her. The steps for achieving this have been elaborately discussed in Increasing the Memory Settings of the eG Agent that Monitors ESX Servers section.

      ESX servers terminate user sessions based on timeout periods. The default timeout period is 30 mins. When you stop an agent, sessions currently in use by the agent will remain open for this timeout period until ESX times out the session. If the agent is restarted within the timeout period, it will open a new set of sessions. If you want the eG agent to close already existing sessions before it opens new sessions, then you would have to configure all the tests with the credentials of an ESX user with permissions to View and stop sessions (prior to vSphere/ESX server 4.1, this was called the View and Terminate Sessions privilege). To know how to grant this permission to an ESX user, refer to Creating a Special Role on an ESX Server and Assigning the Role to a New User to the Server section.

      Sometimes, the VMware VI SDK may cache the hardware status metrics it collects and provide the test with the cached results. As a result, the eG agent may receive obsolete hardware status information from the SDK, which is also why you may at times notice a mismatch between the hardware status reported by the eG agent and that shown by the vSphere client. To ensure that the eG agent always reports the current hardware status, configure the eG agent to obtain the hardware metrics from the VMware VI SDK only after the SDK resets the cache to clear its contents and then refreshes the cache so that the latest hardware status information is fetched into it. To enable the eG agent to make the reset and refresh SDK calls, the esx user and esx password parameters must be configured with the credentials of a vSphere user with the Change Settings privilege. To do so, create a special role on vSphere, assign the Change Settings privilege to that role, and then assign the role to a new user on vSphere. The procedure is detailed in the Configuring the eG Agent to Collect Current Hardware Status Metrics section.

    • Monitoring using the vCenter in the target environment: By default, the eG agent connects to each ESX server and collects metrics from it. While this approach scales well, it requires additional configuration for each server being monitored. For example, separate user accounts may need to be created on each server for read-only access to VM details. While monitoring large virtualized installations however, the agents can be optionally configured to monitor ESX servers using the statistics already available with different vCenter installations in the environment.

    In this case, the ESX USER and ESX PASSWORD that you specify should be those of an Administrator or Virtual Machine Administrator in vCenter. If security constraints prevent you from using the credentials of such users, you can instead create a special role on vCenter with ‘Read-only’ privileges.

    Refer to Assigning the ‘Read-Only’ Role to a Local/Domain User to vCenter section to know how to create a user on vCenter.

    If the ESX server for which this test is being configured had been discovered via vCenter, then the eG manager automatically populates the esx user and esx password text boxes with the vCenter user credentials using which the ESX discovery was performed.

    Like ESX servers, vCenter servers too terminate user sessions based on timeout periods. The default timeout period is 30 mins. When you stop an agent, sessions currently in use by the agent will remain open for this timeout period until vCenter times out the session. If the agent is restarted within the timeout period, it will open a new set of sessions. If you want the eG agent to close already existing sessions before it opens new sessions, then you would have to configure all the tests with the credentials of a vCenter user with permissions to View and stop sessions (prior to vCenter 4.1, this was called the View and Terminate Sessions permission). To know how to grant this permission to a vCenter user, refer to the Creating a Special Role on vCenter and Assigning the Role to a Local/Domain User section. When the eG agent is started/restarted, it first attempts to connect to the vCenter server and terminate all existing sessions for the user whose credentials have been provided for the tests.

    This is done to ensure that unnecessary sessions do not remain established in the vCenter server for the session timeout period.  Ideally, you should create a separate user account with the required credentials and use this for the test configurations. If you provide the credentials for an existing user for the test configuration, when the eG agent starts/restarts, it will close all existing sessions for this user (including sessions you may have opened using the Virtual Infrastructure client). Hence, in this case, you may notice that your VI client sessions are terminated when the eG agent starts/restarts.

    Sometimes, the VMware VI SDK may cache the hardware status metrics it collects and provide the test with the cached results. As a result, the eG agent may receive obsolete hardware status information from the SDK, which is also why you may at times notice a mismatch between the hardware status reported by the eG agent and that shown by the vSphere client. To ensure that the eG agent always reports the current hardware status, configure the eG agent to obtain the hardware metrics from the VMware VI SDK only after the SDK resets the cache to clear its contents and then refreshes the cache so that the latest hardware status information is fetched into it. To enable the eG agent to make the reset and refresh SDK calls, the esx user and esx password parameters must be configured with the credentials of a vCenter user with the Change Settings privilege. To do so, create a special role on vCenter, assign the Change Settings privilege to that role, and then assign the role to a new user on vCenter. The procedure is detailed in the Configuring the eG Agent to Collect Current Hardware Status Metrics section.

  5. confirm password - Confirm the password by retyping it here.
  6. ssl - By default, the ESX server is SSL-enabled. Accordingly, the SSL flag is set to Yes by default. This indicates that the eG agent will communicate with the ESX server via HTTPS by default.

    Like the ESX server, vCenter is also SSL-enabled by default. If you have chosen to use vCenter for monitoring, then you have to set the SSL flag to Yes.

  7. webport - By default, in most virtualized environments, the vSphere/ESX server and vCenter listen on port 80 (if not SSL-enabled) or on port 443 (if SSL-enabled). This implies that while monitoring an SSL-enabled vSphere/ESX server directly, the eG agent, by default, connects to port 443 of the vSphere/ESX server to pull out metrics, and while monitoring a non-SSL-enabled server, the eG agent connects to port 80. Similarly, while monitoring a vSphere/ESX server via an SSL-enabled vCenter, the eG agent connects to port 443 of vCenter to pull out the metrics, and while monitoring via a non-SSL-enabled vCenter, the eG agent connects to port 80 of vCenter. 

    Accordingly, the webport parameter is set to 80 or 443 depending upon the status of the ssl flag.  In some environments however, the default ports 80 or 443 might not apply. In such a case, against the webport parameter, you can specify the exact port at which the vSphere/ESX server or vCenter in your environment listens so that the eG agent communicates with that port.

  8. VIRTUAL CENTER - If the eG manager had discovered the target ESX server by connecting to vCenter, then the IP address of the vCenter server used for discovering this ESX server would be automatically displayed against the VIRTUAL CENTER parameter; similarly, the esx user and esx password text boxes will be automatically populated with the vCenter user credentials using which ESX discovery was performed.

    If this ESX server has not been discovered using vCenter, but you still want to monitor the ESX server via vCenter, then select the IP address of the vCenter host that you wish to use for monitoring the ESX server from the VIRTUAL CENTER list. By default, this list is populated with the IP addresses of all vCenter hosts that were added to the eG Enterprise system at the time of discovery. Upon selection, the esx user and esx password that were pre-configured for that vCenter server will be automatically displayed against the respective text boxes.

    If the IP address of the vCenter server of interest to you is not available in the list, you can add the details of the vCenter server on-the-fly by selecting the Other option from the VIRTUAL CENTER list. This will invoke the add vcenter server details page. Refer to the Adding the Details of a vCenter Server for Guest Discovery section to know how to add a vCenter server using this page. Once the vCenter server is added, its IP address, esx user, and esx password will be displayed against the corresponding text boxes.

    On the other hand, if you want the eG agent to behave in the default manner (i.e., communicate with each ESX server to monitor it), then set the VIRTUAL CENTER parameter to ‘none’. In this case, the ESX USER and ESX PASSWORD parameters can be configured with the credentials of a user who has at least ‘Read-only’ privileges to the target ESX server.

  9. REPORT DD IOPS VALUE ABOVE - To conserve space on the database, this test generates the detailed diagnostics of the Total IOPS measure only if the value of the Total IOPS measure is greater than or equal to 100, by default. However, you can override this setting if necessary.

  10. REPORT DD THROUGHPUT VALUE ABOVE - To conserve space on the database, this test generates the detailed diagnostics of the Throughput measure only if the value of the Throughput measure is greater than or equal to 5. However, you can override this setting if necessary.
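
The defaults described for the webport, REPORT DD IOPS VALUE ABOVE, and REPORT DD THROUGHPUT VALUE ABOVE parameters can be summarized in a short Python sketch. It only illustrates the documented defaults and is not the eG agent's implementation; the function and constant names are hypothetical.

```python
from typing import Optional

# Default detailed-diagnosis thresholds, taken from the parameter descriptions above.
DEFAULT_DD_IOPS_THRESHOLD = 100
DEFAULT_DD_THROUGHPUT_THRESHOLD = 5


def effective_webport(ssl_enabled: bool, webport: Optional[int] = None) -> int:
    """Return the port to connect to: an explicit webport wins,
    otherwise 443 when SSL is enabled and 80 when it is not."""
    if webport is not None:
        return webport
    return 443 if ssl_enabled else 80


def should_report_dd(value: float, threshold: float) -> bool:
    """Detailed diagnostics are generated only when the measure value
    is greater than or equal to the configured threshold."""
    return value >= threshold
```

For example, effective_webport(True) yields 443, while should_report_dd(99, DEFAULT_DD_IOPS_THRESHOLD) is False because 99 is below the default IOPS threshold of 100.
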
Measurements made by the test
Measurement Description Measurement Unit Interpretation

Physical disk reads

Indicates the rate at which read commands were issued.

Commands/Sec

 

Physical disk writes

Indicates the rate at which write commands were issued.

Commands/Sec

 

Issued commands

Indicates the number of commands issued per second.

Commands/Sec

 

Physical disk commands aborted

Indicates the number of commands aborted per second.

Aborts/Sec

 

Data writes to physical disk

Indicates the rate at which data was written.

MB/Sec

 

Data reads from physical disk

Indicates the rate at which data was read.

MB/Sec

 

Bus resets

Indicates the number of SCSI bus resets.

Number

The VMs use the SCSI protocol to communicate with disks, even over Fibre Channel to SAN LUNs. SCSI bus resets are issued to release resources. In effect, a SCSI bus reset means that the SCSI subsystem timed out, canceled the outstanding commands, and retried them. This happens when the HBA device is overloaded or its queue depth is exhausted. The first thing to determine is which vmhba controller (C), target/path (T), and LUN (L) experienced these problems. The more VMs that share a single LUN, the more likely it is that resets will occur. As a rule of thumb, no more than 10 VMs should share a LUN.

Guest read latency

Indicates the average amount of time taken for a read from the perspective of a guest operating system.

Secs

This is the sum of kernel latency and physical device read latency.

High latency is a cause for concern, as it is an indicator of contention for storage resources.

If the value of this measure is high, then check the values reported by the Kernel disk read latency and Physical device read latency measures for this LUN. Doing so enables you to quickly determine why the guest OS is experiencing latencies while reading from the disk: is it because of latencies in the VM kernel, or is it owing to a slowdown in the physical device?

Guest write latency

Indicates the average amount of time taken for a write from the perspective of a guest operating system.

Secs

This is the sum of kernel latency and physical device write latency.

High latency is always a cause for concern, as it is an indicator of contention for storage resources.

If the value of this measure is high, then check the values reported by the Kernel disk write latency and Physical device write latency measures for this LUN. Doing so enables you to quickly determine why the guest OS is experiencing latencies while writing to this disk: is it because of latencies in the VM kernel, or is it owing to a slowdown in the physical device?

Disk command latency

Indicates the average amount of time taken for a command to execute, from the perspective of a guest operating system.

Secs

This is the sum of kernel latency and physical device command latency.

High latency is always a cause for concern, as it is an indicator of contention for storage resources.

If the value of this measure is high, then check the values reported by the Kernel disk command latency and Physical device command latency measures for this LUN. Doing so enables you to quickly determine why the guest OS is experiencing latencies while executing commands on this disk: is it because of latencies in the VM kernel, or is it owing to a slowdown in the physical device?
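
The relationship described for the three guest latency measures (guest latency is the sum of kernel latency and physical device latency) can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that both inputs are expressed in seconds, as the measures above are.

```python
def guest_latency(kernel_secs: float, device_secs: float) -> float:
    """Guest-observed latency is the sum of kernel and physical device latency."""
    return kernel_secs + device_secs


def dominant_latency_source(kernel_secs: float, device_secs: float) -> str:
    """Name the component that contributes more to the guest latency."""
    if kernel_secs > device_secs:
        return "kernel"
    if device_secs > kernel_secs:
        return "device"
    return "equal"
```

When a guest latency measure is high, comparing the two components in this way points to whether the VM kernel or the physical device deserves attention first.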

Kernel disk read latency

Indicates the average time spent in the ESX server’s VM kernel per read.

Secs

A high value for this measure is a cause for concern. You might hence want to investigate the reasons for the same.

Kernel disk write latency

Indicates the average time spent in the ESX server VM kernel per write.

Secs

Kernel disk command latency

Indicates the average time spent in the ESX server VM kernel per command.

Secs

Queue read latency

Indicates the average time spent in the ESX server VM kernel queue per read.

Secs

A high value of this measure indicates that the VMkernel is unable to process queued read requests quickly. If the problem persists, then the queue size could increase considerably.

To avoid this, swiftly determine the reasons for a slowdown at the VMkernel, and fix it.

Queue write latency

Indicates the average time spent in the ESX server VM kernel queue per write.

Secs

A high value of this measure indicates that the VMkernel is unable to process queued write requests quickly. If the problem persists, then the queue size could increase considerably.

To avoid this, swiftly determine the reasons for a slowdown at the VMkernel, and fix it.

Queue command latency

Indicates the average time spent in the ESX server VM kernel queue per command.

Secs

A high value of this measure indicates that the VMkernel is unable to process queued commands quickly. If the problem persists, then the queue size could increase considerably.

To avoid this, swiftly determine the reasons for a slowdown at the VMkernel, and fix it.

Physical device read latency

Indicates the average time taken to complete a read from the physical device.

Secs

Shortage of the physical storage resources can adversely impact the performance of the ESX host and the VMs configured on it.

Therefore, if the value of this measure is very high or is steadily increasing, then quickly figure out what is pulling down the performance of the physical disk, and attend to it.

Physical device write latency

Indicates the average time taken to complete a write to the physical device.

Secs

Physical device command latency

Indicates the average time taken to complete a command from the physical device.

Secs

Status

Indicates the current state of this LUN.

 

A LUN can be in any one of the following states:

  • Ok
  • error
  • Off
  • unknownState
  • lostCommunication
  • degraded
  • quiesced

The numeric values that correspond to each of the states discussed above are listed in the table below:

State               Value
UnknownState        0
Ok                  1
Off                 2
Error               3
lostCommunication   4
degraded            5
quiesced            6

Note:

By default, this measure reports the States listed in the table above to indicate the status of a LUN. The graph of this measure however, represents the status of a LUN using the numeric equivalents - 0 to 6.
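
The state-to-number mapping used in the graph of the Status measure can be expressed as a simple lookup. The Python sketch below copies the values from the table above; the names LUN_STATE_VALUES and state_name are hypothetical and are not part of eG Enterprise.

```python
# State names and numeric equivalents, copied from the table above.
LUN_STATE_VALUES = {
    "UnknownState": 0,
    "Ok": 1,
    "Off": 2,
    "Error": 3,
    "lostCommunication": 4,
    "degraded": 5,
    "quiesced": 6,
}

# Reverse lookup: numeric value (as plotted in the graph) to state name.
LUN_STATE_NAMES = {value: name for name, value in LUN_STATE_VALUES.items()}


def state_name(value: int) -> str:
    """Translate a numeric value from the graph back into its state name."""
    if value not in LUN_STATE_NAMES:
        raise ValueError(f"unknown LUN state value: {value}")
    return LUN_STATE_NAMES[value]
```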

Number of multi paths available to a lun

Indicates the number of storage paths through which the host communicates with this LUN.

Number

To maintain a constant connection between an ESX/ESXi host and its storage, ESX/ESXi supports multipathing. Multipathing is a technique that lets you use more than one physical path that transfers data between the host and an external storage device. In case of a failure of any element in the SAN network, such as an adapter, switch, or cable, ESX/ESXi can switch to another physical path, which does not use the failed component. This process of path switching to avoid failed components is known as path failover.

In addition to path failover, multipathing provides load balancing. Load balancing is the process of distributing I/O loads across multiple physical paths. Load balancing reduces or removes potential bottlenecks.

Queue depth

Indicates the number of outstanding I/O requests to this LUN for which a response has not been received from the LUN.

Number

A low value is desired for this measure. A high value is indicative of a large number of pending requests for the LUN and hints at a potential processing bottleneck on the LUN.

Throughput

Indicates the rate at which data was read from and written to this LUN.

MB/Sec

A high value is desired for this measure. A consistent decrease in the value of this measure can signal a potential slowdown of the LUN.

Capacity

Indicates the total capacity of this LUN.

MB

 

Capacity used

Indicates the amount of space used in this LUN.

MB

Compare the value of this measure across LUNs to know which LUN is consuming space excessively.

Capacity reserved

Indicates the amount of space reserved for this LUN.

MB

 

Free capacity

Indicates the amount of free space currently available for this LUN.

MB

Compare the value of this measure across LUNs to know which LUN is rapidly running out of space.

Capacity usage

Indicates the percentage of space in this LUN that is in use.

Percent

A low value is desired for this measure. A value close to 100% is a cause for concern, as it indicates a potential space crunch on the LUN. You may want to compare the value of this measure across LUNs to know which LUN is utilizing disk space excessively.
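
The cross-LUN comparison suggested above can be sketched in Python as follows. The sketch assumes that Capacity usage is computed as Capacity used divided by Capacity, multiplied by 100; that formula is an assumption based on the measure descriptions and is not stated explicitly in this document. The function names and the sample LUN names are hypothetical.

```python
from typing import Dict, Tuple


def capacity_usage_percent(used_mb: float, capacity_mb: float) -> float:
    """Percentage of the LUN's capacity that is in use (assumed formula)."""
    if capacity_mb <= 0:
        raise ValueError("capacity must be positive")
    return used_mb / capacity_mb * 100.0


def fullest_lun(luns: Dict[str, Tuple[float, float]]) -> str:
    """Return the name of the LUN with the highest usage percentage.

    `luns` maps a LUN name to a (used_mb, capacity_mb) pair.
    """
    return max(luns, key=lambda name: capacity_usage_percent(*luns[name]))
```

For instance, given {"vmhba0:0:0": (512, 1024), "vmhba0:0:1": (900, 1000)}, the second LUN is reported because 90% usage exceeds 50%.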

Is SSD?

Indicates whether or not this LUN is SSD-based.

 

Solid State Disks (SSD) offer a much higher throughput and much lower latency than traditional magnetic hard disks, since they are based on flash memory. vSphere hosts can use locally attached SSDs as a host swap cache, as virtual flash, as a vSAN, or as a regular datastore.

If the LUN being monitored is SSD-based, then the value of this measure will be Yes. If not, then the value of this measure will be No.

The numeric values that correspond to each of the measure values listed above are as follows:

Measure value Numeric Value
Yes 1
No 0

Note:

By default, this measure reports the Measure Values listed in the table above. In the graph of this measure however, the SSD status of the LUN is represented using the corresponding numeric equivalents only.

LUN health

Indicates the current health status of this LUN.

 

The values that this measure can report and their corresponding numeric values are listed in the table below:

 

Measure Value            Numeric Value
Healthy                  0
Failed                   4
Offline                  5
Decommissioned           6
Permanent disk failure   16

Note:

By default, this measure reports the Measure Values listed in the table above. In the graph of this measure however, the health status of the LUN is represented using the corresponding numeric equivalents only.