vSAN Disk Group Test
vSAN architecture consists of two tiers: a cache tier for read caching and write buffering, and a capacity tier for persistent storage. This two-tier design delivers high performance to VMs while ensuring that data is written to devices in the most efficient way possible. To manage the relationship between capacity devices and their cache device, vSAN uses a logical construct called a disk group. In a Virtual SAN enabled cluster, multiple disk groups are created on a host to improve storage performance significantly while limiting the size of a failure domain, thereby reducing the amount of data impacted by a physical device failure.
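As an illustration of the failure-domain point above (hypothetical sizes, not a product formula), splitting a host's capacity across more disk groups reduces the data impacted when a single cache device fails, since a cache device failure takes its entire disk group offline:

```python
# Illustrative sketch with hypothetical sizes: a cache device failure takes
# its whole disk group offline, so the data at risk is the used capacity
# behind that one cache device.

def data_at_risk_per_cache_failure(total_used_gb: float, disk_groups: int) -> float:
    """Used capacity (GB) impacted if one disk group's cache device fails,
    assuming data is spread evenly across the host's disk groups."""
    return total_used_gb / disk_groups

# A host with 8 TB of used capacity:
print(data_at_risk_per_cache_failure(8192, 1))  # one big disk group -> 8192.0 GB at risk
print(data_at_risk_per_cache_failure(8192, 4))  # four disk groups   -> 2048.0 GB at risk
```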
When a hardware device, host, or network fails, when a host is placed into maintenance mode, or when traffic congestion occurs, vSAN initiates resynchronization in the vSAN cluster. However, vSAN might briefly wait for the failed components to come back online before initiating resynchronization tasks. To ensure peak performance even after failures and resynchronization, administrators need to track the IO operations performed on the disk groups at regular intervals. This is where the vSAN DiskGroup test helps administrators.
This test auto-discovers the vSAN disk groups in the vSAN enabled clusters in the vCenter server, and reveals how well IO operations were performed during resynchronization on each disk group. In the process, this test also reports statistics related to space utilization and congestion on the disk groups. In addition, this test measures the throughput and latency of frontend IO operations on each disk group. This sheds light on IO processing delays, if any, and enables administrators to take necessary actions immediately.
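The frontend IOPS, latency, and outstanding-operation measures this test reports are related by Little's law (average outstanding IO ≈ IOPS × latency). A minimal sketch with hypothetical values:

```python
def outstanding_ios(iops: float, latency_sec: float) -> float:
    """Little's law: average number of in-flight (outstanding) operations."""
    return iops * latency_sec

# Hypothetical disk group: 2000 write IOPS at 2 ms average write latency
print(outstanding_ios(2000, 0.002))  # -> 4.0 outstanding writes on average
```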
Note:
This test is applicable only for the vSAN enabled clusters in the VMware vCenter server.
Target of the test : A VMware vCenter server
Agent deploying the test : An internal agent
Outputs of the test : One set of results for each vSAN disk group in the vSAN enabled cluster in the VMware vCenter server.
Parameter | Description |
---|---|
Test Period |
How often should the test be executed. |
Host |
The host for which this test is to be configured. |
Port |
Refers to the port at which the specified host listens. |
VC User and VC Password |
To connect to vCenter and extract metrics from it, this test should be configured with the name and password of a user with Administrator or Virtual Machine Administrator privileges to vCenter. However, if, owing to security constraints, you are not able to use the credentials of such users for test configuration, then you can configure this test with the credentials of a user with Read-only rights to vCenter. For this purpose, you can assign the ‘Read-only’ role to a local/domain user on vCenter, and then specify the name and password of this user against the VC User and VC Password text boxes. Note that vCenter servers terminate user sessions based on timeout periods; the default timeout period is 30 mins. When you stop an agent, sessions currently in use by the agent will remain open until vCenter times out the session. If the agent is restarted within the timeout period, it will open a new set of sessions. If you want the eG agent to close already existing sessions on vCenter before it opens new sessions, then, instead of the ‘Read-only’ user, you can optionally configure the VC User and VC Password parameters with the credentials of a user with permissions to View and Stop Sessions on vCenter. For this purpose, you can create a special role on vCenter, grant the View and Stop Sessions privilege (prior to vCenter 4.1, this was called the View and Terminate Sessions privilege) to this role, and then assign the new role to a local/domain user on vCenter. |
Confirm Password |
Confirm the password by retyping it in this text box. |
SSL |
By default, the vCenter server is SSL-enabled. Accordingly, the SSL flag is set to Yes by default. This indicates that the eG agent will communicate with the vCenter server via HTTPS by default. |
Webport |
By default, in most virtualized environments, vCenter listens on port 80 (if not SSL-enabled) or on port 443 (if SSL-enabled). This implies that while monitoring vCenter, the eG agent, by default, connects to port 80 or 443, depending upon the SSL-enabled status of vCenter – i.e., if vCenter is not SSL-enabled (i.e., if the SSL flag above is set to No), the eG agent connects to vCenter using port 80, and if vCenter is SSL-enabled (i.e., if the SSL flag is set to Yes), the agent-vCenter communication occurs via port 443. Accordingly, the Webport parameter is set to default. In some environments however, the default ports 80 or 443 might not apply. In such a case, specify against the Webport parameter the exact port at which vCenter in your environment listens, so that the eG agent communicates with that port for collecting metrics from vCenter. |
DD Frequency |
Refers to the frequency with which detailed diagnosis measures are to be generated for this test. The default is 1:1. This indicates that, by default, detailed measures will be generated every time this test runs, and also every time the test detects a problem. You can modify this frequency, if you so desire. Also, if you intend to disable the detailed diagnosis capability for this test, you can do so by specifying none against DD frequency. |
Detailed Diagnosis |
To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option. The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:
|
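The port-selection behavior described for the SSL and Webport parameters can be summarized as follows; `resolve_port` is a hypothetical helper used only for illustration, not part of the product:

```python
def resolve_port(ssl: bool, webport: str = "default") -> int:
    """Pick the vCenter port the agent should connect to: an explicitly
    configured Webport wins; otherwise 443 for SSL, 80 for plain HTTP."""
    if webport != "default":
        return int(webport)
    return 443 if ssl else 80

print(resolve_port(ssl=True))                  # -> 443
print(resolve_port(ssl=False))                 # -> 80
print(resolve_port(ssl=True, webport="8443"))  # -> 8443
```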
Measurement | Description | Measurement Unit | Interpretation |
---|---|---|---|
Frontend read IOPS |
Indicates the number of frontend read operations performed on this vSAN disk group per second. |
IOPS |
Virtual machines are considered the front-end – this is where the application on the virtual machine reads from and writes to the vSAN disks. Compare the values of these measures across the disk groups to know what is contributing to abnormal I/O activity levels - read operations? or write operations? |
Frontend write IOPS |
Indicates the number of frontend write operations performed on this vSAN disk group per second. |
IOPS |
|
Frontend read throughput |
Indicates the rate at which this disk group processes the Frontend read requests. |
MB/Sec |
|
Frontend write throughput |
Indicates the rate at which this disk group processes the Frontend write requests. |
MB/Sec |
|
Frontend read latency |
Indicates the average time taken by this disk group to process frontend read requests. |
Seconds |
Ideally, the values of these measures should be very low. By comparing the values of these measures, administrators can figure out where the slowness is maximum - when processing Frontend read requests? or Frontend write requests? |
Frontend write latency |
Indicates the average time taken by this disk group to process frontend write requests. |
Seconds |
|
Capacity |
Indicates the total capacity of this vSAN disk group. |
GB |
|
Used capacity |
Indicates the amount of space utilized from the total capacity of this vSAN disk group. |
GB |
Ideally, the value of this measure should be low. If the value of this measure is close to the Capacity measure, it indicates that the disk group is running out of space. Compare the value of this measure across the disk groups to identify the disk group that is being over utilized. |
Reserved capacity |
Indicates the amount of space that has been reserved for future use on this disk group. |
GB |
The space reserved on each disk group will be provisioned to the hosts after host failures or during maintenance. This way administrators can ensure that sufficient free capacity will be available for components to successfully rebuild after the host failures or during maintenance. |
Read Cache hit rate |
Indicates the percentage of reads delivered from the read cache for this disk group. |
Percent |
A high value is desired for this measure. A gradual/significant decrease in the value of this measure indicates that read performance is deteriorating when serving read operations from the read cache. In such a case, administrators should increase the size of the vSAN caching tier by adding more disk groups. Alternatively, administrators can tune the working set of the workload.
|
Read cache size |
Indicates the size of the read cache managed by this disk group. |
GB |
VMware vSAN leverages the SSD devices of each disk group as the "performance tier" for caching purposes. The purpose of leveraging SSD devices for caching is to serve the highest possible ratio of read operations from the data stored in the read cache and to minimize the read operations to be served by the capacity disks. |
Read cache read IOPS |
Indicates the number of read operations processed from the read cache. |
IOPS |
|
Read cache write IOPS |
Indicates the number of write operations processed from the read cache. |
IOPS |
|
Read cache Read latency |
Indicates the time taken by the read cache for processing the read requests. |
Seconds |
|
Read cache write latency |
Indicates the time taken by the read cache for processing the write requests. |
Seconds |
|
Write buffer size |
Indicates the size of the write buffer of this disk group. |
GB |
vSAN uses "write buffers" to de-stage written data (not individual write operations) in a way that creates a benign, near-sequential (proximal) write workload for the HDDs that form the capacity tier of the vSAN disk group. |
Write Buffer free percentage |
Indicates the percentage of space available for use in the write buffer of this disk group. |
Percent |
|
Write buffer read IOPS |
Indicates the number of read operations processed from the write buffer. |
IOPS |
|
Write buffer write IOPS |
Indicates the number of write operations processed from the write buffer. |
IOPS |
|
Write buffer read latency |
Indicates the time taken while processing read operations from the write buffer. |
Seconds |
|
Write buffer write latency |
Indicates the time taken while processing write operations from the write buffer. |
Seconds |
|
Bytes De-stage from SSD |
Indicates the rate at which bytes were destaged from the SSD. |
MB/Sec |
|
Zero-bytes De-stage |
Indicates the rate at which zeroed-out blocks are skipped when destaging data from SSD. |
MB/Sec |
|
Mem congestion |
Indicates the number of times the Mem congestion occurred on this disk group. |
Number |
Congestion is a flow control mechanism used by vSAN. Whenever a bottleneck occurs in a lower layer of vSAN (closer to the physical storage devices), vSAN uses this flow control (aka congestion) mechanism to relieve the bottleneck in the lower layer and instead reduce the rate of incoming I/O at the vSAN ingress, i.e., the vSAN clients (VM consumption). This reduction of the incoming rate is done by introducing an IO delay at the ingress that is equivalent to the delay the IO would have incurred due to the bottleneck at the lower layer. Thus, it is an effective way to shift latency from the lower layers to the ingress without changing the overall throughput of the system. Mem congestion occurs when the size of the memory heap used by vSAN internal components exceeds the threshold. |
Slab congestion |
Indicates the number of times the Slab congestion occurred on this disk group. |
Number |
Slab congestion is reported when the number of inflight operations exceeds the capacity of vSAN internal operation slabs. |
SSD congestion |
Indicates the number of times the SSD congestion occurred on this disk group. |
Number |
SSD congestion occurs when the cache tier disk write buffer runs out of space. |
Log congestion |
Indicates the number of times the Log congestion occurred on this disk group. |
Number |
Log congestion occurs when the vSAN internal log in the cache tier disk runs out of space. |
Comp congestion |
Indicates the number of times the Comp congestion occurred on this disk group. |
Number |
Comp congestion occurs when the size of the internal table used for vSAN object components exceeds the threshold. |
Cache invalidations |
Indicates the number of cache lines that are invalidated due to excessive write operations on this disk group. |
Number |
Cache invalidations are an indicator of the number of writes to the same address offset as existing data in the read cache. When a write operation to an IO address follows a read operation, the contents of the read cache must be updated. Such an eviction is referred to as a cache invalidation. |
Evictions |
Indicates the number of times the read cache contents were evicted due to read cache contention. |
Number |
Typically, contents in the read cache are evicted when the working set size is larger than the size of the read cache. A low value is desired for this measure. A gradual/sudden increase in the value of this measure indicates the deterioration in the read cache performance. |
Outstanding write OPs |
Indicates the number of outstanding write operations performed on this disk group. |
Number |
|
Outstanding recovery write OPs |
Indicates the number of outstanding recovery write operations performed on this disk group. |
Number |
|
Outstanding write I/O size |
Indicates the amount of data written on this disk group during the outstanding write operations. |
GB |
|
Outstanding recovery write I/O size |
Indicates the amount of data written on this disk group during the outstanding recovery write operations. |
GB |
|
Resync read IOPS caused by policy change |
Indicates the number of IO read operations used for performing resynchronization on this disk group due to change in policy settings. |
IOPS |
When there is a change in VM storage policy settings, vSAN might initiate object recreation and subsequent resynchronization of the objects. Compare the values of these measures across the vSAN disk groups to identify the disk group on which maximum number of read and write operations are performed for resynchronization due to change in policy settings. |
Resync write IOPS caused by policy change |
Indicates the number of IO write operations used for performing resynchronization on this disk group due to change in policy settings. |
IOPS |
|
Resync read IOPS caused by decommission |
Indicates the number of IO read operations used for performing resynchronization on this disk group due to decommission. |
IOPS |
Typically, decommissioning is performed for disk groups from vSAN while upgrading a device or replacing a failed device, or removing a cache device. Compare the values of these measures across the vSAN disk groups to identify the disk group on which maximum number of read and write operations are performed for resynchronization due to decommission. |
Resync write IOPS caused by decommission |
Indicates the number of IO write operations used for performing resynchronization on this disk group due to decommission. |
IOPS |
|
Resync read IOPS caused by rebalancing objects |
Indicates the number of IO read operations used for performing resynchronization on this disk group while rebalancing the objects. |
IOPS |
|
Resync write IOPS caused by rebalancing objects |
Indicates the number of IO write operations used for performing resynchronization on this disk group while rebalancing the objects. |
IOPS |
|
Resync read IOPS caused by object repair |
Indicates the number of IO read operations used for performing resynchronization on this disk group due to the object repair operation. |
IOPS |
|
Resync write IOPS caused by object repair |
Indicates the number of IO write operations used for performing resynchronization on this disk group due to the object repair operation. |
IOPS |
|
Resync read throughput caused by policy change |
Indicates the rate at which the data is read for performing resynchronization on this disk group due to change in policy settings. |
MB/Sec |
|
Resync write throughput caused by policy change |
Indicates the rate at which the data is written for performing resynchronization on this disk group due to change in policy settings. |
MB/Sec |
|
Resync read throughput caused by decommission |
Indicates the rate at which the data is read for performing resynchronization on this disk group due to the decommission. |
MB/Sec |
|
Resync write throughput caused by decommission |
Indicates the rate at which the data is written for performing resynchronization on this disk group due to the decommission. |
MB/Sec |
|
Resync read throughput caused by rebalancing objects |
Indicates the rate at which the data is read for performing resynchronization on this disk group while rebalancing the objects. |
MB/Sec |
|
Resync write throughput caused by rebalancing objects |
Indicates the rate at which the data is written for performing resynchronization on this disk group while rebalancing the objects. |
MB/Sec |
|
Resync read throughput caused by object repair |
Indicates the rate at which the data is read for performing resynchronization on this disk group caused by repairing the objects. |
MB/Sec |
|
Resync write throughput caused by object repair |
Indicates the rate at which the data is written for performing resynchronization on this disk group caused by repairing the objects. |
MB/Sec |
|
Resync read latency caused by policy change |
Indicates the average time taken to perform read operations for performing resynchronization due to change in policy settings. |
Seconds |
These measures report the average latency of resync traffic - including resynchronization caused by policy change, repair, maintenance mode/evacuation, and rebalancing - from the perspective of the vSAN backend. |
Resync write latency caused by policy change |
Indicates the average time taken to perform write operations for performing resynchronization due to change in policy settings. |
Seconds |
|
Resync read latency caused by decommission |
Indicates the average time taken to perform read operations during resynchronization due to decommission. |
Seconds |
|
Resync write latency caused by decommission |
Indicates the average time taken to perform write operations during resynchronization due to decommission. |
Seconds |
|
Resync read latency caused by rebalancing objects |
Indicates the average time taken to perform read operations during resynchronization caused by rebalancing the objects. |
Seconds |
|
Resync write latency caused by rebalancing objects |
Indicates the average time taken to perform write operations during resynchronization caused by rebalancing the objects. |
Seconds |
|
Resync read latency caused by object repair |
Indicates the average time taken to perform read operations during resynchronization due to object repair. |
Seconds |
|
Resync write latency caused by object repair |
Indicates the average time taken to perform write operations during resynchronization due to object repair. |
Seconds |
|
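Several of the measures above can be combined into simple derived indicators when post-processing this test's output. The following is a minimal sketch with hypothetical sample values; the thresholds shown are illustrative, not product defaults:

```python
def used_pct(used_gb: float, capacity_gb: float) -> float:
    """Percentage of the disk group's capacity that is in use."""
    return 100.0 * used_gb / capacity_gb

def needs_attention(used_gb, capacity_gb, cache_hit_pct, wb_free_pct,
                    space_limit=80.0, hit_floor=90.0, wb_floor=20.0):
    """Flag a disk group whose space usage, read cache hit rate, or
    write-buffer headroom crosses an (illustrative) threshold."""
    issues = []
    if used_pct(used_gb, capacity_gb) > space_limit:
        issues.append("space")
    if cache_hit_pct < hit_floor:
        issues.append("read-cache")
    if wb_free_pct < wb_floor:
        issues.append("write-buffer")
    return issues

# Hypothetical disk group sample: 1700 GB used of 2000 GB,
# 95% cache hit rate, 15% write buffer free
print(needs_attention(1700, 2000, cache_hit_pct=95.0, wb_free_pct=15.0))
# -> ['space', 'write-buffer']   (85% used, healthy hit rate, low buffer headroom)
```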