Azure NetApp Files Volumes Test

The Azure NetApp Files service is an enterprise-class, high-performance, metered file storage service. Azure NetApp Files supports any workload type and is highly available by default.

At the top of the storage hierarchy of this file service is the NetApp account. A NetApp account serves as an administrative grouping of the constituent capacity pools. A capacity pool is measured by its provisioned capacity, and contains one or more volumes. A volume is measured by logical capacity consumption and is scalable.

A volume's capacity consumption counts against its pool's provisioned capacity. This means that if one or more volumes in a pool consume storage space excessively, then other volumes in that pool will be forced to contend for the available capacity. This can adversely impact storage performance. To avoid this, administrators should monitor how each volume is using its allocated storage capacity, and promptly identify volumes that may soon outgrow their capacity assignments.

Besides adequate storage capacity, volumes also need sufficient throughput to process their workload. Typically, a volume will use as much throughput as the 'throughput limit' assigned to it allows. How this limit is computed varies according to the QoS type of the volume's capacity pool, which can be auto (the default) or manual. In an auto QoS capacity pool, throughput is assigned automatically to the volumes in the pool, proportional to the size quota assigned to each volume. In a manual QoS capacity pool, you can assign the capacity and throughput for a volume independently; however, the total throughput of all volumes in the pool is limited by the total throughput of the pool, which in turn is determined by the combination of the pool size and the service level. Service levels are defined and differentiated by the maximum throughput allowed for a volume in the capacity pool, based on the quota assigned to that volume.
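The way quota, service level, and QoS type combine into a throughput limit can be sketched numerically. The following is an illustrative sketch, not part of this test: the function names are hypothetical, and the MiB/s-per-TiB rates are taken from the service-level table later in this document.

```python
# Illustrative sketch of Azure NetApp Files throughput-limit arithmetic.
# Rates (MiB/s per TiB of quota) match the documented service levels.
SERVICE_LEVEL_MIBPS_PER_TIB = {
    "Standard": 16,
    "Premium": 64,
    "Ultra": 128,
}

def auto_qos_throughput_limit(quota_tib: float, service_level: str) -> float:
    """Auto QoS: a volume's limit is proportional to its size quota."""
    return quota_tib * SERVICE_LEVEL_MIBPS_PER_TIB[service_level]

def manual_qos_pool_throughput(pool_size_tib: float, service_level: str) -> float:
    """Manual QoS: per-volume throughput is assigned independently, but the
    sum across all volumes may not exceed the pool's total throughput."""
    return pool_size_tib * SERVICE_LEVEL_MIBPS_PER_TIB[service_level]

# A 2 TiB Premium volume in an auto QoS pool gets 2 * 64 = 128 MiB/s.
print(auto_qos_throughput_limit(2, "Premium"))   # 128
# A 4 TiB Ultra capacity pool can distribute up to 512 MiB/s among its volumes.
print(manual_qos_pool_throughput(4, "Ultra"))    # 512
```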

If a volume does not have sufficient throughput, then it will not be able to process its workload quickly, which in turn can lead to serious latencies in I/O processing. To avoid this, administrators should track the IOPS performed on each volume, measure the time taken by every volume to process the IOPS, and rapidly capture those volumes that are experiencing latencies in I/O processing.

Also, for data protection, you can asynchronously replicate data from an Azure NetApp Files volume (source) in one region to another Azure NetApp Files volume (destination) in another region. This capability enables you to fail over your critical application in case of a region-wide outage or disaster. If the destination volume is unhealthy, or if the data transfer between the source and destination takes too long, then in the event that disaster strikes, significant data loss may occur. To avoid this, administrators must continuously track the health of the destination volumes and observe data transfers between volumes, promptly capture abnormalities, and rapidly initiate corrective measures.

With the help of the Azure NetApp Files Volumes test, administrators can monitor the capacity and throughput usage of each volume, the replication activity on every volume, and rapidly point to problematic volumes.

For each volume, this test reports the allocated capacity, measures the capacity usage, and accurately pinpoints those volumes that are about to exhaust their allocations. This way, the test leads administrators to those volumes for which the size quota may have to be reset. The test also tracks the read/write operations performed on each volume, and proactively alerts administrators if high latencies are noticed in I/O processing on any volume. In the process, administrators can determine whether or not the throughput provisioning for volumes needs to be adjusted to improve their processing power. Furthermore, the test also scrutinizes the replication activity on every volume, and points to potential problem areas, so that administrators can endeavour to eliminate replication bottlenecks and avoid data loss during disaster recovery.

Target of the Test: A Microsoft Azure Subscription

Agent deploying the test: A remote agent

Output of the test: One set of results for every volume configured for each resource group in the target Azure subscription

Configurable parameters for the test
Parameters Description

Test Period

How often should the test be executed.

Host

The host for which the test is to be configured.

Subscription ID

Specify the GUID which uniquely identifies the Microsoft Azure Subscription to be monitored. To know the ID that maps to the target subscription, do the following:

  1. Login to the Microsoft Azure Portal.

  2. When the portal opens, click on the Subscriptions option (as indicated by Figure 1).

    Figure 1 : Clicking on the Subscriptions option

  3. Figure 2 that appears next will list all the subscriptions that have been configured for the target Azure AD tenant. Locate the subscription that is being monitored in the list, and check the value displayed for that subscription in the Subscription ID column.

    Figure 2 : Determining the Subscription ID

  4. Copy the Subscription ID in Figure 2 to the text box corresponding to the SUBSCRIPTION ID parameter in the test configuration page.

Tenant ID

Specify the Directory ID of the Azure AD tenant to which the target subscription belongs. To know how to determine the Directory ID, refer to Configuring the eG Agent to Monitor the Microsoft Azure App Service.

Client ID and Client Password

The eG agent communicates with the target Microsoft Azure Subscription using Java API calls. To collect the required metrics, the eG agent requires an Access token in the form of an Application ID and the client secret value. To know how to determine the Application ID and the key, refer to Configuring the eG Agent to Monitor the Microsoft Azure App Service. Specify the Application ID of the created Application in the Client ID text box and the client secret value in the Client Password text box.

Proxy Host

In some environments, all communication with the Azure cloud may be routed through a proxy server. In such environments, you should make sure that the eG agent connects to the cloud via the proxy server to collect metrics. To enable metrics collection via a proxy, specify the IP address of the proxy server and the port at which it listens against the Proxy Host and Proxy Port parameters. By default, these parameters are set to none, indicating that the eG agent is not configured to communicate via a proxy.

Proxy Username, Proxy Password and Confirm Password

If the proxy server requires authentication, then specify a valid proxy user name and password in the Proxy Username and Proxy Password parameters, respectively. Then, confirm the password by retyping it in the Confirm Password text box.

DD Frequency

Detailed Diagnosis

To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option.

The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

  • The eG manager license should allow the detailed diagnosis capability
  • Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
Measures made by the test:
Measurement Description Measurement Unit Interpretation

Details of volumes

Indicates the number of volumes configured for this storage.

Number

Use the detailed diagnosis of this measure to know the protocol type and mount path of the volume, and the capacity pool to which the volume belongs.

Quota

Indicates the set size of the volume.

GB

The storage capacity of an Azure NetApp Files volume is limited to the set size (quota) of the volume.

Service level

Indicates the current service level of this volume.

 

The values reported by this measure and its numeric equivalents are mentioned in the table below:

Measure Value Numeric Value

Description

Standard 1

Provides up to 16 MiB/s of throughput per 1 TiB of capacity provisioned.

Premium 2

Provides up to 64 MiB/s of throughput per 1 TiB of capacity provisioned.

Ultra 3

Provides up to 128 MiB/s of throughput per 1 TiB of capacity provisioned.

Note:

By default, this measure reports the Measure Values listed in the table above to indicate the current service level of each volume. The graph of this measure however, represents the service level of a volume using the numeric equivalents only.

Percentage volume consumed size

Indicates the percentage of storage capacity allocated to this volume that is currently in use.

Percent

If the value of this measure is close to 100%, it implies that the volume is rapidly consuming the storage space allocated to it, and will soon run out of space.

If a volume maxes out on space - i.e., if the value of this measure is 100% - then the volume will receive an “out of space” condition. In such a situation, you may want to resize the volume or capacity pool.
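The alerting logic implied by this interpretation can be sketched as follows. This is an illustrative sketch, not the test's actual implementation: the function name and the 90% warning threshold are assumptions.

```python
def capacity_status(consumed_gb: float, allocated_gb: float,
                    warn_pct: float = 90.0) -> str:
    """Classify a volume's capacity usage the way an administrator might.

    Returns "out-of-space" at 100% consumption (the volume receives an
    "out of space" condition), "warning" when usage nears the quota, and
    "ok" otherwise. The 90% warning threshold is an illustrative choice.
    """
    pct = consumed_gb / allocated_gb * 100
    if pct >= 100:
        return "out-of-space"
    if pct >= warn_pct:
        return "warning"
    return "ok"

print(capacity_status(450, 500))   # 90% used -> "warning"
print(capacity_status(500, 500))   # 100% used -> "out-of-space"
```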

Volume allocated size

Indicates the storage space allocated to this volume.

GB

 

Volume consumed size

Indicates the storage space currently consumed by this volume.

GB

Ideally, the value of this measure should be low.

If the value of this measure is very close to that of the Volume allocated size measure, it is a cause for concern, as it means that the volume is fast running out of space.

If the value of this measure is equal to that of the Volume allocated size measure, then the volume will receive an “out of space” condition. In such a situation, you may want to resize the volume or capacity pool.

Volume snapshot size

Indicates the space in this volume that is occupied by snapshots.

GB

An Azure NetApp Files snapshot is a point-in-time file system (volume) image. It is ideal to serve as an online backup. You can use a snapshot to create a new volume (clone), restore a file, or revert a volume.

If a volume is found to be running out of storage space, then you can use this measure to determine whether/not snapshots are hogging space in that volume. Removing unwanted snapshots can help free up storage space in the volume.

Total latency

Indicates the average time taken to read from / write to this volume.

Milliseconds

A consistent rise in the value of this measure is a cause for concern. It implies that the volume is unable to process read/write requests quickly. To know where the slowness is more pronounced - when reading? or when writing? - compare the Average read latency and Average write latency measures for that volume.

A probable cause for high latency is insufficient throughput. To avoid this, you may want to consider increasing the throughput limit - i.e., throughput provisioned - of the volume.

The throughput limit for a volume with automatic QoS is determined by a combination of the quota assigned to the volume and the service level selected. For volumes with manual QoS, the throughput limit can be defined individually.

In case of volumes with automatic QoS, you will have to increase the volume quota and/or change the service level to up the ceiling on throughput. For example, if an automatic QoS volume in the Premium storage tier has only 500 GiB of data but requires 128 MiB/s of throughput, you can set the quota to 2 TiB so that the throughput level is set accordingly (64 MiB/s per TB * 2 TiB = 128 MiB/s). However, note that adjusting volume quota might require a change in capacity pool size.

If you consistently overprovision a volume for achieving a higher throughput, consider using the manual QoS volumes or using a higher service level instead. In the example above, you can achieve the same throughput limit with half the automatic QoS volume quota by using the Ultra storage tier instead (128 MiB/s per TiB * 1 TiB = 128 MiB/s).

If you use manual QoS volumes, you do not have to overprovision the volume quota to achieve a higher throughput because the throughput can be assigned to each volume independently. However, you still need to ensure that the capacity pool is pre-provisioned with sufficient throughput for your performance needs. The throughput of a capacity pool is provisioned according to its size and service level.

Average read latency

Indicates the average time for reads from this volume.

Milliseconds

 

Average write latency

Indicates the average time for writes to this volume.

Milliseconds

 

Total iops

Indicates the rate at which I/O operations are processed by this volume.

Operations/sec

A consistent drop in the value of this measure is a sign of processing bottlenecks in the volume. To know which type of operations were affected more by the processing latency, compare the value of the Read iops and Write iops measures for that volume.

Read iops

Indicates the number of reads to this volume per second.

Operations/sec

 

Write iops

Indicates the number of writes to this volume per second.

Operations/sec

 

Is volume replication status healthy?

Indicates the condition of the replication relationship of this volume.

 

The values reported by this measure and its numeric equivalents are mentioned in the table below:

Measure Value Numeric Value
Healthy 1
Unhealthy 0

This measure will report the value Unhealthy if previous replication jobs are not complete. This status is a result of large volumes being transferred with a lower transfer window (for example, a ten-minute transfer time for a large volume).

Note:

By default, this measure reports the Measure Values listed in the table above to indicate the current health of the replication relationship of the volume. The graph of this measure however, is represented using the numeric equivalents only.

Is volume replication transferring?

Indicates the replication status of this volume.

 

The values reported by this measure and its numeric equivalents are mentioned in the table below:

Measure Value Numeric Value

Description

Transferring 1

A transfer operation is in progress and future transfers are not disabled.

Idle 0

No transfer operation is in progress and future transfers are not disabled

If the value of the Is volume replication status healthy measure is Unhealthy, and the value of this measure is Transferring, it could imply that previous replication jobs are not complete. This status is a result of large volumes being transferred with a lower transfer window (for example, a ten-minute transfer time for a large volume).

Note:

By default, this measure reports the Measure Values listed in the table above to indicate whether or not a replication transfer is in progress for each volume. The graph of this measure however, represents the replication status of a volume using the numeric equivalents only.
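The interpretation above combines two measures into one diagnosis. A minimal sketch of that decision logic (the function name and return strings are illustrative, not part of the test):

```python
def replication_diagnosis(healthy: bool, transferring: bool) -> str:
    """Combine the 'Is volume replication status healthy?' and
    'Is volume replication transferring?' measures, per the guidance above."""
    if not healthy and transferring:
        # Likely cause: previous replication jobs have not completed, e.g. a
        # large volume being transferred within a short transfer window.
        return "unhealthy-and-transferring: previous jobs may be incomplete"
    if not healthy:
        return "unhealthy: investigate the replication relationship"
    return "healthy"

print(replication_diagnosis(False, True))
```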

Volume replication lag time

Indicates the average time taken to replicate this volume.

Seconds

A high value for this measure means replication is taking a long time.

Volume replication last transfer duration

Indicates the time it took for the last transfer from/to this volume to complete.

Seconds

If the value of this measure is high, it means that the last transfer took a long time. If the Volume replication lag time is unusually high, then you may want to check the value of this measure to know if this is owing to the last transfer.

Volume replication last transfer size

Indicates the total number of bytes transferred as part of the last transfer from/to this volume.

Bytes

If the Volume replication last transfer duration is abnormally high, then look up the value of this measure to determine if the delay is owing to the large size of the last transfer.
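The triage chain described for these replication measures (lag time, then last transfer duration, then last transfer size) can be sketched as follows. All threshold values and names here are illustrative assumptions, not values used by the test:

```python
def replication_triage(lag_s: float, last_transfer_s: float,
                       last_transfer_bytes: int,
                       lag_threshold_s: float = 3600,
                       duration_threshold_s: float = 1800,
                       large_transfer_bytes: int = 100 * 2**30) -> str:
    """Trace a high replication lag back to its likely cause, following the
    interpretation guidance above. All thresholds are illustrative."""
    if lag_s <= lag_threshold_s:
        return "replication lag is within bounds"
    if last_transfer_s <= duration_threshold_s:
        return "high lag, but not caused by the last transfer"
    if last_transfer_bytes >= large_transfer_bytes:
        return "high lag: last transfer was slow because it was large"
    return "high lag: last transfer was slow despite its modest size"

# 2-hour lag, 40-minute last transfer of 200 GiB -> the transfer's size
# explains the slowness.
print(replication_triage(7200, 2400, 200 * 2**30))
```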

Volume replication progress

Indicates the total amount of data transferred for the current transfer operation from/to this volume.

Bytes

 

Volume replication total transfer

Indicates the cumulative bytes transferred for the relationship.

Bytes

 

Use the detailed diagnosis of the Details of volumes measure to know the protocol type and mount path of the volume, and the capacity pool to which the volume belongs.

Figure 3 : The detailed diagnosis of the Details of volumes measure