NetApp Aggregates Test

To support the differing security, backup, performance, and data sharing needs of your users, you group the physical data storage resources on your storage system into one or more aggregates. These aggregates provide storage to the volume or volumes that they contain. Each aggregate has its own RAID configuration, plex structure, and set of assigned disks or array LUNs.

You must periodically monitor the state, I/O activity, processing power, and space usage of each aggregate configured on your storage system, so that probable space contention and I/O overloads are detected rapidly, and failed, inconsistent, or busy aggregates are easily identified. In addition, to accurately pinpoint failed checksum storage, problematic RAID groups, or plex resynchronization issues in an aggregate, the key components of each aggregate, such as RAID groups, plex structures, and checksum disks, should also be monitored from time to time. The NetApp Aggregates test provides all these performance insights. This test auto-discovers the aggregates configured on a storage system, and periodically helps answer the following questions:

  • What is the current state of each aggregate?
  • Which are the busy aggregates?
  • Is any aggregate running short of storage space?
  • Is I/O load uniformly distributed across all aggregates, or is any aggregate overloaded with read-write requests?
  • What is the current status of the checksum storage in each aggregate?
  • What is the current status of the plex structures in each aggregate?
  • Are the RAID groups in an aggregate in a normal state?
  • Did any aggregate experience issues during plex resynchronization?

Target of the test : A NetApp Unified Storage

Agent deploying the test : An external/remote agent

Outputs of the test : One set of results for each aggregate on the NetApp storage system being monitored.

Configurable parameters for the test
Parameters Description

Test Period

How often should the test be executed.

Host

The host for which the test is to be configured.

Port

In the Port text box, specify the port at which the host listens. By default, this is NULL.

User

Here, specify the name of the user who possesses the following privileges:

login-http-admin,api-aggr-check-spare-low,api-aggr-list-info,api-aggr-mediascrub-list-info,api-aggr-scrub-list-info,api-cifs-status,api-clone-list-status,api-disk-list-info,api-fcp-adapter-list-info,api-fcp-adapter-stats-list-info,api-fcp-service-status,api-file-get-file-info,api-file-read-file,api-iscsi-connection-list-info,api-iscsi-initiator-list-info,api-iscsi-service-status,api-iscsi-session-list-info,api-iscsi-stats-list-info,api-lun-config-check-alua-conflicts-info,api-lun-config-check-cfmode-info,api-lun-config-check-info,api-lun-config-check-single-image-info,api-lun-list-info,api-nfs-status,api-perf-object-get-instances-iter*,api-perf-object-instance-list-info,api-quota-report-iter*,api-snapshot-list-info,api-vfiler-list-info,api-volume-list-info-iter*.

If no such user exists, you can create a special user for this purpose using the steps detailed in Creating a New User with the Privileges Required for Monitoring the NetApp Unified Storage.

Password

Specify the password that corresponds to the above-mentioned User.

Confirm Password

Confirm the Password by retyping it here.

Authentication Mechanism

To collect metrics from the NetApp Unified Storage system, the eG agent connects to the ONTAP management APIs over HTTP or HTTPS. By default, this connection is authenticated using the LOGIN_PASSWORD authentication mechanism, which is why LOGIN_PASSWORD is displayed as the default authentication mechanism.

Use SSL

Set the Use SSL flag to Yes, if SSL (Secured Socket Layer) is to be used to connect to the NetApp Unified Storage System, and No if it is not.

API Port

By default, in most environments, the NetApp Unified Storage system listens on port 80 (if not SSL-enabled) or on port 443 (if SSL-enabled) only. Accordingly, the API Port parameter is set to default, which means that the eG agent chooses the port based on the Use SSL flag: if Use SSL is set to No, the agent connects to the NetApp Unified Storage system using port 80, and if Use SSL is set to Yes, the agent connects using port 443.

In some environments, however, the default ports 80 and 443 might not apply. In such a case, specify against the API Port parameter the exact port at which the NetApp Unified Storage system in your environment listens, so that the eG agent communicates with that port for collecting metrics.
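
The port selection described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the eG agent's actual code; the function name resolve_api_port and its arguments are hypothetical.

```python
from typing import Union

DEFAULT_HTTP_PORT = 80    # used when Use SSL is No
DEFAULT_HTTPS_PORT = 443  # used when Use SSL is Yes


def resolve_api_port(api_port: Union[str, int], use_ssl: bool) -> int:
    """Return the port to connect to, following the rule described above.

    api_port is either the literal string "default" or an explicit port.
    """
    if isinstance(api_port, str) and api_port.strip().lower() == "default":
        return DEFAULT_HTTPS_PORT if use_ssl else DEFAULT_HTTP_PORT
    port = int(api_port)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port
```

For instance, resolve_api_port("default", use_ssl=True) yields 443, while an explicit value such as "8443" is used as given regardless of the Use SSL flag.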

vFilerName

A vFiler is a virtual storage system you create using MultiStore, which enables you to partition the storage and network resources of a single storage system so that it appears as multiple storage systems on the network. If the NetApp Unified Storage system is partitioned to accommodate a set of vFilers, specify the name of the vFiler that you wish to monitor in the vFilerName text box. In some environments, the NetApp Unified Storage system may not be partitioned at all. In such a case, the NetApp Unified Storage system is monitored as a single vFiler and hence the default value of none is displayed in this text box.

Timeout

Specify the duration (in seconds) beyond which the test will time out if no response is received from the device. The default is 120 seconds.

Transfers Threshold

You can set a threshold value for the rate at which transfers are serviced by an aggregate. If you specify such a value in the Transfers Threshold text box, the aggregates that violate this threshold will be classified as Busy aggregates. The default value is 15 (Transfers/Sec). This parameter is deprecated in v5.6.5 (and above).
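
As a rough illustration of how such a threshold separates busy aggregates from the rest, consider the following Python sketch. It assumes that "violating" the threshold means exceeding it; the function name busy_aggregates and the sample aggregate names are hypothetical and are not part of the product.

```python
from typing import Dict, List


def busy_aggregates(transfer_rates: Dict[str, float],
                    threshold: float = 15.0) -> List[str]:
    """Return the names of aggregates whose transfer rate exceeds the threshold.

    transfer_rates maps an aggregate name to its transfers per second.
    Assumption: an aggregate exactly at the threshold is not counted as busy.
    """
    return sorted(name for name, rate in transfer_rates.items()
                  if rate > threshold)


# Hypothetical sample data, in transfers per second
rates = {"aggr0": 4.0, "aggr1": 22.5, "aggr2": 15.0}
print(busy_aggregates(rates))
```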

DD Frequency

Refers to the frequency with which detailed diagnosis measures are to be generated for this test. The default is 1:1. This indicates that, by default, detailed measures will be generated every time this test runs, and also every time the test detects a problem. You can modify this frequency, if you so desire. Also, if you intend to disable the detailed diagnosis capability for this test, you can do so by specifying none against DD frequency.
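
The following Python sketch shows one way to interpret a DD Frequency value. It assumes that the value has the form normal:abnormal, with the first number applying to normal test runs and the second to runs in which a problem is detected; this ordering and the function name parse_dd_frequency are assumptions made for illustration.

```python
from typing import Optional, Tuple


def parse_dd_frequency(value: str) -> Optional[Tuple[int, int]]:
    """Parse a DD Frequency setting such as "1:1" or "none".

    Returns None when detailed diagnosis is disabled ("none"); otherwise
    returns (normal, abnormal), assuming that order for the two numbers.
    """
    text = value.strip().lower()
    if text == "none":
        return None
    # Unpacking fails with ValueError unless there are exactly two parts
    normal, abnormal = (int(part) for part in text.split(":"))
    if normal < 0 or abnormal < 0:
        raise ValueError(f"frequencies must not be negative: {value!r}")
    return normal, abnormal
```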

Detailed Diagnosis

To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, choose the Off option.

The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled:

  • The eG manager license should allow the detailed diagnosis capability.
  • Both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
Measurements made by the test
Measurement Description Measurement Unit Interpretation

NetApp aggregates

Indicates the number of busy aggregates in the storage system.

Number

This measure is applicable only to the Busy Aggregates descriptor.

The detailed diagnosis capability of this measure, if enabled, lists the name and the transfer rate of each busy aggregate, i.e., the rate at which data transfers are serviced by that aggregate.

This measure is deprecated in v5.6.5 (and above).

State

Indicates the current state of this aggregate.

 

The values that this measure can report and their corresponding numeric values have been listed in the table below. A brief description for each Measure Value is also provided:

Measure Value Numeric Value Description
Creating 1  
Online 2 Read and write access to volumes hosted on this aggregate is allowed.
Restricted 3 Some operations, such as parity reconstruction, are allowed, but data access is not allowed.
Iron Restricted 4 A WAFL consistency check is being performed on the aggregate.
Partial 5 At least one disk was found for the aggregate, but two or more disks are missing.
Offline 6 No access to the aggregate is allowed.
Failed 7  
Unknown 8  

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current status of an aggregate. However, in the graph of this measure, states will be represented using the corresponding numeric equivalents i.e., 1 to 8.
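
When reading the graph of this measure, it can help to translate the plotted numbers back into state names. The Python sketch below encodes the mapping from the table above; the names STATE_LABELS and state_label are hypothetical helpers and do not belong to any product API.

```python
# Numeric value -> Measure Value, as listed in the table above
STATE_LABELS = {
    1: "Creating",
    2: "Online",
    3: "Restricted",
    4: "Iron Restricted",
    5: "Partial",
    6: "Offline",
    7: "Failed",
    8: "Unknown",
}


def state_label(numeric_value: int) -> str:
    """Translate a graphed numeric state into its Measure Value."""
    return STATE_LABELS.get(numeric_value, f"Unrecognized ({numeric_value})")
```

For example, a plotted value of 5 corresponds to the Partial state.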

Is aggregate inconsistent?

Indicates whether/not this aggregate is inconsistent.

 

One of the reasons why an aggregate is marked as inconsistent or corrupted is that the Lost write protection feature has detected an issue. Lost write protection is a feature of Data ONTAP that runs on each WAFL read. Data is checked against block checksum information (WAFL context) and RAID parity data. If an issue is detected, there are two possible outcomes:

  1. The drive containing the data is failed.
  2. The aggregate containing the data is marked inconsistent.

If an aggregate is marked inconsistent, WAFL iron must be used to return the aggregate to a consistent state.

This measure indicates a value of Yes if the aggregate is inconsistent and the value No if the aggregate is not inconsistent. The numeric values that correspond to the above-mentioned values are detailed in the table below:

Measure Value Numeric Value
Yes 1
No 2

Note:

By default, this measure reports the above-mentioned Measure Values while indicating whether/not this aggregate is inconsistent. However, in the graph of this measure, the inconsistent state of an aggregate will be represented using the corresponding numeric equivalents i.e., 1 or 2.

 

Mirror status

Indicates the current mirror status of this aggregate.

 

The values that this measure can report and their corresponding numeric values have been listed in the table below. A brief description for a few Measure Values is also provided:

Measure Value Numeric Value Description
Unmirrored 1 The aggregate is not mirrored. Unmirrored aggregates have only one plex (copy of their data), which contains all of the RAID groups belonging to that aggregate.
Mirrored 2 The aggregate is mirrored. Mirrored aggregates have two plexes (copies of their data), which use the SyncMirror functionality to duplicate the data to provide redundancy.
Mirror Resynchronizing 3 One of the mirrored aggregate's plexes is being resynchronized.
Un Initialized 4  
CP Count Check In Progress 5 WAFL consistency check is in progress.
Needs CP Count Check 6 WAFL consistency check needs to be performed on the aggregate.
Mirror Degraded 7 The aggregate is mirrored and one of its plexes is offline or resynchronizing.
Invalid 8 The aggregate contains no volumes and none can be added. Typically this happens only after an aborted aggr copy operation.
Failed 9  
Limbo 10  

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current mirror status of this aggregate in this storage system. However, in the graph of this measure, the mirror status will be represented using the corresponding numeric equivalents - i.e., 1 to 10.

Is Raid state abnormal?

Indicates whether/not the RAID of this aggregate is in an abnormal state currently.

 

 

This measure indicates a value of Yes if the RAID of this aggregate is in an abnormal state and the value No if the RAID of this aggregate is normal. The numeric values that correspond to the above-mentioned values are detailed in the table below:

Measure Value Numeric Value
Yes 1
No 2

Note:

By default, this measure reports the above-mentioned Measure Values while indicating whether the RAID of this aggregate is in an abnormal state. However, in the graph of this measure, the RAID states will be represented using the corresponding numeric equivalents i.e., 1 or 2.

Checksum status

Indicates the current checksum status of this aggregate.

 

The values that this measure can report and their corresponding numeric values have been listed in the table below.

Measure Value Numeric Value
Active 1
Off 2
Reverting 3
None 4
Unknown 5
Initializing 6
Reinitializing 7
Reinitialized 8
Upgrading Phase1 9
Upgrading Phase2 10

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current checksum status of this aggregate. However, in the graph of this measure, the checksum status will be represented using the corresponding numeric equivalents i.e., 1 to 10.

Are plexes offline?

Indicates whether/not the plexes in this aggregate are currently offline.

 

A plex is a collection of one or more RAID groups that together provide the storage for one or more WAFL file system volumes. Data ONTAP uses plexes as the unit of RAID-level mirroring when the SyncMirror feature is enabled. All RAID groups in one plex are of the same level, but may have a different number of disks.

This measure reports the value Yes if the plexes in this aggregate are currently offline and the value No if the plexes are not offline. The numeric values that correspond to the above-mentioned values are detailed in the table below:

Measure Value Numeric Value
No 1
Yes 2

Note:

By default, this measure reports the above-mentioned Measure Values while indicating whether the plexes in this aggregate are currently offline or not. However, in the graph of this measure, the state of the plexes will be represented using the corresponding numeric equivalents i.e., 1 or 2.

Are plexes resyncing?

Indicates whether/not the plexes of this aggregate are currently being resynchronized.

 

Plex resynchronization is a process that ensures two plexes of a mirrored aggregate have exactly the same data. When plexes are unsynchronized, one plex contains data that is more up to date than that of the other plex. Plex resynchronization updates the out-of-date plex so that both plexes are identical.

Data ONTAP resynchronizes the two plexes of a mirrored aggregate if one of the following situations occurs:

  • One of the plexes was taken offline and then brought online later.
  • You add a plex to an unmirrored aggregate.

This measure reports the value Yes if the plexes in this aggregate are currently resyncing and the value No if the plexes are not resyncing. The numeric values that correspond to the above-mentioned values are detailed in the table below:

Measure Value Numeric Value
No 1
Yes 2

Note:

By default, this measure reports the above-mentioned Measure Values while indicating whether the plexes in this aggregate are currently being resynchronized or not. However, in the graph of this measure, the resynchronization state of the plexes will be represented using the corresponding numeric equivalents i.e., 1 or 2.

Total size

Indicates the total usable size of this aggregate.

MB

The size of this aggregate excludes the WAFL reserve and the aggregate snapshot reserve. This measure will report a value of 0 if the aggregate is restricted or offline.

Aggregate used size

Indicates the amount of space that is currently used in this aggregate.

MB

This measure will report a value of 0 if the aggregate is not usable, i.e., if it is offline.

Percentage size used

Indicates the percentage of space that is currently used in this aggregate.

Percent

A value close to 100% is an indication of space constraint in the aggregate.
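
The relationship between this measure and the Total size and Aggregate used size measures can be sketched as below. This assumes that the percentage is simply used size divided by total usable size; the test's exact computation is not documented here, and the function name percent_used is hypothetical.

```python
from typing import Optional


def percent_used(used_mb: float, total_mb: float) -> Optional[float]:
    """Return the percentage of aggregate space in use, or None if unknown.

    Total size is reported as 0 for restricted or offline aggregates,
    so a percentage cannot be derived in that case.
    """
    if total_mb <= 0:
        return None
    return round(100.0 * used_mb / total_mb, 2)
```

With hypothetical values of 750 MB used out of 1000 MB, the result is 75.0 percent.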

Total files

Indicates the total number of files in this aggregate.

Number

 

Used files

Indicates the total number of files that are currently stored in this aggregate.

Number

 

Transfers

Indicates the rate at which the transfers are serviced by this aggregate.

Ops/Sec

Compare the value of this measure across aggregates to identify the busy aggregates.

User reads

Indicates the rate at which read requests from users are serviced by this aggregate.

Ops/Sec

A consistent decrease in the value of this measure could indicate a bottleneck when processing read requests. Compare the value of this measure across aggregates to know which aggregates service read requests slowly.
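
One simple way to check a series of collected values for a consistent decrease is sketched below in Python. The definition used here (every sample in the most recent window is lower than the one before it) is an assumption chosen for illustration, as is the function name is_consistently_decreasing.

```python
from typing import Sequence


def is_consistently_decreasing(samples: Sequence[float], window: int = 5) -> bool:
    """Return True if the last `window` samples are strictly decreasing.

    Returns False when fewer than `window` samples are available.
    """
    recent = list(samples)[-window:]
    if len(recent) < window:
        return False
    # Compare each sample with the one that follows it
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))
```

A looser definition, such as a negative slope over a longer period, may suit noisy workloads better than a strict sample-by-sample comparison.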

User writes

Indicates the rate at which write requests from users are serviced by this aggregate.

Ops/Sec

A consistent decrease in the value of this measure could indicate a bottleneck when processing write requests. Compare the value of this measure across aggregates to know which aggregates are servicing write requests slowly.

CP reads

Indicates the rate at which the read request from the user is serviced during a Consistency Point (CP) operation in this aggregate.

Ops/Sec

A consistent decrease in the value of this measure could indicate that CP operations are slowing down the processing of read requests.

Block read rate

Indicates the rate at which the blocks are read from this aggregate upon a user request.

Ops/Sec

A consistent decrease in the value of this measure could indicate a bottleneck when processing read requests. Compare the value of this measure across aggregates to know which aggregates service block read requests slowly.

Block write rate

Indicates the rate at which the blocks are written to this aggregate upon a user request.

Ops/Sec

A consistent decrease in the value of this measure could indicate a bottleneck when processing write requests. Compare the value of this measure across aggregates to know which aggregates are servicing block write requests slowly.

Block read rate during CP

Indicates the rate at which the blocks are read from this aggregate during a Consistency Point (CP) operation.

Ops/Sec

A consistent decrease in the value of this measure could indicate that CP operations are slowing down the processing of read requests.