Job Status Test
A job is any asynchronous task performed on the NetApp Cluster. Jobs are typically long-running volume operations such as copy, move, and mirror. You can monitor, pause, stop, and restart jobs, and configure them to run on specified schedules.
A job can belong to any of the following three categories:
- Server-Affiliated jobs: These jobs are queued by the management framework to a specific node to be run.
- Cluster-Affiliated jobs: These jobs are queued by the management framework to any node in the cluster to be run.
- Private jobs: These jobs are specific to a node and do not use the replicated database (RDB) or any other cluster mechanism.
Jobs are placed into a job queue and run when resources are available. If the jobs in the queue are not processed quickly, the queue grows long and the resulting overload slows down the NetApp Cluster. When this happens, administrators need to quickly determine which types of jobs are contributing to the overload and why: are jobs of this type failing frequently owing to errors, or is the cluster not adequately configured to handle them? The Job Status test helps administrators answer these questions.
This test auto-discovers the types of jobs in the queue and, for each job type, reports the count of jobs in each state (successful, running, rescheduled, failed, and so on). This way, the test sheds light on job types that fail often, those that are taking too long to complete, and the probable reasons for the same.
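The per-type, per-state counting described above can be illustrated with a minimal sketch. It is not the eG agent's implementation: the record fields `type` and `state`, the job type names, and the sample data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def count_states_by_type(jobs):
    """Group job records by type and count how many are in each state."""
    counts = defaultdict(Counter)
    for job in jobs:
        counts[job["type"]][job["state"]] += 1
    return counts


# Made-up sample data; "type" and "state" are hypothetical field names,
# and the job type names are placeholders.
sample_jobs = [
    {"type": "VOL_MOVE", "state": "success"},
    {"type": "VOL_MOVE", "state": "failure"},
    {"type": "VOL_MOVE", "state": "failure"},
    {"type": "SNAPMIRROR", "state": "running"},
    {"type": "SNAPMIRROR", "state": "queued"},
]

counts = count_states_by_type(sample_jobs)
for job_type, states in sorted(counts.items()):
    print(job_type, dict(states))
```

Each key of the outer dictionary corresponds to one descriptor (job type) of the test, and each inner count corresponds to one of the measures the test reports.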
Target of the test : A NetApp Cluster
Agent deploying the test : An external/remote agent
Outputs of the test : One set of results for each job type on the NetApp Cluster that is being monitored.
Parameters | Description |
---|---|
Test Period | How often should the test be executed. |
Host | The IP address of the storage controller cluster. |
Port | Specify the port at which the specified host listens. By default, this is NULL. |
User | Specify the name of a user who has the readonly role. If no such user exists, you can create a special user for this purpose using the steps detailed in Creating a New User with the Role Required for Monitoring the NetApp Cluster. |
Password | Specify the password that corresponds to the above-mentioned User. |
Confirm Password | Confirm the Password by retyping it here. |
Authentication Mechanism | To collect metrics from the NetApp Cluster, the eG agent connects to the ONTAP management APIs over HTTP or HTTPS. By default, this connection is authenticated using the LOGIN_PASSWORD authentication mechanism. This is why LOGIN_PASSWORD is displayed as the default Authentication Mechanism. |
Use SSL | Set the Use SSL flag to Yes if SSL (Secure Sockets Layer) is to be used to connect to the NetApp Cluster, and No if it is not. |
API Port | By default, in most environments, the NetApp Cluster listens on port 80 (if not SSL-enabled) or on port 443 (if SSL-enabled). Accordingly, when the API Port parameter is set to default, which is its default setting, the eG agent connects to port 80 if the Use SSL flag is set to No, and to port 443 if the Use SSL flag is set to Yes. In some environments, however, these default ports might not apply. In such a case, specify against the API Port parameter the exact port at which the NetApp Cluster in your environment listens, so that the eG agent communicates with that port to collect metrics. |
Exclude Aggregates | If you wish to exclude certain aggregates from the scope of monitoring, specify a comma-separated list of aggregates in this text box. By default, none will be displayed here. |
Records Per Call | By default, the eG agent executes API commands to query the aggregates in the target environment. In critical infrastructures spanning a large number of aggregates, a single execution by the eG agent may query (or download) a sizeable amount of monitoring data, thereby adding to the cluster load. To avoid this, you can tweak the Records Per Call parameter so that the eG agent obtains monitoring data iteratively in chunks instead of retrieving all of it in a single go. For example, if the eG agent is required to query 1000 aggregates, then specifying the value 100 in this text box will make the eG agent query 100 aggregates at a time, 10 times over, to obtain monitoring data from all the aggregates. By default, the value of this parameter is 10. |
Timeout | Specify the duration (in seconds) beyond which the test will time out if no response is received from the device. The default is 120 seconds. |
Detailed Diagnosis | To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, click on the Off option. The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled: |
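
Two of the parameters above describe behavior that is easy to picture in code: the API Port default (80 without SSL, 443 with SSL) and the Records Per Call chunking (1000 items fetched 100 at a time takes 10 calls). The following is a minimal sketch under stated assumptions, not the eG agent's code; the function names `resolve_api_port` and `batches` are hypothetical, and the actual API calls are left out.

```python
def resolve_api_port(api_port, use_ssl):
    """Return the port to connect to.

    "default" maps to 443 when SSL is used and to 80 otherwise;
    any other value is treated as an explicit port number.
    """
    if str(api_port).strip().lower() == "default":
        return 443 if use_ssl else 80
    return int(api_port)


def batches(items, records_per_call):
    """Yield successive slices holding at most records_per_call items."""
    if records_per_call < 1:
        raise ValueError("records_per_call must be at least 1")
    for start in range(0, len(items), records_per_call):
        yield items[start:start + records_per_call]


# Hypothetical list of 1000 object names to be queried in chunks of 100.
names = ["object-%d" % i for i in range(1000)]
calls = list(batches(names, 100))
print(resolve_api_port("default", use_ssl=True), len(calls))
```

A smaller Records Per Call value means more calls, each returning less data; a larger value means fewer, heavier calls.
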
Measurement | Description | Measurement Unit | Interpretation |
---|---|---|---|
Success jobs | Indicates the number of jobs of this job type that were completed successfully. | Number | A high value is desired for this measure. |
Initial jobs | Indicates the number of jobs of this job type that had been created but were yet to be queued. | Number | |
Running jobs | Indicates the number of jobs of this job type that were running after being picked up by an instance of the Job Manager. | Number | |
Waiting jobs | Indicates the number of jobs of this job type that were waiting for another job to complete. | Number | A high value for this measure may indicate an endlessly running job that needs to be terminated; if it is not terminated, a performance bottleneck may result. |
Queued jobs | Indicates the number of jobs of this job type that were queued for execution. | Number | Queued jobs could be run immediately or may be scheduled to run at a later time. |
Pausing jobs | Indicates the number of jobs of this job type that were in the process of pausing after being requested to pause. | Number | |
Paused jobs | Indicates the number of jobs of this job type that were paused indefinitely. | Number | |
Quitting jobs | Indicates the number of jobs of this job type that had been requested to terminate and were shutting down. | Number | |
Quit jobs | Indicates the number of jobs of this job type that had been requested to terminate. | Number | |
Reschedule jobs | Indicates the number of jobs of this job type that were rescheduled. | Number | |
Error jobs | Indicates the number of times an internal error occurred while processing the jobs of this job type. | Number | Ideally, the value of this measure should be zero. The detailed diagnosis of this measure, if enabled, lists the name of the vServer, the name of the job, the priority of the job, the description of the job, and the progress of the job. |
Failure jobs | Indicates the number of jobs of this job type that failed to execute. | Number | A low value is desired for this measure. The detailed diagnosis of this measure, if enabled, lists the name of the vServer, the name of the job, the priority of the job, the description of the job, and the progress of the job. |
Dead jobs | Indicates the number of jobs of this job type that exceeded the drop dead time and are being removed from the queue. | Number | The detailed diagnosis of this measure, if enabled, lists the name of the vServer, the name of the job, the priority of the job, the description of the job, and the progress of the job. |
Unknown jobs | Indicates the number of jobs of this job type that were in the Unknown state. | Number | The detailed diagnosis of this measure, if enabled, lists the name of the vServer, the name of the job, the priority of the job, the description of the job, and the progress of the job. |
Restart jobs | Indicates the number of jobs of this job type that were restarted. | Number | |
Dormant jobs | Indicates the number of jobs of this job type that were inactive while waiting on some external event. | Number | |
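
The interpretation guidance in the table can be turned into a simple check per job type. This is only an illustrative sketch: the function `review_job_type` is hypothetical, and the `waiting_limit` threshold is an arbitrary assumption, since the table does not state what counts as a high value for Waiting jobs.

```python
def review_job_type(measures, waiting_limit=5):
    """Return a list of findings for one job type's measures.

    measures maps measure names from the table to their values.
    waiting_limit is a hypothetical threshold chosen for illustration.
    """
    findings = []
    if measures.get("Error jobs", 0) > 0:
        findings.append("internal errors occurred; ideally this is zero")
    if measures.get("Failure jobs", 0) > 0:
        findings.append("some jobs failed to execute")
    if measures.get("Waiting jobs", 0) > waiting_limit:
        findings.append("many jobs are waiting; look for a job that never ends")
    return findings


# Made-up values for a single job type.
example = {"Success jobs": 40, "Error jobs": 2, "Failure jobs": 1, "Waiting jobs": 9}
for finding in review_job_type(example):
    print(finding)
```

In practice, such limits would normally be set as alarm thresholds in the monitoring system rather than hard-coded.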