Exadata Cell System Test

Storage cells are configured on the network, and are managed by the Oracle Exadata System Software CellCLI utility. Storage servers contain cell-based utilities and processes from Oracle Exadata System Software, including:

  • Cell Server (CELLSRV) - This is the primary component of the Oracle Exadata System Software running in the storage server, which provides the majority of the storage server services. CELLSRV services database requests for disk I/O and provides the advanced SQL offload capabilities.

  • Offload Server (CELLOFLSRV) - This is a helper process to the Cell Server that processes offload requests from a specific database version. These processes allow the storage server to respond to requests from multiple database versions residing on the same or multiple database servers.

  • Management Server (MS) - The primary interface to administer, manage and query the status of the storage server. It works in cooperation with the Cell Control Command-Line Interface (CellCLI) and processes most of the commands from CellCLI.

  • Restart Server (RS) - Monitors the heartbeat with the MS and the CELLSRV processes, and restarts the servers if they fail to respond within the allowable heartbeat period.

If any of the cell-based utilities is unavailable, offline, or stopped, the storage cell may slow down, resulting in poor I/O processing. A sudden hardware failure or a rise in the temperature of the storage cell may also cause the storage cell to malfunction. To avoid such problems and to ensure that the storage cell functions at peak efficiency, it is essential to keep a constant vigil on the performance of the storage cell. This is where the Exadata Cell System test helps!

This test monitors the status of the storage cell. It also monitors the cell-based utilities of the storage cell and reports the utilities that are offline or stopped. Failures of hardware components (power supply, fan) are proactively detected and reported. The physical memory utilization and CPU utilization of the cell server and management server help administrators identify the server that is consuming too many resources.

Target of the test : Oracle Exadata Storage Server

Agent deploying the test : A remote agent

Outputs of the test : One set of results for the target Oracle Exadata Storage Server that is being monitored

Configurable parameters for the test
Parameter Description

Test period

How often should the test be executed.

Host

The IP address of the host for which this test is to be configured.

Port

The port number at which the specified host listens. By default, this is NULL.

Username, Password and Confirm Password

By default, this test uses the Cell Control Command-Line Interface (CellCLI) to collect the required metrics. To use the CLI, the test first connects to the target storage server via SSH and then runs CellCLI commands. To run these commands, the test requires the credentials of a cellmonitor user. Specify the login credentials of such a user in the Username and Password text boxes, and confirm the password by retyping it in the Confirm Password text box.

SSH Port

This test uses CellCLI to pull metrics from the target Oracle Exadata Storage Server. To run the CLI commands, this test first needs to establish an SSH connection with the target storage server. To enable the test to establish this connection, specify the SSH Port here.

Timeout

Specify, in the Timeout text box, how long this test should wait for a response from the storage system. By default, this is 120 seconds.
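The Username, SSH Port, and Timeout parameters above describe an SSH session in which CellCLI commands are run. The following is a minimal Python sketch of that flow, not the test's actual implementation. It assumes key-based SSH authentication, the standard SSH port 22 as a default, and that the CellCLI command `list cell detail` is the one of interest; the function names and the sample output are hypothetical and shown only for illustration.

```python
import shlex
import subprocess


def build_cellcli_command(host, username, ssh_port=22,
                          cellcli_command="list cell detail"):
    """Return the argv list that runs one CellCLI command over SSH."""
    # 'cellcli -e <command>' executes a single command and exits.
    remote = "cellcli -e " + shlex.quote(cellcli_command)
    return ["ssh", "-p", str(ssh_port), f"{username}@{host}", remote]


def run_cellcli(host, username, ssh_port=22, timeout_secs=120):
    """Run the command and return its output (assumes key-based SSH login)."""
    cmd = build_cellcli_command(host, username, ssh_port)
    # timeout_secs mirrors the Timeout parameter (120 seconds by default).
    result = subprocess.run(cmd, capture_output=True, text=True,
                            timeout=timeout_secs, check=True)
    return result.stdout


def parse_attributes(output):
    """Turn 'name: value' lines into a dictionary of attributes."""
    attrs = {}
    for line in output.splitlines():
        # Split at the first colon only, so values such as '2:03' survive.
        key, sep, value = line.strip().partition(":")
        if sep:
            attrs[key.strip()] = value.strip()
    return attrs


# Illustrative output only; real output contains many more attributes.
SAMPLE_OUTPUT = """\
     status:             online
     fanStatus:          normal
     upTime:             30 days, 2:03
     cellsrvStatus:      running
"""

print(build_cellcli_command("192.0.2.10", "cellmonitor"))
print(parse_attributes(SAMPLE_OUTPUT))
```

Building the command as a list, rather than as a single shell string, avoids local shell quoting problems; only the remote CellCLI argument is quoted.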

Measurements made by the test

Measurement

Description

Measurement Unit

Interpretation

Cell status

Indicates the current status of the storage cell or target storage server.

 

The table below indicates the values that this measure can report and their corresponding numeric equivalents:

Measure value Numeric Value
Offline 0
Online 100

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current status of the storage cell. However, in the graph of this measure, the status of the storage cell will be represented using the corresponding numeric equivalents only - i.e., 0 or 100.

Fan status

Indicates the current status of the fan operating in the storage cell.

 

The table below indicates the values that this measure can report and their corresponding numeric equivalents:

Measure value Numeric Value
Normal 100
Warning 90
Critical 50

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current status of the fan. However, in the graph of this measure, the status of the fan will be represented using the corresponding numeric equivalents mentioned in the table above.

Temperature status

Indicates the current temperature status of the storage cell.

 

The table below indicates the values that this measure can report and their corresponding numeric equivalents:

Measure value Numeric Value
Normal 100
Warning 90
Critical 50

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current temperature status of the storage cell. However, in the graph of this measure, the temperature status will be represented using the corresponding numeric equivalents mentioned in the table above.

Power status

Indicates the current power status of the storage cell.

 

The table below indicates the values that this measure can report and their corresponding numeric equivalents:

Measure value Numeric Value
Normal 100
Warning 90
Critical 50

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current power status of the storage cell. However, in the graph of this measure, the power status will be represented using the corresponding numeric equivalents mentioned in the table above.

Cell server status

Indicates the current status of the cell server in the storage cell.

 

The table below indicates the values that this measure can report and their corresponding numeric equivalents:

Measure value Numeric Value
Stopped 0
Running 100

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current status of the cell server. However, in the graph of this measure, the status of the cell server will be represented using the corresponding numeric equivalents mentioned in the table above.

Management server status

Indicates the current status of the management server in the storage cell.

 

The table below indicates the values that this measure can report and their corresponding numeric equivalents:

Measure value Numeric Value
Stopped 0
Running 100

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current status of the management server. However, in the graph of this measure, the status of the management server will be represented using the corresponding numeric equivalents mentioned in the table above.

Restart server status

Indicates the current status of the Restart server in the storage cell.

 

The table below indicates the values that this measure can report and their corresponding numeric equivalents:

Measure value Numeric Value
Stopped 0
Running 100

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current status of the Restart server. However, in the graph of this measure, the status of the Restart server will be represented using the corresponding numeric equivalents mentioned in the table above.

Locator LED status

Indicates the current status of the Locator LED.

 

The table below indicates the values that this measure can report and their corresponding numeric equivalents:

Measure value Numeric Value
Off 0
On 100

Note:

By default, this measure reports the above-mentioned Measure Values while indicating the current status of the Locator LED. However, in the graph of this measure, the status of the Locator LED will be represented using the corresponding numeric equivalents mentioned in the table above.

Uptime

Indicates the total time that the storage cell has been up since its last reboot.

Mins

Administrators may wish to be alerted if a storage cell has been running without a reboot for a very long period. Setting a threshold for this metric allows administrators to determine such conditions.

Uptime since last measure

Indicates the time period for which the storage cell has been up since the last time this test ran.

Secs

If the storage cell has not been rebooted during the last measurement period and the agent has been running continuously, this value will be equal to the measurement period. If the storage cell was rebooted during the last measurement period, this value will be less than the measurement period of the test. For example, if the measurement period is 300 secs and the storage cell was rebooted 120 secs ago, this metric will report a value of 120 seconds. The accuracy of this metric depends on the measurement period: the smaller the measurement period, the greater the accuracy.
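The rule described for Uptime since last measure can be sketched as taking the smaller of the cell's current uptime and the measurement period. This is a minimal illustration that assumes the agent has been running continuously; the function name is hypothetical.

```python
def uptime_since_last_measure(uptime_secs, measurement_period_secs):
    """Seconds the cell has been up since the previous run of the test."""
    # If the cell rebooted within the period, its uptime is the shorter value;
    # otherwise the whole measurement period was spent up.
    return min(uptime_secs, measurement_period_secs)


# Rebooted 120 secs ago with a 300-sec measurement period.
print(uptime_since_last_measure(120, 300))    # 120
# Up for a full day, same measurement period.
print(uptime_since_last_measure(86400, 300))  # 300
```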

Is restarted?

Indicates whether or not the storage cell was restarted.

 

The table below indicates the values that this measure can report and their corresponding numeric equivalents:

Measure value Numeric Value
No 0
Yes 1

Note:

By default, this measure reports the above-mentioned Measure Values while indicating whether or not the storage cell was restarted. However, the graph of this measure will be represented using the corresponding numeric equivalents only.
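The Measure Value to Numeric Value tables for the status measures above can be collected into a single lookup, which is useful when translating graph values back into states. The sketch below copies the values from those tables; the dictionary and function names are hypothetical.

```python
# Measure Value -> Numeric Value, as listed in the tables above.
ON_OFF = {"Off": 0, "On": 100}
HEALTH = {"Normal": 100, "Warning": 90, "Critical": 50}
PROCESS = {"Stopped": 0, "Running": 100}

STATE_VALUES = {
    "Cell status": {"Offline": 0, "Online": 100},
    "Fan status": HEALTH,
    "Temperature status": HEALTH,
    "Power status": HEALTH,
    "Cell server status": PROCESS,
    "Management server status": PROCESS,
    "Restart server status": PROCESS,
    "Locator LED status": ON_OFF,
    "Is restarted?": {"No": 0, "Yes": 1},
}


def numeric_value(measure, state):
    """Numeric equivalent plotted in the graph for a reported state."""
    return STATE_VALUES[measure][state]


def state_name(measure, value):
    """Reverse lookup: the state that a graphed numeric value stands for."""
    for name, number in STATE_VALUES[measure].items():
        if number == value:
            return name
    raise ValueError(f"{value} is not a valid value for {measure}")


print(numeric_value("Fan status", "Warning"))  # 90
print(state_name("Cell status", 0))            # Offline
```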

Battery charge on disk controller

Indicates the percentage of battery charge on the disk controller.

Percent

A sudden or gradual decrease in the value of this measure indicates that the battery of the disk controller is depleting and may need to be recharged or replaced.

Temperature of disk controller

Indicates the current temperature of the disk controller.

Celsius

The temperature of the disk controller should always be maintained within the admissible range. A sudden or gradual increase in the temperature results in overheating of the disk controller and can eventually cause the storage server to malfunction.

Temperature of cell

Indicates the current temperature of the storage cell.

Celsius

Ideally, the value of this measure should be within the admissible range.

A sudden or gradual increase in the value of this measure results in overheating of the storage cell and can eventually cause the storage cell to malfunction.

Physical memory utilization

Indicates the overall percentage of physical memory utilized by the storage cell.

Percent

A value close to 100 is a cause of concern and warrants further investigation.

Physical memory utilized by cell server

Indicates the percentage of physical memory utilized by the cell server.

Percent

A high value for this measure indicates that the cell server is consuming too much physical memory.

Physical memory utilized by management server

Indicates the percentage of physical memory utilized by the management server.

Percent

A high value for this measure indicates that the management server is consuming too much physical memory.

Swap memory usage

Indicates the percentage of swap memory utilized by the storage cell.

Percent

 

Virtual memory utilized by cell server

Indicates the amount of virtual memory utilized by the cell server.

GB

 

Total memory utilized by management server

Indicates the total amount of memory utilized by the management server.

GB

 

CPU utilization

Indicates the percentage of CPU utilized by the storage cell.

Percent

A value close to 100 is a cause of concern.

CPU utilized by cell server

Indicates the percentage of CPU utilized by the cell server.

Percent

A high value for this measure indicates that the cell server is consuming too many CPU resources.

CPU utilized by management server

Indicates the percentage of CPU utilized by the management server.

Percent

A high value for this measure indicates that the management server is consuming too many CPU resources.
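The per-server memory and CPU measures above are meant to help identify which server is consuming too many resources. The sketch below shows one way such a comparison might be made from the reported percentages. The 80 percent threshold and the function name are assumptions for illustration only; the test itself does not define what counts as a high value.

```python
def heavy_consumers(readings, threshold_pct=80.0):
    """Return the servers at or above the threshold, heaviest first.

    readings maps a server name to a utilization percentage, for example
    the values of 'CPU utilized by cell server' and
    'CPU utilized by management server'.
    """
    flagged = [name for name, pct in readings.items() if pct >= threshold_pct]
    return sorted(flagged, key=lambda name: readings[name], reverse=True)


cpu_readings = {"cell server": 91.5, "management server": 12.0}
print(heavy_consumers(cpu_readings))  # ['cell server']
```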