Google Cloud Composer Details Test
Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow. It enables you to create, schedule, monitor, and manage workflow pipelines that span clouds and on-premises data centers. Cloud Composer can also be used to build and manage data warehousing tasks, machine learning model training workflows, and other jobs in which complex sequences of tasks need to be orchestrated and automated. A workflow represents a series of tasks for ingesting, transforming, analyzing, or utilizing data, and is defined as a Directed Acyclic Graph (DAG). A DAG is built and run using Apache Airflow, and each task in the DAG can represent any function, such as preparing data for ingestion, monitoring an API, sending an email, or running a pipeline.

To run workflows, an environment must be created first. Because Airflow depends on many micro-services, Cloud Composer provisions Google Cloud components to run the workflows; these components are collectively known as a Cloud Composer environment. As already mentioned, workflow tasks are orchestrated to run in a complex sequence or in parallel with other tasks, with cascading impacts on workflow operations. Identifying delays in running tasks, excessive resource utilization and workload, and errors in Cloud Composer workflows is therefore critical to ensure that Cloud Composer activities are performed as anticipated. This can be easily achieved using the Google Cloud Composer Details test.
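The workflows described above are defined as Airflow DAGs in Python. The snippet below is a minimal, illustrative DAG with two dependent tasks; the DAG name, schedule, and commands are arbitrary placeholders and are not part of this test's configuration.

```python
# Illustrative Airflow DAG (Airflow 2.x): two Bash tasks where "load" runs only
# after "extract" succeeds. All names and commands are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_ingest_pipeline",   # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'pull data'")
    load = BashOperator(task_id="load", bash_command="echo 'load data'")

    extract >> load   # defines the task dependency within the DAG
```

Once such a DAG file is placed in the environment's bucket, Airflow parses it and schedules its tasks; measures such as Dag bag size, Total parse time, and Parse error count in this test reflect that parsing and scheduling activity.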
This test monitors the Composer instances in the Google Cloud project and reports the current status and resource utilization of each instance. The test also sheds light on the task processing capability of each instance, helping administrators take corrective action if tasks are processed slowly.
Note:
This test will report metrics only if the Cloud Composer API is enabled in the target Google Cloud project. To know how to turn on a service API in a Google Cloud project, refer to Enabling Service APIs.
Target of the test : Google Cloud
Agent deploying the test : A remote agent
Outputs of the test : One set of results for each composer instance in the Google Cloud project
Parameters | Description
---|---
Test Period | How often should the test be executed. By default, this is set to 180 minutes.
Host | The host for which the test is to be configured.
Get Location Mins | Specify the maximum time duration, in minutes, within which this test should connect to the instances across the various regions and report metrics. By default, this parameter is set to 60 minutes.
Private Keyfile Name | To connect to the Google Cloud project in which the services are running, the eG agent requires the private key of a service account with the Compute Viewer, Monitoring Viewer, and Cloud Asset Viewer roles in the target project. If such a service account already exists in the project, download its private key as a JSON file, save the JSON file in the <eG_Install_Dir>/agent/lib folder, and provide the name of that file against this parameter. If no such service account exists, you will have to create one for monitoring the project. To know how to create a service account and download its private key, refer to How does eG Enterprise Monitor Google Cloud?. An illustrative sketch of how such a key file can be used to connect to the target project appears after this table.
Detailed Diagnosis | To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, choose the Off option. The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled: the eG manager license should allow the detailed diagnosis capability, and both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
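The snippet below is a minimal sketch, not the eG agent's actual implementation, of how a service-account key file such as the one configured against Private Keyfile Name can be used to authenticate against the target project and list its Cloud Composer environments. The key file name, project ID, and location are placeholders.

```python
# Sketch: authenticate with a service-account JSON key and list the Cloud Composer
# environments in a project. Requires the Cloud Composer API to be enabled.
from google.oauth2 import service_account
from googleapiclient.discovery import build

credentials = service_account.Credentials.from_service_account_file(
    "my-monitoring-key.json",  # hypothetical key file, as saved under <eG_Install_Dir>/agent/lib
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

composer = build("composer", "v1", credentials=credentials)
parent = "projects/my-project/locations/us-central1"  # placeholder project and region

response = composer.projects().locations().environments().list(parent=parent).execute()
for env in response.get("environments", []):
    # Each entry carries the environment's name, state, and create/update times --
    # the kind of details surfaced by this test's Status measure and its detailed diagnosis.
    print(env["name"], env.get("state"), env.get("createTime"))
```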
Measurement | Description | Measurement Unit | Interpretation
---|---|---|---
Status | Indicates the current status of this Cloud Composer instance. | | The values that this measure reports and their corresponding numeric values are listed below: Note: This measure reports the Measure Values listed in the table above to indicate the current status of the Cloud Composer. In the graph of this measure, however, the same is indicated using the numeric equivalents only. The detailed diagnosis of this measure reveals the geographic location of each composer, the time stamp at which the composer instance was created, and the time at which the composer instance was last updated.
Composer API requests | Indicates the number of API requests received by this composer. | Number | 
Composer API call latency | Indicates the latency experienced by this composer while processing API call requests. | | A very low value is desired for this measure.
Parse error count | Indicates the number of errors that occurred while parsing the DAG files in this composer. | Number | Ideally, the value of this measure should be zero; a non-zero value is a cause for concern.
DAG parsing processes | Indicates the number of currently running DAG parsing processes in this composer. | Number | 
Processors timeout count | Indicates how many times the processors timed out while processing requests on this composer. | Number | 
Total parse time | Indicates the total time taken to parse the DAG files in this composer. | Seconds | 
Dag bag size | Indicates the number of DAGs deployed to the bucket of your environment and processed by Airflow at a given time. | Number | A sudden or gradual increase in the value of this measure indicates an increase in workload, which may lead to performance degradation. You can use this value to analyze performance bottlenecks caused by excessive workload.
CPU cores reserved | Indicates the number of CPU cores reserved in this composer. | Number | 
CPU time | Indicates the amount of time that the CPU was busy processing requests. | Seconds | 
CPU utilization | Indicates the percentage of CPU utilized by this composer. | Percent | 
Disk usage | Indicates the amount of disk space used by this composer. | MB | Compare the value of this measure across the composer instances to find out which instance utilized the maximum amount of disk space.
Disk quota | Indicates the amount of disk space that this composer can utilize. | MB | 
Disk utilization | Indicates the percentage of disk space used by this composer. | Percent | A value close to 100 is a cause for concern.
Memory usage | Indicates the amount of memory used by this composer. | MB | Compare the value of this measure across the composer instances to find out which instance utilized memory excessively.
Maximum quota | Indicates the amount of memory that this composer can utilize. | MB | 
Memory utilization | Indicates the percentage of memory utilized by this composer. | Percent | A value close to 100 is a cause for concern.
Received bytes | Indicates the amount of data received by the database of this composer. | MB | Compare the values of the Received bytes and Sent bytes measures across the composer instances to identify the instance that received/sent the maximum/least amount of data.
Sent bytes | Indicates the amount of data sent from the database of this composer. | MB | 
Is database healthy? | Indicates the current health status of the Airflow database in this composer. | | The values that this measure reports and their corresponding numeric values are listed below: Note: This measure reports the Measure Values listed in the table above to indicate the current status of the database on each composer. In the graph of this measure, however, the same is indicated using the numeric equivalents only.
Executor open slots | Indicates the number of executor slots that are currently open on this composer. | Number | Executor slots refer to the number of concurrent tasks that can be executed simultaneously by the Airflow scheduler across all worker nodes in the environment.
Executor queued tasks | Indicates the number of tasks that are currently queued for execution by the executor on this composer. | Number | A high value for this measure indicates that the executor is overwhelmed, and additional resources, such as more worker nodes or larger instance types, may be required to handle the current workload efficiently.
Executor running tasks | Indicates the number of tasks that are currently being executed by the executor on this composer. | Number | If the value of this measure increases consistently, it indicates that the environment is operating near its total capacity, and you may need to allocate additional resources to handle the workload efficiently.
Finished task instance count | Indicates the number of task instances that completed execution during the last measurement period. | Number | A task instance represents the execution of a specific task within a DAG (Directed Acyclic Graph), which is a workflow in Airflow. This measure is a good indicator of the execution status and completion rate of tasks within the workflows.
Is composer healthy? | Indicates the current health status of this composer. | | The values that this measure reports and their corresponding numeric values are listed below: Note: This measure reports the Measure Values listed in the table above to indicate the current status of each composer. In the graph of this measure, however, the same is indicated using the numeric equivalents only.
Celery workers | Indicates the number of Celery workers that are currently available on this composer. | Number | Celery workers are processes responsible for executing tasks in Celery, the distributed task queue framework used by Apache Airflow and Google Cloud Composer. They enable parallel and distributed execution of tasks, improve scalability and throughput, and ensure the reliability of workflow execution in distributed environments. A low value for this measure may indicate that the composer does not have sufficient Celery workers to execute its tasks, and additional resources should be allocated to handle task execution.
Scheduler heartbeats | Indicates the number of periodic signals sent by the scheduler component of this composer to indicate that it is alive and operational. | Number | The value of this measure is used to assess the health and availability of the scheduler on each composer. A low value for this measure is an indication of potential issues with the scheduler.
Task queue length | Indicates the number of tasks that are currently queued and awaiting execution by the Celery workers on this composer. | Number | If the value of this measure is consistently high, it may indicate that the composer is under heavy load and that additional resources are needed to handle the workload efficiently. On the other hand, if the value is consistently low, it may indicate that resources are underutilized.
Unfinished task instances | Indicates the number of task instances within a workflow that did not complete execution during the last measurement period. | Number | By tracking the value of this measure, you can identify the composer on which tasks are taking longer than expected to complete or have encountered issues during execution.
Web server CPU reserved cores | Indicates the number of CPU cores reserved for the web server component of this composer. | Number | 
Web server CPU usage time | Indicates the amount of CPU time consumed by the web server component of this composer during the last measurement period. | Seconds | A high value for this measure may be a cause for concern.
Is web server healthy? | Indicates the current health status of the web server component of this composer. | | The values that this measure reports and their corresponding numeric values are listed below: Note: This measure reports the Measure Values listed in the table above to indicate the current status of the web server component of each composer. In the graph of this measure, however, the same is indicated using the numeric equivalents only.
Web server memory usage | Indicates the amount of memory used by the web server component of this composer. | MB | Compare the value of this measure across the composer instances to find out which instance's web server component utilized the maximum memory resources.
Web server memory quota | Indicates the amount of memory that the web server component of this composer can utilize. | MB | 
Maximum airflow workers | Indicates the maximum number of Airflow workers that this composer can utilize to execute concurrent tasks. | Number | 
Minimum airflow workers | Indicates the minimum number of Airflow workers that this composer should contain to execute concurrent tasks. | Number | 
Worker pod eviction count | Indicates the number of times a worker pod was evicted from this composer during the last measurement period. | Number | 
Workers scale factor target | Indicates the scaling factor target of this composer. | Number | Using the value reported for this measure, administrators can automatically scale the number of workers in the environment. The value of this measure is calculated based on the current number of workers, the number of Celery tasks in the Celery queue that are not assigned to a worker, and the number of idle workers.
Zombie tasks killed | Indicates the number of tasks that were terminated or killed because they were in a zombie state. | Number | A zombie task is a task that is stuck in an unexpected or undefined state, such as being queued indefinitely without being picked up for execution, or running indefinitely without completing or failing.
Workflow runs | Indicates the number of workflow runs initiated by a DAG in this composer. | Number | 
Workflow run duration | Indicates the total time taken for the execution of workflow runs in this composer. | Seconds | 
Workflow tasks | Indicates the number of workflow tasks running in this composer. | Number | 
Workflow tasks duration | Indicates the total time taken for the execution of workflow tasks in this composer. | Seconds | 
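Many of the measures above correspond to Cloud Monitoring metrics published under the composer.googleapis.com namespace (for example, environment/healthy and environment/database_health). The sketch below shows how such a time series could be read with the Cloud Monitoring Python client; it is illustrative only, not the eG agent's implementation, and the project ID is a placeholder.

```python
# Sketch: read the last hour of the Composer "environment/healthy" time series
# with the Cloud Monitoring client library.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"  # placeholder project ID

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 3600},  # last hour
    }
)

results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "composer.googleapis.com/environment/healthy"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    if series.points:
        # Resource labels identify the Composer environment; the newest point is first.
        print(dict(series.resource.labels), series.points[0].value.bool_value)
```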