Google Cloud Composer Details Test
Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow. It enables you to create, schedule, monitor, and manage workflow pipelines that span clouds and on-premises data centers. Cloud Composer can also be used to build and manage data warehousing tasks, machine learning model training workflows, and other jobs in which complex sequences of tasks need to be orchestrated and automated. A workflow represents a series of tasks for ingesting, transforming, analyzing, or utilizing data, and is defined as a Directed Acyclic Graph (DAG). A DAG is built and run using Apache Airflow, and each task in the DAG can represent any function, such as preparing data for ingestion, monitoring an API, sending an email, or running a pipeline.

To run workflows, an environment must be created first. Because Airflow depends on many micro-services, Cloud Composer provisions Google Cloud components to run the workflows; these components are collectively known as a Cloud Composer environment. As already mentioned, workflow tasks are orchestrated to run in a complex sequence or in parallel with other tasks, with cascading impacts on workflow operations. Identifying delays in running tasks, excessive resource utilization and workload, and errors in Cloud Composer workflows is therefore critical to ensure that Cloud Composer activities are performed as anticipated. This can be easily achieved using the Google Cloud Composer Details test.
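The workflows described above are defined as Airflow DAGs in Python. The snippet below is a minimal, illustrative DAG with two dependent tasks; the DAG name, schedule, and commands are arbitrary placeholders and are not part of this test's configuration.

```python
# Illustrative Airflow DAG (Airflow 2.x): two Bash tasks where "load" runs only
# after "extract" succeeds. All names and commands are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_ingest_pipeline",   # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'pull data'")
    load = BashOperator(task_id="load", bash_command="echo 'load data'")

    extract >> load   # defines the task dependency within the DAG
```

Once such a DAG file is placed in the environment's bucket, Airflow parses it and schedules its tasks; measures such as Dag bag size, Total parse time, and Parse error count in this test reflect that parsing and scheduling activity.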
This test monitors the Composer instances in the Google Cloud project and reports the current status and resource utilization of each instance. The test also sheds light on the task processing capability of each instance, helping administrators take corrective action if tasks are processed slowly.
Note:
This test will report metrics only if the Cloud Composer API is enabled in the target Google Cloud project. To know how to turn on a service API in a Google Cloud project, refer to Enabling Service APIs.
Target of the test : Google Cloud
Agent deploying the test : A remote agent
Outputs of the test : One set of results for each composer instance in the Google Cloud project
Parameters | Description
---|---
Test Period | How often should the test be executed. By default, this is set to 180 minutes.
Host | The host for which the test is to be configured.
Get Location Mins | Specify the maximum time duration, in minutes, within which this test should connect to the instances across the various regions and report metrics. By default, this parameter is set to 60 minutes.
Private Keyfile Name | To connect to the Google Cloud project in which the services are running, the eG agent requires the private key of a service account with the Compute Viewer, Monitoring Viewer, and Cloud Asset Viewer roles in the target project. If such a service account already exists in the project, download its private key as a JSON file, save the JSON file in the <eG_Install_Dir>/agent/lib folder, and provide the name of that file against this parameter. If no such service account exists, you will have to create one for monitoring the project. To know how to create a service account and download its private key, refer to How does eG Enterprise Monitor Google Cloud?. An illustrative sketch of how such a key file can be used to connect to the target project appears after this table.
Detailed Diagnosis | To make diagnosis more efficient and accurate, eG Enterprise embeds an optional detailed diagnostic capability. With this capability, the eG agents can be configured to run detailed, more elaborate tests as and when specific problems are detected. To enable the detailed diagnosis capability of this test for a particular server, choose the On option. To disable the capability, choose the Off option. The option to selectively enable/disable the detailed diagnosis capability will be available only if the following conditions are fulfilled: the eG manager license should allow the detailed diagnosis capability, and both the normal and abnormal frequencies configured for the detailed diagnosis measures should not be 0.
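The snippet below is a minimal sketch, not the eG agent's actual implementation, of how a service-account key file such as the one configured against Private Keyfile Name can be used to authenticate against the target project and list its Cloud Composer environments. The key file name, project ID, and location are placeholders.

```python
# Sketch: authenticate with a service-account JSON key and list the Cloud Composer
# environments in a project. Requires the Cloud Composer API to be enabled.
from google.oauth2 import service_account
from googleapiclient.discovery import build

credentials = service_account.Credentials.from_service_account_file(
    "my-monitoring-key.json",  # hypothetical key file, as saved under <eG_Install_Dir>/agent/lib
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

composer = build("composer", "v1", credentials=credentials)
parent = "projects/my-project/locations/us-central1"  # placeholder project and region

response = composer.projects().locations().environments().list(parent=parent).execute()
for env in response.get("environments", []):
    # Each entry carries the environment's name, state, and create/update times --
    # the kind of details surfaced by this test's Status measure and its detailed diagnosis.
    print(env["name"], env.get("state"), env.get("createTime"))
```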
Measurement | Description | Measurement Unit | Interpretation
---|---|---|---
Status | Indicates the current status of this Cloud Composer instance. | | The values that this measure reports and their corresponding numeric values are listed below: Note: This measure reports the Measure Values listed in the table above to indicate the current status of the Cloud Composer. In the graph of this measure, however, the same is indicated using the numeric equivalents only. The detailed diagnosis of this measure reveals the geographic location of each composer, the time stamp at which the composer instance was created, and the time at which the composer instance was last updated.
Composer API requests | Indicates the number of API requests received by this composer. | Number | 
Composer API call latency | Indicates the latency experienced by this composer while processing API call requests. | | A very low value is desired for this measure.
Parse error count | Indicates the number of errors that occurred while parsing the DAG files in this composer. | Number | Ideally, the value of this measure should be zero; a non-zero value is a cause for concern.
DAG parsing processes | Indicates the number of currently running DAG parsing processes in this composer. | Number | 
Processors timeout count | Indicates how many times the processors timed out while processing requests on this composer. | Number | 
Total parse time | Indicates the total time taken to parse the DAG files in this composer. | Seconds | 
Dag bag size | Indicates the number of DAGs deployed to the bucket of your environment and processed by Airflow at a given time. | Number | A sudden or gradual increase in the value of this measure indicates an increase in workload, which may lead to performance degradation. You can use this value to analyze performance bottlenecks caused by excessive workload.
CPU cores reserved | Indicates the number of CPU cores reserved in this composer. | Number | 
CPU time | Indicates the amount of time that the CPU was busy processing requests. | Seconds | 
CPU utilization | Indicates the percentage of CPU utilized by this composer. | Percent | 
Disk usage | Indicates the amount of disk space used by this composer. | MB | Compare the value of this measure across the composer instances to find out which instance utilized the maximum amount of disk space.
Disk quota | Indicates the amount of disk space that this composer can utilize. | MB | 
Disk utilization | Indicates the percentage of disk space used by this composer. | Percent | A value close to 100 is a cause for concern.
Memory usage | Indicates the amount of memory used by this composer. | MB | Compare the value of this measure across the composer instances to find out which instance utilized memory excessively.
Maximum quota | Indicates the amount of memory that this composer can utilize. | MB | 
Memory utilization | Indicates the percentage of memory utilized by this composer. | Percent | A value close to 100 is a cause for concern.
Received bytes | Indicates the amount of data received by the database of this composer. | MB | Compare the values of the Received bytes and Sent bytes measures across the composer instances to identify the instance that received/sent the maximum/least amount of data.
Sent bytes | Indicates the amount of data sent from the database of this composer. | MB | 
Is database healthy? | Indicates the current health status of the Airflow database in this composer. | | The values that this measure reports and their corresponding numeric values are listed below: Note: This measure reports the Measure Values listed in the table above to indicate the current status of the database on each composer. In the graph of this measure, however, the same is indicated using the numeric equivalents only.
Executor open slots | Indicates the number of executor slots that are currently open on this composer. | Number | Executor slots refer to the number of concurrent tasks that can be executed simultaneously by the Airflow scheduler across all worker nodes in the environment.
Executor queued tasks | Indicates the number of tasks that are currently queued for execution by the executor on this composer. | Number | A high value for this measure indicates that the executor is overwhelmed, and additional resources, such as more worker nodes or larger instance types, may be required to handle the current workload efficiently.
Executor running tasks | Indicates the number of tasks that are currently being executed by the executor on this composer. | Number | If the value of this measure increases consistently, it indicates that the environment is operating near its total capacity, and you may need to allocate additional resources to handle the workload efficiently.
Finished task instance count | Indicates the number of task instances that completed execution during the last measurement period. | Number | A task instance represents the execution of a specific task within a DAG (Directed Acyclic Graph), which is a workflow in Airflow. This measure is a good indicator of the execution status and completion rate of tasks within the workflows.
Is composer healthy? | Indicates the current health status of this composer. | | The values that this measure reports and their corresponding numeric values are listed below: Note: This measure reports the Measure Values listed in the table above to indicate the current status of each composer. In the graph of this measure, however, the same is indicated using the numeric equivalents only.
Celery workers | Indicates the number of Celery workers that are currently available on this composer. | Number | Celery workers are processes responsible for executing tasks in Celery, the distributed task queue framework used by Apache Airflow and Google Cloud Composer. They enable parallel and distributed execution of tasks, improve scalability and throughput, and ensure the reliability of workflow execution in distributed environments. A low value for this measure may indicate that the composer does not have sufficient Celery workers to execute its tasks, and additional resources should be allocated to handle task execution.
Scheduler heartbeats | Indicates the number of periodic signals sent by the scheduler component of this composer to indicate that it is alive and operational. | Number | The value of this measure is used to assess the health and availability of the scheduler on each composer. A low value for this measure is an indication of potential issues with the scheduler.
Task queue length | Indicates the number of tasks that are currently queued and awaiting execution by the Celery workers on this composer. | Number | If the value of this measure is consistently high, it may indicate that the composer is under heavy load and that additional resources are needed to handle the workload efficiently. On the other hand, if the value is consistently low, it may indicate that resources are underutilized.
Unfinished task instances | Indicates the number of task instances within a workflow that did not complete execution during the last measurement period. | Number | By tracking the value of this measure, you can identify the composer on which tasks are taking longer than expected to complete or have encountered issues during execution.
Web server CPU reserved cores | Indicates the number of CPU cores reserved for the web server component of this composer. | Number | 
Web server CPU usage time | Indicates the amount of CPU time consumed by the web server component of this composer during the last measurement period. | Seconds | A high value for this measure may be a cause for concern.
Is web server healthy? | Indicates the current health status of the web server component of this composer. | | The values that this measure reports and their corresponding numeric values are listed below: Note: This measure reports the Measure Values listed in the table above to indicate the current status of the web server component of each composer. In the graph of this measure, however, the same is indicated using the numeric equivalents only.
Web server memory usage | Indicates the amount of memory used by the web server component of this composer. | MB | Compare the value of this measure across the composer instances to find out which instance's web server component utilized the maximum memory resources.
Web server memory quota | Indicates the amount of memory that the web server component of this composer can utilize. | MB | 
Maximum airflow workers | Indicates the maximum number of Airflow workers that this composer can utilize to execute concurrent tasks. | Number | 
Minimum airflow workers | Indicates the minimum number of Airflow workers that this composer should contain to execute concurrent tasks. | Number | 
Worker pod eviction count | Indicates the number of times a worker pod was evicted from this composer during the last measurement period. | Number | 
Workers scale factor target | Indicates the scaling factor target of this composer. | Number | Using the value reported for this measure, administrators can automatically scale the number of workers in the environment. The value of this measure is calculated based on the current number of workers, the number of Celery tasks in the Celery queue that are not assigned to a worker, and the number of idle workers.
Zombie tasks killed | Indicates the number of tasks that were terminated or killed because they were in a zombie state. | Number | A zombie task is a task that is stuck in an unexpected or undefined state, such as being queued indefinitely without being picked up for execution, or running indefinitely without completing or failing.
Workflow runs | Indicates the number of workflow runs initiated by a DAG in this composer. | Number | 
Workflow run duration | Indicates the total time taken for the execution of workflow runs in this composer. | Seconds | 
Workflow tasks | Indicates the number of workflow tasks running in this composer. | Number | 
Workflow tasks duration | Indicates the total time taken for the execution of workflow tasks in this composer. | Seconds | 
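Many of the measures above correspond to Cloud Monitoring metrics published under the composer.googleapis.com namespace (for example, environment/healthy and environment/database_health). The sketch below shows how such a time series could be read with the Cloud Monitoring Python client; it is illustrative only, not the eG agent's implementation, and the project ID is a placeholder.

```python
# Sketch: read the last hour of the Composer "environment/healthy" time series
# with the Cloud Monitoring client library.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"  # placeholder project ID

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 3600},  # last hour
    }
)

results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "composer.googleapis.com/environment/healthy"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    if series.points:
        # Resource labels identify the Composer environment; the newest point is first.
        print(dict(series.resource.labels), series.points[0].value.bool_value)
```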