Monitoring the Alibaba Cloud

As mentioned previously, eG Enterprise offers a specialized Alibaba Cloud model for monitoring a configured Alibaba Cloud account.

Figure 5: Layer model of the Alibaba Cloud component

Each layer that is part of this model is mapped to tests that pull a variety of metrics revealing the composition, health, and monthly cost of instances and services that the target account subscribes to. Using these metrics, administrators can find quick and accurate answers for the following performance queries:

  • Is the cloud available? If so, how quickly is it responding to web requests?
  • How many regions and zones are managed by the monitored cloud account? Which are they? Is any zone unavailable currently?
  • Has any SSL certificate expired?
  • Are any SSL certificates nearing expiry? Which ones are they, and which cloud services will be impacted by the expiry?
  • Is any ECS instance powered off? If so, which instance is it?
  • Were any ECS instances recently added/removed from the monitored cloud account? Which ones are they?
  • Is any ECS instance in an abnormal state currently?
  • Is any ECS instance consuming bandwidth resources excessively? If so, when does that instance consume the most bandwidth - when processing incoming traffic? or outgoing traffic? What type of traffic is consuming the most bandwidth - internet traffic? or intranet traffic?
  • Is any ECS instance consuming CPU resources excessively?
  • Is any ECS instance experiencing a high level of disk I/O? If so, then what type of operations are I/O-intensive - read operations? or write operations?
  • Is any burstable ECS instance spending too many CPU credits?
  • Are too many CPU processes of any ECS instance waiting for I/O operations to complete?
  • Is any ECS instance very busy processing its workload?
  • Are all ECS instances sized with adequate memory resources?
  • Are any Server Load Balancer (SLB) instances inactive or locked?
  • Is any SLB instance managing too many faulty backend ECS instances?
  • Is any SLB instance experiencing a connection overload? Have any connections to this instance been dropped as a result? If there are idle connections to this instance, can they be terminated to prevent the connection drops?
  • Have data/packets in transit been dropped by any SLB instance? If so, which instance is it?
  • Is any SLB instance responding slowly to HTTP/S queries?
  • Is any SLB instance returning many error responses to HTTP/S requests?
  • What is the cloud spend for the current month? Is it higher than last month's spend? If so, which service, region, and instance contributed to the escalation in cost?
  • Is the Alibaba Content Delivery Network (CDN) service functioning optimally? Are there accelerated domains for which configuration or checking has failed?
  • Which domain is overloaded with content acceleration requests? Are a majority of these requests serviced by the cached resources on the CDN nodes? or by origin servers?
  • Is any RDS instance in an abnormal health state presently?
  • Is any RDS instance locked?
  • Are all RDS instances available?
  • Is any RDS instance using up disk space excessively? What type of files are hogging the disk space - data files? log files? cold backups? SQL data?
  • Is any RDS instance overloaded with I/O requests?
  • Is any RDS instance close to exhausting its connection capacity?
  • Are all RDS instances sized with adequate CPU and memory resources?
  • Is any Redis instance unavailable currently?
  • Were any errors noticed on any Redis instance?
  • Were query processing latencies noticed in any Redis instance? Is it because that instance has not been granted adequate query processing power?
  • Were backup failures captured on any Redis instance?
  • Is any Redis instance running out of memory and/or CPU resources?
  • Is there a Redis instance that has exhausted or is about to exhaust its connection capacity?
  • Which Redis instance is using up bandwidth excessively and when - when performing read operations? or write operations?
  • Is any Redis instance performing reads and/or writes slowly?
  • Is any MySQL instance processing queries at a lethargic pace? What type of queries/statements are contributing to this slowness - insert statements? delete statements? update statements? select statements? replace_select statements? replace statements?
  • Is the buffer pool of any MySQL instance under-utilized?
  • Are too many dirty data blocks found in the buffer pool of any MySQL instance?
  • Are latencies noticed when writing to or reading from the buffer pool of any MySQL instance? If so, which instance is it?
  • Are fsync() calls slow on any MySQL instance?
  • Are SQL Server instances on the cloud processing queries quickly? If not, which instance is experiencing slowness in query processing? How is the buffer cache usage on that instance? Is the sluggish query processing because most queries are serviced by direct disk accesses and not by the buffer cache?
  • Is any SQL Server instance performing many full table scans?
  • Are frequent lock timeouts occurring on any SQL Server instance?
  • Were deadlock conditions noticed on any SQL Server instance?
  • Is any SQL Server instance experiencing frequent lock waits?
  • Is any SQL Server instance utilizing storage space excessively? If so, which one is it?

The topics below discuss each layer of Figure 5, the tests mapped to it, and the measures reported by it.

The Alibaba Infrastructure Layer

The Alibaba Network Layer

The Alibaba Storage Layer

The Alibaba Elastic Computing Layer

The Alibaba Database Layer

The Alibaba Management and Monitoring Layer

The Alibaba Billing Layer