Monitoring the Microsoft Azure Subscription

eG Enterprise provides a specialized Microsoft Azure Subscription model for monitoring a single Azure subscription.

Figure 4: Layer model of a Microsoft Azure Subscription component

Each layer in Figure 4 is mapped to tests that report on the health, availability, and performance of key services offered by an Azure subscription, and on the usage of resources allocated to that subscription. Using these metrics, administrators can find quick and accurate answers to the following performance queries:

Monitoring Category

What is Revealed?

Cloud availability

  • Is the Microsoft Azure cloud accessible over the network?

  • If so, how quickly is it responding to requests?

Subscription status and activity levels

  • Is the monitored Azure Subscription enabled?

  • If so, which GEO locations and resources have been allocated to this subscription?

  • Are any of these resources inactive?

  • Has any resource used up its usage quota, or is any resource about to exhaust it? Which resource is this? Where is this resource located and what is the resource provider?

Service status

Are any services in a Critical/Warning state? Which services are they?

Cloud spend

  • Which service has been billed the highest this month? Why? Is it because one/more resources used by the service have overshot their budget? Which resources are they?

  • Has the target subscription been billed abnormally high this month? If so, which specific meters have been expensive? Which specific resource groups, resources, services, and regions are contributing to these mounting costs?

Activity Logs

Have any critical/warning/error events been logged in the Activity Logs?

Azure Backup Service

  • Did any backup jobs triggered by the Azure Backup service fail?

  • Which jobs failed?

  • Which jobs have been running for too long a time?

Azure Batch Service

  • Is any instance of the Azure Batch Service in an Error state currently?

  • Are any compute nodes managed by the Azure Batch Service instance not usable for running tasks? If so, why?

Azure Cosmos DB

  • Is the database service delivered by any Azure Cosmos DB account unavailable currently?

  • Did any Azure Cosmos DB account respond to HTTP requests with errors? What type of errors were thrown and why?

  • Were requests throttled by an Azure Cosmos DB account owing to the lack of sufficient throughput? Is the account over-utilizing the provisioned throughput? If so, which type of operations are contributing to this - query? update? delete? insert? count? or others?

  • Were any request processing latencies noticed in any Azure Cosmos DB account?

  • Is any Azure Cosmos DB account running out of storage space? If so, what type of objects are hogging storage - data objects? or index objects?

  • Did any request to the Azure Cosmos DB account fail? If yes, then what type of requests failed the most - query? update? delete? insert? count? or others?

  • Did any Azure Cosmos DB account fail to meet its consistency guarantees?

Azure Event Hubs

  • Is any Event hub in the Failed state currently?

  • Have requests to any Event hub failed recently?

  • Has any Event hub encountered errors? If so, what type of errors?

Azure Firewall

  • Is any Azure Firewall in a Degraded state currently, owing to excessive SNAT port usage?

  • Did traffic traversing any Azure Firewall match one/more of the firewall rules? If so, which rules were hit?

Azure IoT hubs

  • Have devices attached to any IoT hub rejected/abandoned cloud-to-device commands?

  • Are too many devices disconnected from an IoT hub?

  • Have messages been dropped by any IoT hub?

  • Were messages orphaned / undelivered by any IoT hub?

  • Was any IoT hub abnormally slow in delivering messages to any specific type of endpoint? If so, which endpoint is it - event hub endpoints? service bus queue endpoints? service bus endpoints? built-in endpoints? storage endpoints?

  • Did any IoT hub fail to process twin reads/updates/queries?

  • Did any IoT hub fail to process jobs?

  • Is any IoT hub experiencing throttling errors? If so, why? Is it because devices are attempting to send more than the permitted number of device-to-cloud telemetry messages to this hub?

Azure Key Vaults

  • Is any Key Vault processing more requests than the rest? If so, what type of requests are contributing to this workload - requests for secrets? keys? certificates? or others?

  • Is any Key Vault unduly lethargic in processing requests? If so, what type of requests are being processed very slowly?

  • Have operations performed on any Key Vault failed?

Azure NetApp File Service

  • Are volumes in any NetApp capacity pool over-utilizing the storage capacity provisioned to the pool? If so, are snapshots hogging the space in the volumes?

  • Is any NetApp capacity pool about to exhaust its provisioned throughput?

  • Have all NetApp volumes been allocated adequate storage space, or is any volume running out of space? If it is the latter, then is the space crunch because of snapshots?

  • Is any volume experiencing I/O processing bottlenecks?

  • Is it taking an abnormally long time to replicate data from any volume?

Azure Redis Cache

  • Is any Azure Redis Cache slow in responding to requests?

  • Does any Azure Redis Cache have a poor hit ratio? If so, what could be causing that cache to perform so poorly - is it because the load on the cache server is high? are memory usage and memory fragmentation on the cache abnormally high?

  • Has any Azure Redis Cache blocked and/or rejected client connections to it?

  • Is the CPU usage of any Azure Redis Cache unusually high?

  • Is the bandwidth usage of all Azure Redis caches optimal?

Azure Service Bus

  • Is any Azure Service Bus overloaded with connections? If so, should the connection capacity of that bus be reset?

  • Are requests to any Azure Service Bus failing often? If so, what could be causing this - server errors? or user errors?

  • Are requests to any Azure Service Bus getting throttled because of the lack of enough throughput?

  • Were memory contentions noticed on any Azure Service Bus queue?

  • Has any Azure Service Bus queue moved its messages to the dead-letter queue?

Azure SQL Database

  • Is any Azure SQL Database instance using CPU excessively?

  • Is there a storage crunch on any Azure SQL Database instance?

  • Are all Azure SQL Database instances sized with sufficient DTUs (Database Transaction Units)?

  • Are connections to any Azure SQL Database instance deadlocked?

  • Is a firewall blocking connections to any Azure SQL Database instance?

Azure Storage

  • Are all Azure Storage accounts tied to the target subscription currently available?

  • Is any Azure Storage account about to run out of free storage space? If so, what type of storage is responsible for the contention - file storage? blob storage? table storage? or queue storage?

  • Is any Azure Storage account unduly slow in successfully responding to requests?

Azure Virtual Machine Scale Sets

  • Is auto-scaling not enabled for any Azure Virtual Machine Scale Set?

  • Are VM instances in any Azure Virtual Machine Scale Set consuming CPU excessively?

  • Are VM instances in the Azure Virtual Machine Scale Sets using bandwidth optimally? What about CPU credits and I/O resources? Are these resources used excessively?

  • In the scaling rules for Azure Virtual Machine Scale Sets, are thresholds for resource usage set according to the demand for the resource?

Azure Virtual Machines

  • Are any Azure Virtual Machines powered off currently? Were any Azure VMs removed recently?

  • Is any Azure Virtual Machine starved for CPU, memory, disk, network, and/or I/O resources?

Azure VPN Gateways

  • Has abnormal bandwidth usage been noticed on any Azure VPN Gateway? If so, what type of connections are hogging bandwidth - Point-to-Site connections? or Site-to-Site connections?

  • Are tunnels on any VPN gateway consuming bandwidth close to the Aggregate Throughput Benchmark of that gateway? Should this benchmark be reset?

  • Did tunnels on any VPN gateway drop mismatched packets?

  • Have any failure events been captured by any VPN Gateway's diagnostic logs? What are those events?

Azure App Service

  • Are users complaining that web applications deployed using the Azure App Service are not responding quickly to web requests? If so, which application's responsiveness is poor?

  • What is the cause of the sluggish response of a web application - is it because of TCP connection delays? is it because of poor throughput of the application? or is it because of processing delays experienced by the backend server of the application?

  • Were any HTTP error responses captured on any web application? What type of errors were these?

Logic Apps

  • Did workflow runs for any Azure Logic App fail?

  • Is any Azure Logic App slow in running workflows? If so, why - is it because of lethargic trigger execution? or latent actions?

  • Did any Azure Logic App fail to execute triggers or perform actions?

  • Were any triggers/actions throttled because of low throughput?

Recovery Services Vaults

  • Were any Critical or Warning alerts generated on any Azure Recovery Services Vault?

  • Does any Azure Recovery Services Vault contain VMs in a Critical/Warning state?

  • Did any backup and/or recovery jobs performed by a vault fail?
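
Several of the questions listed above, such as whether a resource is about to exhaust its usage quota or whether a cache has a poor hit ratio, come down to comparing a reported value against a limit. The Python sketch below illustrates that idea only; it is not how eG Enterprise computes its measures. The function names, the 80%/95% thresholds, and the sample figures are all assumptions made for this example.

```python
def quota_state(current_usage, limit, warn_pct=80.0, crit_pct=95.0):
    """Classify quota usage as Normal, Warning, or Critical.

    The default thresholds are hypothetical, chosen only for illustration.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    used_pct = 100.0 * current_usage / limit
    if used_pct >= crit_pct:
        return "Critical"
    if used_pct >= warn_pct:
        return "Warning"
    return "Normal"


def cache_hit_ratio(hits, misses):
    """Return the percentage of lookups that were served from the cache."""
    total = hits + misses
    if total == 0:
        return 0.0  # no lookups yet, so there is nothing to report
    return 100.0 * hits / total


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    print(quota_state(18, 20))        # 18 of 20 units used, i.e. 90% of the quota
    print(cache_hit_ratio(750, 250))  # 750 hits out of 1000 lookups
```

In practice, the usage, limit, hit, and miss figures would come from the metrics reported for the monitored subscription, and the thresholds would be tuned to the environment.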

Click on the links below to learn about each layer of Figure 4 and the tests mapped to it.

The Azure Infrastructure Layer

The Azure Network Services Layer

The Azure Storage Layer

The Azure Compute Layer

The Azure Data Services Layer

The Internet of Things Layer

The Enterprise Integration Layer

The Azure Billing Layer