Using Azure Monitor for Monitoring Azure Virtual Desktop (AVD) and Estimating Your Costs
In my last blog, I wrote about how to set up Azure Monitor for WVD leveraging a Log Analytics Workspace. In this next installment I’ll cover how you can proceed from this point and add in thresholds on metrics and signals that when crossed will generate alerts which can trigger actions. As with most things in Azure, alerting is essentially PAYG so I will also cover some guidance on how you can start to estimate the costs of monitoring and alerting.
Since my last blog Microsoft have changed the name of WVD (Windows Virtual Desktop) to AVD (Azure Virtual Desktop). An excellent overview of the reasons behind, and implications of, the change is available from Vadim Vladimirskiy at Nerdio.
Once you have set up a Log Analytics Workspace you will be able to set up and view various Dashboards or Workbooks in Azure Monitor, or use other methods of visualizing and collating the data – for example, Power BI and Grafana are some of the integrations available. Azure dashboards are free to use so I am not going to cover them in this blog; they are well documented in the Microsoft documentation, here. My ultimate goal is to work out how to estimate what my end costs and Azure bill will be for monitoring and alerting, hence the emphasis on PAYG features.
In the previous blog, I started with an existing host pool named “UKGROUP” containing two WVD (now AVD) session hosts within a resource group “egwvd” located in the Azure region “East US 2”, I connected them all into a Log Analytics Workspace named “Rachels-LogAnalyticsWorkspace” and then added a 3rd session host. This is the configuration I will cover adding monitoring and alert thresholds to, today.
What you can monitor and raise alerts on with Azure Monitor
At this point, it is worth familiarizing yourself with the Azure Monitor Documentation. From the home page, you can navigate through the documentation tree on the left-hand side to “How-to guides” and then “Alerts”. Be aware that in addition to “How-to guides -> Alerts” you will probably want to familiarize yourself with sections like “How-to guides -> VM Insights -> Alerts” and possibly even “How-to guides -> Application Insights -> Alerts” (Azure’s APM monitoring).
For the AVD host pool, in my last blog we configured collection of the default Performance counters and Windows event logs. Log alerts can be associated with any of these metrics and log events.
There are also two types of log alerts in Azure Monitor associated with VM Insights (VM Insights being an optional agent you can install within the VMs in your AVD host pool, covered in my last blog):
- Number of results alerts create a single alert when a query returns at least a specified number of records. These are ideal for non-numeric data such as Windows and Syslog events collected by the Log Analytics agent or for analyzing performance trends across multiple computers.
- Metric measurement alerts create a separate alert for each record in a query that has a value that exceeds a threshold defined in the alert rule. These alert rules are ideal for performance data collected by VM insights since they can create individual alerts for each computer.
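The difference between the two alert types can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not how Azure Monitor evaluates queries; the `Record` type, the host names and the sample values are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One row returned by a (hypothetical) log query."""
    computer: str
    value: float


def number_of_results_alert(records, min_count):
    """Return a single alert for the whole result set if it has at least min_count rows."""
    if len(records) >= min_count:
        return [f"{len(records)} records matched"]
    return []


def metric_measurement_alerts(records, threshold):
    """Return one alert per record whose value exceeds the threshold."""
    return [f"{r.computer}: {r.value} > {threshold}" for r in records if r.value > threshold]


# Made-up query results for three session hosts.
records = [Record("host-1", 97.0), Record("host-2", 42.0), Record("host-3", 93.5)]

single = number_of_results_alert(records, min_count=3)
per_host = metric_measurement_alerts(records, threshold=90.0)
```

Here `single` holds one alert covering the whole result set, while `per_host` holds a separate alert for each of the two hosts above the threshold.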
Beyond this, if you are using Application Insights, there are alerting options with the smart detection module, such as Failure Anomalies, that it may be helpful to review.
The default Performance counter metrics (as of June 2021) for a WVD / AVD host are:
| Object name | Counter name | Instance | Interval (seconds) |
|---|---|---|---|
| LogicalDisk | % Free Space | C: | 60 |
| LogicalDisk | Avg. Disk Queue Length | C: | 30 |
| LogicalDisk | Avg. Disk sec/Transfer | C: | 60 |
| LogicalDisk | Current Disk Queue Length | C: | 30 |
| Memory | % Committed Bytes In Use | * | 30 |
| PhysicalDisk | Avg. Disk sec/Read | * | 30 |
| PhysicalDisk | Avg. Disk sec/Transfer | * | 30 |
| PhysicalDisk | Avg. Disk sec/Write | * | 30 |
| PhysicalDisk | Avg. Disk Queue Length | * | 30 |
| Processor Information | % Processor Time | _Total | 30 |
| RemoteFX Network | Current TCP RTT | * | 30 |
| RemoteFX Network | Current UDP Bandwidth | * | 30 |
| Terminal Services | Active Sessions | * | 60 |
| Terminal Services | Inactive Sessions | * | 60 |
| Terminal Services | Total Sessions | * | 60 |
| User Input Delay per Process | Max Input Delay | * | 30 |
| User Input Delay per Session | Max Input Delay | * | 30 |
In the initial release, Microsoft included some per-process metrics, these were subsequently removed as they were collected for every application for every user and formed 80% of the data (and hence cost), see Updated guidance on Azure Monitor for WVD for details.
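To get a feel for the data volume behind the default counter set, here is a rough Python sketch that counts how many samples one host would emit per day. It assumes exactly one instance per counter, which is a simplification: counters with a `*` instance can report many instances, so treat the result as a lower bound rather than a prediction.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# (object name, counter name, collection interval in seconds), copied from the default set above.
DEFAULT_COUNTERS = [
    ("LogicalDisk", "% Free Space", 60),
    ("LogicalDisk", "Avg. Disk Queue Length", 30),
    ("LogicalDisk", "Avg. Disk sec/Transfer", 60),
    ("LogicalDisk", "Current Disk Queue Length", 30),
    ("Memory", "% Committed Bytes In Use", 30),
    ("PhysicalDisk", "Avg. Disk sec/Read", 30),
    ("PhysicalDisk", "Avg. Disk sec/Transfer", 30),
    ("PhysicalDisk", "Avg. Disk sec/Write", 30),
    ("PhysicalDisk", "Avg. Disk Queue Length", 30),
    ("Processor Information", "% Processor Time", 30),
    ("RemoteFX Network", "Current TCP RTT", 30),
    ("RemoteFX Network", "Current UDP Bandwidth", 30),
    ("Terminal Services", "Active Sessions", 60),
    ("Terminal Services", "Inactive Sessions", 60),
    ("Terminal Services", "Total Sessions", 60),
    ("User Input Delay per Process", "Max Input Delay", 30),
    ("User Input Delay per Session", "Max Input Delay", 30),
]


def samples_per_day(counters, instances_per_counter=1):
    """Count samples per host per day, assuming every counter has the same number of instances."""
    return sum(
        (SECONDS_PER_DAY // interval) * instances_per_counter
        for _object, _counter, interval in counters
    )


baseline = samples_per_day(DEFAULT_COUNTERS)
print(f"{baseline} samples per host per day with one instance per counter")
```

Even this minimal case comes to tens of thousands of records per host per day, which helps explain why the per-process counters dominated the ingested data.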
Some metrics that alerts can be set on are aggregated values, and alerts can be triggered by their constituent dimensions; billing, however, is per dimension. There is some fairly heavyweight documentation available on this. It is something to be aware of if you are trying to evaluate the possible costs, and we will cover more on it later.
The costs of monitoring the Azure Virtual Desktop Service
Estimating Azure Monitor costs for AVD is somewhat difficult. Microsoft Support themselves have answered this in their Support Q&A “How to get Cost estimates for WVD Monitoring – Microsoft Q&A” by referring to a third-party blog from Tunecom consultant Jannick Dils – “How to keep control over your Windows Virtual Desktop Insights logs and costs – Tunecom”. It is an article both stunning in its depth and usefulness and breathtaking in the complexity of the steps that Jannick took to reach an estimate of costs of around $14 a month per session host, after calculating that a VM would generate 14.5GB a month to be ingested. Jannick turned to Kusto queries to obtain some of the key variables needed to make this estimate, which is no easy task. Beyond this, he also covers some key caveats where it would be very easy to miscalculate costs by orders of magnitude. Many key session metrics can be expected to have multiple instances, and a variable number of them; for example, per-process metrics such as “% Processor Time” could easily have 50 instances associated with different processes. Additionally, metrics are by default configured to be collected at either 30 or 60 second intervals, which complicates calculations further.
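The arithmetic behind those figures can be sketched as follows. The only inputs taken from Jannick's article are the 14.5GB and roughly $14 per host per month; the price per GB is simply what those two numbers imply, not a quoted Azure price, and the function names are my own.

```python
# Figures quoted from Jannick Dils' estimate.
GB_PER_HOST_PER_MONTH = 14.5
COST_PER_HOST_PER_MONTH = 14.0

# Implied ingestion price; an inference from the two figures above, not an official rate.
implied_price_per_gb = COST_PER_HOST_PER_MONTH / GB_PER_HOST_PER_MONTH


def pool_monthly_cost(hosts, gb_per_host, price_per_gb):
    """Scale a per-host ingestion estimate up to a whole host pool."""
    return hosts * gb_per_host * price_per_gb


def samples_per_month(interval_seconds, instances, days=30):
    """Samples emitted by one counter in a month, given its interval and instance count."""
    return (days * 24 * 60 * 60 // interval_seconds) * instances


three_host_pool = pool_monthly_cost(3, GB_PER_HOST_PER_MONTH, implied_price_per_gb)
one_instance = samples_per_month(30, instances=1)
fifty_instances = samples_per_month(30, instances=50)

print(f"Implied price: ${implied_price_per_gb:.2f}/GB, 3-host pool: ${three_host_pool:.2f}/month")
print(f"30s counter: {one_instance} samples/month, with 50 instances: {fifty_instances}")
```

The second half shows why instance counts matter so much: a single per-process counter with 50 instances produces 50 times the records of a single-instance counter at the same interval.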
In practice, for many metrics you will need, as Jannick did, to collect some real-world data for an extended length of time to obtain meaningful averages for metrics that vary significantly and/or have fluctuating numbers of instances. The need to invest time to set up monitoring and then run it for a significant length of time to figure out whether it is an affordable monitoring option is somewhat of a “chicken and egg” situation. You will also want to consider at this point what guest metrics you wish to collect within a VM, see: Monitor Azure virtual machines with Azure Monitor – Azure Monitor | Microsoft Docs. In particular, it is likely that you will wish to enable “VM Insights”, see: Enable VM insights overview – Azure Monitor | Microsoft Docs. (I covered setting up VM insights in my previous blog).
Within a Log Analytics Workspace there is a “Usage and estimated costs” tab from where you can click “Data Retention”. Here you can indicate how long you want to store the data in the Log Analytics Workspace. By default, this is 30 days, but you can increase this to a maximum of 730 days. Users needing to change this will also have to factor in the additional storage costs, which increase with the number of days; the first 30 days of retention are free but beyond that there is a per GB retention cost per day. If you start fiddling with the interval on metrics like the Performance counters you will of course change the data volume and hence the Log Analytics data costs.
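As a rough illustration of how retention feeds into cost, the sketch below works out how much stored data falls outside the free period once the workspace has been running long enough to reach a steady state. The 30 free days and the 730 day maximum come from the text above; the daily ingestion figure is invented, and the actual retention price is deliberately left out because it should be taken from the current Azure price list.

```python
FREE_RETENTION_DAYS = 30   # free period as described above
MAX_RETENTION_DAYS = 730   # maximum retention as described above


def billable_retained_gb(daily_ingest_gb, retention_days):
    """Steady-state volume of stored data that is older than the free retention period."""
    if retention_days > MAX_RETENTION_DAYS:
        raise ValueError(f"retention cannot exceed {MAX_RETENTION_DAYS} days")
    billable_days = max(0, retention_days - FREE_RETENTION_DAYS)
    return daily_ingest_gb * billable_days


# Hypothetical workspace ingesting 0.5 GB a day.
default_retention = billable_retained_gb(0.5, 30)
ninety_days = billable_retained_gb(0.5, 90)
print(f"Billable retained data: {default_retention} GB at 30 days, {ninety_days} GB at 90 days")
```

Multiplying the billable volume by whatever retention rate applies to your region then gives an estimate of the ongoing retention charge.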
Now this is all very well, but it is likely that you are also going to want to trigger alerts based upon threshold values for key metrics and signals, and in turn actions triggered by those alerts (e.g., email the Azure administrator). These are all also paid-for features: you will pay a fee for each alert you set. If you want to go further and integrate alerts with ticketing systems such as ServiceNow there are also associated costs. At this point you probably want to think about how you will manage alerts and actions and what integrations you need.
Setting up my first alert
At this point I decided to set up some alerts based on thresholds on the simple Performance counter metrics, e.g. if VMs or hosts reached 90% of capacity. Note that this metric was configured in the default set to be collected every 30 seconds.
Navigate via Azure -> Monitor and select Alerts (quick link, here). From here you can manage existing alert rules or add new alert rules. An Alert Rule consists of 4 parts:
- Scope: What resource / resources to apply the test to e.g. a single VM, a pool of VMs etc
- Condition: This includes the metric or signal to be tested and what conditions (the logic) are defined to trigger an alert. The metrics you can select from will be filtered by what you chose as the “Scope” and you will need to be prepared to add a sensible value for thresholds of any numeric metric you choose.
- Actions: send notifications or invoke actions when the alert rule triggers, by selecting or creating a new action group
- Alert Rule Details: Including a name and description for the Alert Rule and a severity level
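It can help to think of those four parts as a simple data structure. The Python sketch below is only a mental model with field names of my own choosing; it is not the Azure SDK or the ARM template schema, and the action group name is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Condition:
    signal: str       # metric or signal to test, e.g. "% Processor Time"
    operator: str     # e.g. "GreaterThan"
    threshold: float  # value that triggers the alert


@dataclass
class AlertRule:
    scope: str                  # resource the rule applies to
    condition: Condition        # what is tested and when it fires
    action_groups: List[str] = field(default_factory=list)  # who or what gets notified
    name: str = ""              # alert rule details: name, description and severity
    description: str = ""
    severity: int = 3


rule = AlertRule(
    scope="Rachels-LogAnalyticsWorkspace",
    condition=Condition(signal="% Processor Time", operator="GreaterThan", threshold=90.0),
    action_groups=["example-action-group"],  # hypothetical action group name
    name="High CPU on AVD hosts",
    severity=2,
)
```

Writing a rule down like this before opening the portal makes it easier to keep scope, condition, actions and details consistent across many rules.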
The easiest way to get started and get a feel for the alert interface and capabilities is probably by adding a simple alert triggered by a basic numeric metric value crossing a threshold value within the Condition of the Alert Rule. I decided to add an alert on one of the AVD hosts if the performance counter “% Processor Time” exceeds 90%.
Once you navigate to Alerts (quick link, here), you can filter on resource groups and resources. I found it was easier to follow what I was exploring by ignoring the filters for now and clicking directly on New Alert Rule.
Now we will define the scope. This is where you have to be very careful, as all Azure resources will be an option, and there are options such as Host pools and Virtual Machines which will take you to the same hosts in your WVD / AVD pool but a completely different set of signals and metrics. You need to remember that we are trying to configure alerts that are triggered from the aggregated data within the Log Analytics Workspace which was set up in my earlier blog, namely “Rachels-LogAnalyticsWorkspace”. Here I set the Scope to “Rachels-LogAnalyticsWorkspace”, by selecting it and clicking Done.
Now that the Scope is set, we can explore how to set thresholds for the metric “% Processor Time”. Click or hover over Add condition, and the Select a signal window will appear.
Here I filtered on the Signal type to see metrics and then selected the “% Processor Time” by clicking on it.
Once you click on a specific signal or metric, you will see the Configure signal logic window, with a graph of the metric. I had set the pool hosts up as test hosts, so they have had no real users, logged-on sessions or applications running other than background processes and services, and the recent “% Processor Time” has hovered around ~1.5% only, hardly a realistic load.
Note that this is an aggregated metric, i.e., the graph is showing the CPU usage averaged across all the hosts in the pool. At this point we need to consider what we wish to be alerted to. If we consider 90% CPU usage the threshold at which a host may have an issue, and 1 host is at 100% while 2 hosts are at 55%, the average across the pool would only be 70% [(100+55+55)/3], so I think setting thresholds on this aggregated metric would mean no alert would be triggered. Ideally, I’d want to set a threshold on every single host.
Now scroll down and you will find the option to Split by dimension. Here we can split this metric into the data for each individual host. If you explore the options under Dimension name, you will see many are essentially internal parts of the underlying XML data schema which holds the data, although there is usually one human-readable dimension, in this case “Computer”, which then allowed me to select the three hosts in my WVD / AVD pool. Note how I was informed that this is 3 separate metrics for the purposes of billing.
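The averaging problem is easy to reproduce. This small Python sketch uses the same figures as the example above (one host at 100%, two at 55%) with made-up host names, and compares a threshold on the pool average with a threshold on each host.

```python
THRESHOLD = 90.0

# CPU usage per host, using the figures from the example above; host names are placeholders.
cpu_by_host = {"host-1": 100.0, "host-2": 55.0, "host-3": 55.0}

# Aggregated view: one value for the whole pool.
pool_average = sum(cpu_by_host.values()) / len(cpu_by_host)
pool_alert = pool_average > THRESHOLD

# Per-host view: one value, and so one possible alert, per host.
hosts_over_threshold = [name for name, cpu in cpu_by_host.items() if cpu > THRESHOLD]

print(f"Pool average {pool_average:.0f}% -> alert on aggregate: {pool_alert}")
print(f"Hosts over {THRESHOLD:.0f}%: {hosts_over_threshold}")
```

The aggregate never crosses 90%, while the per-host check picks out the one saturated host, which is the behavior you actually want an alert for.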
In the next section, we can choose to set a Static or a Dynamic threshold. To set a Static threshold of 90%, I need to set the Operator to “Greater than” or “Greater than or equal”. There are also various fields associated with granularity and aggregation. I set my threshold thus:
Note that the console displays “Monitoring 3 time series ($0.1/time series)”, i.e. for every static threshold set Azure will bill you $0.10 a month per time series. If you wish to set a minimum threshold too, e.g. an alert if your usage falls below 10%, you will need to set a separate Alert which will also be billed at $0.10 (for a metric such as processor usage that normally wouldn’t make sense, but for some signals it would).
Now at this point, rather than click Done, I decided to explore the Dynamic Thresholding capabilities by changing the threshold to Dynamic.
Note how the cost per threshold increases to “Monitoring 3 time series ($0.2/time series)”; all Dynamic thresholds are charged at $0.20 per time series. If you scroll up now you will see the thresholds imposed on the metric graph:
At this point I thought “well fair enough on the $0.20” as the Operator was set to “Greater or less than”, i.e. two thresholds were involved, but on changing this to “Greater than” I found that it makes no difference: the price is the same regardless of whether you use a single upper or lower bound or both.
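Putting the prices shown in the console into a quick Python sketch makes the scaling clearer. The $0.10 and $0.20 rates are the ones displayed to me at the time and may well differ for you; amounts are kept in whole cents to avoid floating point rounding noise.

```python
# Rates per monitored time series per month, as displayed in the console at the time of writing.
STATIC_CENTS_PER_SERIES = 10
DYNAMIC_CENTS_PER_SERIES = 20


def monthly_alert_cost_cents(time_series, dynamic=False, rules=1):
    """Monthly cost in cents for `rules` alert rules, each monitoring `time_series` series."""
    rate = DYNAMIC_CENTS_PER_SERIES if dynamic else STATIC_CENTS_PER_SERIES
    return rules * time_series * rate


static_upper_only = monthly_alert_cost_cents(3)                # one static rule, 3 hosts
static_upper_and_lower = monthly_alert_cost_cents(3, rules=2)  # separate rule for the lower bound
dynamic_either_or_both = monthly_alert_cost_cents(3, dynamic=True)

for label, cents in [
    ("Static, upper bound only", static_upper_only),
    ("Static, upper and lower", static_upper_and_lower),
    ("Dynamic, one or both bounds", dynamic_either_or_both),
]:
    print(f"{label}: ${cents / 100:.2f} per month")
```

On these rates, two static rules covering an upper and a lower bound cost the same as one dynamic rule covering both, so the choice between them comes down to behavior rather than price.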
Overall, the thresholding options are reasonable, with options to set windowing thresholds and control the sensitivity of dynamic thresholds, but you do need to have a reasonable understanding of threshold types. Many traditional EUC sys admin type tools are based on infrastructure monitoring paradigms and are limited to static thresholds: if a server hits 99% CPU for 10 data points in a row, raise an alert. Dynamic thresholds and auto-baselining are more common in AIOps (Artificial Intelligence for IT Operations) platform-based monitoring such as eG Enterprise, where they are a key component of automated root cause analysis and anomaly detection. AIOps platforms use analytics and machine learning to learn what is normal for a system, factoring in historical usage, time of day, seasonality and usual user behavior. In the case of CPU usage, protracted 80% usage may not in itself be an infrastructure issue, but if 5% is more normal at 3am the IT administrator will want to investigate, as it could be a symptom of a security breach or of a new rogue batch job added to the system.
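The classic static rule mentioned above, "99% CPU for 10 data points in a row", can be written in a few lines of Python. This is a generic sketch of that style of check, not the logic of any particular monitoring product.

```python
def breaches_in_a_row(samples, threshold, required):
    """Return True if `required` consecutive samples are at or above `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run on a breach, otherwise start counting again.
        run = run + 1 if value >= threshold else 0
        if run >= required:
            return True
    return False


sustained = [99.0] * 10
interrupted = [99.0] * 9 + [50.0] + [99.0] * 9

print(breaches_in_a_row(sustained, threshold=99.0, required=10))    # sustained load
print(breaches_in_a_row(interrupted, threshold=99.0, required=10))  # one dip resets the count
```

Requiring several consecutive breaches is a simple way to avoid alerting on momentary spikes, but the threshold itself stays fixed regardless of what is normal for the time of day.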
My little test pool illustrates some of the issues associated with dynamic thresholding: the fact that the pool hosts had minimal load means that the limits for alerts would be set at rather implausible values around my historical ~1.5% usage. Fundamental limitations of Dynamic thresholds are exposed in an excellent TechTarget article by Alistair Cooke in which he covers the fact that “Dynamic thresholds are not as intelligent as people. A dynamic monitoring setup can become confused when cyclic activity doesn’t happen according to usual patterns. For example, the support staff will get an alert that system load is low on a public holiday, because the users are at the beach instead of at their desks creating load.”.
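To see why an idle history produces implausible limits, here is a deliberately simple baseline in Python: a band of the mean plus or minus three standard deviations. Azure's dynamic thresholds use their own machine learning model, which I am not reproducing here; the sample history is invented to resemble my ~1.5% usage.

```python
import statistics


def dynamic_band(history, k=3.0):
    """Return (low, high) limits as mean +/- k population standard deviations."""
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)
    return mean - k * spread, mean + k * spread


# Invented CPU history for an idle test host, hovering around 1.5%.
idle_history = [1.2, 1.5, 1.8, 1.4, 1.6]
low, high = dynamic_band(idle_history)

# A single user logging on might push CPU to 5%, which is far outside the learned band.
first_real_load = 5.0
flagged = not (low <= first_real_load <= high)

print(f"Learned band: {low:.1f}% to {high:.1f}%, 5% flagged as anomaly: {flagged}")
```

A baseline learned from an idle pool treats perfectly ordinary usage as an anomaly, which is why the history needs to reflect realistic load before dynamic thresholds are trusted.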
Both static and dynamic thresholding have limitations and enterprise monitoring solutions such as eG Enterprise combine both static and dynamic thresholds on many individual metrics and use strategies such as auto-static thresholding to overcome the limitations and avoid consequences such as false-positive alerts and event / alert storms.
If choosing Dynamic thresholds on Azure, it would be wise to run your system for a while, so the baselines are set based on realistic usage. For some metrics upper or lower limits make little sense for alerts, e.g. for CPU usage a host being idle is rarely of interest, so at this point I decided to return to the static alert as configured above, by clicking Done.
Having done this we return to the main pane. Note that because I chose to monitor multiple hosts (dimensions) I am now informed that “Alert rules that monitor multiple dimensions can include only one condition.”. This means that if I did want to set a lower static threshold on all the hosts in the pool, I would need to set it up in an additional alert rule, or I could choose to monitor the bounds of each host in their own separate conditions rather than all three together (the latter probably makes more sense but is tedious to set up manually).
Now click on “Add action groups” and the Add action groups pane will appear, now click on Create action group.
Select the resource group and choose a sensible name for the action group. Note that the displayed name will only be 12 chars long, which in an enterprise scenario could make it quite difficult to have meaningful names and schemas for alerts to indicate what resource they are associated with.
Do not click Review and create yet; instead use the tabs to set the properties of the action. On the Notification tab I chose to send an email to the “Azure Resource Manager Role”. At this point you may get asked if you want to enable the common schema. Historically logs, metrics and other signals have had different formats for alerts, and this option enables them to use a common new format. If you are setting up your Azure environment from scratch you will probably want to enable the common schema, but since I was using a shared sandpit environment I left things as they were because I do not know whether others are relying on the old format. Click OK.
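If you do settle on a naming convention, a small helper can at least stop names silently exceeding the 12 character limit. The `pool-purpose` pattern below is purely a hypothetical convention for illustration; only the 12 character limit comes from the portal.

```python
MAX_DISPLAY_NAME_LENGTH = 12  # display name limit noted above


def display_name(pool, purpose):
    """Build a 'pool-purpose' display name, refusing anything over the length limit."""
    candidate = f"{pool}-{purpose}"
    if len(candidate) > MAX_DISPLAY_NAME_LENGTH:
        raise ValueError(
            f"'{candidate}' is {len(candidate)} characters; the limit is {MAX_DISPLAY_NAME_LENGTH}"
        )
    return candidate


ok_name = display_name("UKGROUP", "cpu")

try:
    display_name("UKGROUP", "memory")
    too_long_rejected = False
except ValueError:
    too_long_rejected = True
```

With a host pool name as short as “UKGROUP” there are only four characters left for the purpose, which shows how quickly the limit bites.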
Now choose a name, again hopefully meaningful.
Beyond sending notifications when the metric threshold conditions are crossed, users can also associate a set of actions with the alert. At this point you might want to review the Actions tab; it is on this tab that integrations with ITSM service helpdesk applications can be configured to set up automated Jira/ServiceNow help desk tickets. There are costs with many of the actions/integrations, so you will need to investigate further if you are planning a helpdesk-like integration. At this point I chose not to add actions beyond notifications and selected “Review and Create”.
Review and hit Create.
Bizarrely, at this point, I was returned to a main dashboard with no sign of my new Alert rule. So, I had to repeat all my previous steps; however, when I came to set up the Action group again, I found it had been created and so I didn’t need to repeat that step. This type of thing seemed to happen occasionally throughout my encounters with Azure.
Finally, I set the Alert rule details and decided to set a warning level of 2 for the Severity and clicked Create alert rule.
I was now returned to the Alert rule pane, selecting the Manage alert rules tab, I could finally see my created alert.
By default, in eG Enterprise we configure multiple warning levels and thresholds associated with a metric, so at this point I wanted to do similar but I’m still unsure how to do this in Azure. Often, there is a need to set different threshold levels to map to different levels of severity of problems. The eG Enterprise system offers three levels of thresholds that correspond to the three alarm priorities – Critical, Major, and Minor. The user can specify three maximum and/or three minimum threshold values in the format: Critical/Major/Minor, which allows for escalation and granularity. So far, the only way I’ve found to do this is to create a whole new alert just with a different severity level, which would leave me with an alert storm issue as I could get a Critical/Major and Minor alert for the same event.
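The difference between tiered thresholds and several independent alert rules can be sketched in Python. The `Critical/Major/Minor` string format follows the description above, but the parsing code, the 95/90/80 values and the function names are my own illustration, not eG Enterprise's or Azure's implementation.

```python
SEVERITIES = ("Critical", "Major", "Minor")


def parse_thresholds(spec):
    """Parse a 'Critical/Major/Minor' maximum-threshold string into (severity, limit) pairs."""
    limits = [float(part) for part in spec.split("/")]
    if len(limits) != len(SEVERITIES):
        raise ValueError("expected three values in the form Critical/Major/Minor")
    return list(zip(SEVERITIES, limits))


def tiered_alert(value, spec):
    """Return only the most severe level that the value has reached, or None."""
    for severity, limit in parse_thresholds(spec):
        if value >= limit:
            return severity
    return None


def independent_rule_alerts(value, spec):
    """Mimic separate alert rules per severity: every rule whose threshold is crossed fires."""
    return [severity for severity, limit in parse_thresholds(spec) if value >= limit]


cpu = 96.0
single_alert = tiered_alert(cpu, "95/90/80")
storm = independent_rule_alerts(cpu, "95/90/80")
print(f"Tiered: {single_alert}; independent rules: {storm}")
```

With tiering, a 96% reading produces one Critical alert; with three independent rules the same reading raises all three severities at once.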
Checking the costs of alerts
Now, assuming you follow the process above and set up alerts for every metric you need, you can check the overall costs via the Monitor blade (UI pane), scrolling down to find Usage and estimated costs.
How much is your time worth?
At this point I gave up as it was just getting too complicated and time consuming and concluded that Rob Beekmans was correct that the best way would just be to run the analytics for a month and keep an eye on the costs whilst doing so.
All the components are there to build your own monitoring and alerting system, but you really have to build it before you can work out if it is sufficient, let alone how much it will cost. I experienced so many challenges doing this, even on my teeny tiny pool of single figures of hosts with a couple of VMs on each, that I admitted defeat, although the exercise has given me a good idea of the challenges and things to consider if you are brave enough to try. The challenges I see:
- To create something anyone else could use and maintain I would have to define templates, naming conventions (there are a lot of freeform text fields) and processes to enforce where there are none, the idea of documenting this is overwhelming.
- I’m not sure how an organization could easily enforce or audit any written guidelines
- The volume of signals, logs and metrics is vast, and many have names that are frankly near meaningless; I didn’t have sufficient knowledge to know what key metrics would alert me to problems and I was too scared of racking up a huge Azure bill to turn everything on and work that out retrospectively
- Adding alerts is a highly manual and time-consuming exercise. I imagine you can somehow automate this via some script voodoo magic, but again it wouldn’t be something you could easily hand over to a helpdesk administrator
- The lack of tiered alerting – to recreate the multiple alert levels (minor, major, critical) we have in eG I think I would have to create multiple alerts on the same metric at different
- There’s no root cause analysis or alert filtering/correlation built into the alert system.
- There’s no obvious mechanism to combine Static and Dynamic thresholds as offered by enterprise monitoring tools, so whilst there is a degree of anomaly detection available tuning and avoiding alert storms will be extremely tricky.
- I thought someone else must have already done this; however, I found remarkably little helpful information regarding automating the setup of Monitor alerts. Stanislav Zhelyazkov’s blog was rare in that it described how to script the setup.
- The level of analytics available and historical data in Azure Monitor is very good, but I came to the conclusion that Monitor isn’t really a monitoring solution more a monitoring framework (it’s kind of like the Kubernetes of monitoring). Plus, there’s the fundamental challenge that to use Azure Monitor you have to be able to access Azure, how do you know what is going on when users report issues and Azure is down in your region? Monitoring Azure availability and performance within Azure itself alone will not be enough.
- Rob Beekmans was right – “I think one of the issues deploying on cloud is that there is no good prediction of actual costs to come. I’ve been looking into this for a customer and I don’t see how you can know, as a customer, know what your monitoring/analytics bill will be without just running for a month”
For information on eG Innovations’ support for digital workspaces on Azure, please see “Azure Monitoring Tools and Solutions from eG Enterprise”.