Understanding Azure Monitoring Costs

Thomas Stringer has a couple of great blog posts on how to understand your Azure monitoring costs and how to reduce them; see Azure Monitor Log Analytics too Expensive? Part 2 – Save Some Money | Thomas Stringer (trstringer.com). In the past I’ve blogged on How to calculate the Azure Monitor and Log Analytics costs associated with AVD (not an easy task!).

Thomas is a Principal Software Engineer Lead at Microsoft and often authors articles on the cheapest and most cost-effective ways to leverage Azure; his articles answer questions such as “What’s the cheapest way to run Kubernetes in Azure?”.

Thomas lists a handful of useful strategies to reduce your costs, alongside some of the associated pros and cons. In general, this type of cost optimization inevitably comes at the cost of more bespoke scripting or manual work, or of reduced visibility or less sensitive alerting. It really is a balance and a trade-off. The main strategies suggested are:

- Log less data
- Shorter data retention times
- Offload logs to cheaper storage
- Use the commitment tier pricing
- Use fewer workspaces

Now, all of these are valid steps in the right use case; however, they add complexity and may also reduce visibility. One of the main reasons customers turn to a third-party solution like eG Enterprise is to get increased visibility and fewer manual steps at a lower cost than native cloud monitoring offers. Having worked with customers who previously implemented Thomas’ suggestions, I’m going to cover my own thoughts on things to consider.

1. Log less data

This is (perhaps surprisingly) one that most monitoring vendors would concur with... or perhaps that should be “Log the minimum data you need”. Good monitoring platforms with root cause diagnostics and AIOps correlation technologies are carefully configured out of the box to capture data on optimal timescales at optimal frequencies. Sampling metrics every 3 seconds, etc., is pretty dumb as it usually swamps systems with noise and, in the case of cloud, leads to ludicrous data and API costs. However, data collection with cloud native tools is already pretty low by default. Indeed, Microsoft dialed back metric collection for AVD to exclude process information because customers weren’t keen on the costs, leading to the surprising scenario that an administrator can’t routinely find out which are the top 10 applications in use.

Thomas also suggests: “Do you need debug logs in your workspace? Usually, the lower the log level, the higher the number of logs. Understand your requirements and don’t over-log unnecessarily. But what happens if you have an outage or something you need to troubleshoot? Nothing better than those debug logs! Perhaps, create a process to hot set the log level lower (e.g., to debug) temporarily so that you can collect more verbose logs short term.”

This is precisely the scenario that, as a monitoring vendor, we have engineered our products to account for. Thomas is correct: most of the time, debug logs are vast pools of data that you will never need to access, racking up your cloud costs, but when you have an issue you need them. Temporarily turning logging back on is often only appropriate for recurrent or persistent issues and does not allow you to retrospectively diagnose issues such as “why did the web site go down at 2am?” or “why was my desktop slow yesterday?”. In VDI/digital workspace environments, if sessions die, logs are often lost unless they are routinely collected by a process for every user logon. The strategy eG Enterprise and a few other products take is to collect detailed data only when triggered to do so, i.e. when key health check tests indicate there may be a problem or anomalous behavior is detected.
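The general idea of trigger-based collection can be sketched with Python's standard logging module. This is a minimal illustration, not how eG Enterprise or any other product implements it: the class names TriggeredDebugHandler and ListHandler are hypothetical, and the in-memory list merely stands in for a costly log destination. Recent records are held in a small ring buffer and forwarded only when a record at or above a trigger level arrives, so the lead-up to a failure is preserved without paying to ship every debug line.

```python
import logging
from collections import deque


class ListHandler(logging.Handler):
    """Stand-in for an expensive sink (hypothetical, e.g. a cloud log workspace)."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


class TriggeredDebugHandler(logging.Handler):
    """Hold the most recent records in memory; ship them only on a trigger."""

    def __init__(self, target, capacity=200, trigger_level=logging.ERROR):
        super().__init__(level=logging.DEBUG)
        self.target = target
        self.trigger_level = trigger_level
        # A deque with maxlen silently drops the oldest record when full.
        self.recent = deque(maxlen=capacity)

    def emit(self, record):
        self.recent.append(record)
        if record.levelno >= self.trigger_level:
            # Problem detected: forward the buffered context, oldest first.
            while self.recent:
                self.target.handle(self.recent.popleft())


sink = ListHandler()
log = logging.getLogger("triggered-demo")
log.setLevel(logging.DEBUG)
log.propagate = False
log.addHandler(TriggeredDebugHandler(sink, capacity=3))

for step in range(5):
    log.debug("step %d", step)
shipped_before = list(sink.messages)  # still empty: nothing has been sent

log.error("health check failed")
print(sink.messages)  # ['step 3', 'step 4', 'health check failed']
```

The buffer size sets the trade-off: a larger capacity keeps more context for retrospective diagnosis at the price of more memory on the monitored host, while anything older than the buffer is gone for good.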

If you go down the path of reducing log levels, etc., you will probably also want to track ongoing incidents and how much manual labor debugging takes.

2. Shorter data retention times

Thomas suggests returning to or sticking with the default 31-day retention policy for Log Analytics Workspaces if you can. This will probably suffice if you just want to do monitoring for troubleshooting. However, many organizations require data on much longer timescales to plan capacity effectively and to understand usage trends accurately, particularly in highly seasonal businesses, e.g. universities and schools with terms and long vacations.

Many of our Azure customers retain key data for at least a year. Once logs have been probed and the key data extracted, that data is exported out of Azure and kept available for future analysis.

A workspace consists of different tables that store different types of data depending on the data source. By default, all the tables have the same retention as the workspace, but this can be customized so that different tables have different retention periods. Marius Sandbu covers this in depth in this article: https://msandbu.org/changing-log-retention-on-a-specific-table-in-log-analytics/.

3. Offload logs to cheaper storage

This is one trick available for those who want to keep data in Azure beyond 31 days. With a potential 80% cost saving vs leaving the data in the workspace, it is one to consider. Pulling logs back in as required is a bit of a faff, especially if another group is going to need access, e.g. for capacity planning, cost analyses, etc.
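To see where a saving of that order can come from, here is a rough back-of-the-envelope sketch. Both per-GB prices are placeholders chosen for illustration, not real Azure prices, and the 31-day free period is taken from the default mentioned above; export, retrieval and query charges are deliberately ignored, so treat the result as an upper bound on the saving.

```python
def retained_gb(daily_gb, retention_days, free_days=31):
    """Steady-state volume held beyond the free retention period."""
    return daily_gb * max(0, retention_days - free_days)


def monthly_cost(gb, price_per_gb_month):
    """Monthly charge for holding `gb` at a flat per-GB monthly price."""
    return gb * price_per_gb_month


# Hypothetical prices per GB per month: substitute your region's real rates.
WORKSPACE_PRICE = 0.10
STORAGE_PRICE = 0.02

volume = retained_gb(daily_gb=10, retention_days=365)
in_workspace = monthly_cost(volume, WORKSPACE_PRICE)
in_storage = monthly_cost(volume, STORAGE_PRICE)
saving = 1 - in_storage / in_workspace

print(f"{volume} GB held beyond the free period")
print(f"workspace: {in_workspace:.2f}/month, storage: {in_storage:.2f}/month")
print(f"saving: {saving:.0%}")
```

With these placeholder rates the saving depends only on the ratio of the two prices, which is why the percentage stays the same whatever your daily volume is; what changes with volume is whether the absolute amount justifies the extra handling.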

It’s a little irritating that this type of trick is even a thing as it is something cloud providers could automate and it’s simply down to the nuances of cloud pricing.

4. Use the commitment tier pricing

If you are willing to accept vendor lock-in, then committing to a fixed data volume will usually be significantly cheaper than pay-as-you-go. Beyond the lock-in downside, you will have the headache of estimating the volume you need, and if you later realize you are not collecting enough data or are collecting too much, you may have a problem. How you handle those debug logs may also become a liability if you start needing to deep dive frequently.
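The estimation headache is easier to reason about with a break-even calculation. The sketch below uses made-up prices (2.00 per GB pay-as-you-go, and a 100 GB/day tier at 150.00 per day) and assumes that ingestion above the committed volume is billed at the tier's effective per-GB rate; check your own price sheet before relying on either assumption.

```python
def payg_daily_cost(daily_gb, payg_per_gb):
    """Pay-as-you-go: every GB is billed at the same rate."""
    return daily_gb * payg_per_gb


def commitment_daily_cost(daily_gb, tier_gb, tier_daily_price):
    """Commitment tier: a fixed daily price, plus overage at the tier's
    effective per-GB rate (an assumption; verify against your price sheet)."""
    effective_rate = tier_daily_price / tier_gb
    overage_gb = max(0.0, daily_gb - tier_gb)
    return tier_daily_price + overage_gb * effective_rate


def break_even_gb(tier_daily_price, payg_per_gb):
    """Daily volume above which the tier is cheaper than pay-as-you-go."""
    return tier_daily_price / payg_per_gb


# Hypothetical prices, for illustration only.
PAYG_PER_GB = 2.00
TIER_GB, TIER_PRICE = 100, 150.00

print(break_even_gb(TIER_PRICE, PAYG_PER_GB))  # 75.0
for gb in (60, 120):
    print(gb,
          payg_daily_cost(gb, PAYG_PER_GB),
          commitment_daily_cost(gb, TIER_GB, TIER_PRICE))
```

Under these numbers the tier only pays off above 75 GB/day: at 60 GB/day you would be paying for capacity you do not use, while at 120 GB/day the tier is clearly ahead. If your volume hovers around the break-even point, an estimate that is off in either direction costs you money.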

Some workloads will be easier to predict than others. On AWS, RUM monitoring is priced per data item associated with RUM events, which include a page view, a JavaScript error, and an HTTP error. This means that if you have faulty apps, faulty third-party services used by your apps, or a malicious DoS storm on your website, you could potentially generate a vast amount of data and costs. Pricing models where you can end up paying for monitoring data of unknown volume, triggered by factors completely out of your control, are troublesome and, of course, make estimating future usage somewhat harder. I’m not aware of similar pricing on Azure, but it’s one to look out for with any cloud native monitoring tool.

5. Use fewer workspaces

As Thomas highlights: “If you’re using the commitment tier pricing model like mentioned above, it is cost-effective to have fewer workspaces. Per workspace the price per GB goes down as you commit to more data.” This is also a well-accepted strategy recommended by Azure experts, such as Marius Sandbu, see Deep dive Azure Monitor and Log Analytics | Marius Sandbu (msandbu.org).
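A small sketch shows why consolidation helps under tiered pricing. The price list is entirely hypothetical (it is not the Azure price sheet), and overage above a tier is again assumed to be billed at that tier's effective per-GB rate. The point is only that two workspaces which are each too small to justify a tier may qualify for one when merged.

```python
# Hypothetical price list: NOT real Azure prices.
PAYG_PER_GB = 2.00
TIERS = [(100, 150.00), (200, 280.00)]  # (committed GB/day, price per day)


def best_daily_cost(daily_gb):
    """Cheapest option for one workspace: pay-as-you-go or any single tier."""
    options = [daily_gb * PAYG_PER_GB]
    for tier_gb, tier_price in TIERS:
        rate = tier_price / tier_gb  # assumed overage rate for this tier
        options.append(tier_price + max(0.0, daily_gb - tier_gb) * rate)
    return min(options)


# Two workspaces ingesting 60 GB/day each vs one ingesting 120 GB/day.
split = best_daily_cost(60) + best_daily_cost(60)
merged = best_daily_cost(120)
print(split, merged)  # 240.0 180.0
```

Here each 60 GB/day workspace is stuck on pay-as-you-go, whereas the merged 120 GB/day workspace benefits from the 100 GB tier.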

Beyond the cost benefits, it is far easier to correlate data that is all within a single workspace. Of course, other constraints such as security and compliance may dictate the need for multiple workspaces. For most, a single workspace per cloud region makes a lot of sense to ensure compliance with GDPR-type data protection regulations.

Final thoughts

As a final thought, Thomas warns readers against building their own bespoke solution to monitor Azure logs, for a number of reasons. We have certainly seen a few customers go down this route when the costs and functionality gaps drove them beyond the Azure native tools. One of the reasons they have ended up as customers is that, for most organizations, building your own does not make sense vs. an out-of-the-box, no-scripting-needed enterprise solution such as eG Enterprise.
