Cloud Outages do Happen
Over the past few months, I’ve written a couple of blogs analyzing significant Azure outages that affected multiple services. These articles covered detecting cloud outages long before Microsoft confirmed them and provided details of symptoms we saw. You can read these articles about a September 2022 outage and another in January 2023.
In this article I’m going to cover the very serious concerns and issues these raised for many organizations reliant on large cloud providers such as Microsoft Azure and Amazon AWS for significant portions of their IT operations including key services.
The recent outages didn’t take out the odd one or two obscure services, but rather mainstream apps and services that organizations rely on, such as:
- Access to Azure Monitor
- Email services such as Outlook and Exchange
- File sharing services such as SharePoint
- Communication Tools such as Microsoft Teams
And rather worryingly, Microsoft Defender for Cloud Apps, Identity and Endpoint were reported as down, prompting an interesting discussion within the LinkedIn EUC (End User Computing) guru community started by Adam Cooperman from Flexxible (it is worth checking out the comments on Adam’s post at https://www.linkedin.com/feed/update/urn:li:activity:7023973722583625730?updateEntityUrn=urn%3Ali%3Afs_feedUpdate%3A%28V2%2Curn%3Ali%3Aactivity%3A7023973722583625730%29).
Monitoring Tools can be Vulnerable to Cloud Outages
Adam’s comments were thought-provoking. Defender being unavailable highlights the challenges of relying on cloud-hosted services for critical functionality. The same questions can be asked of monitoring services. Your monitoring service may be hosted in Azure, and it could go down. Indeed, Azure Monitor itself was also affected, so those reliant on that native cloud service alone for monitoring were totally in the dark.
Even if your monitoring service itself is up, alerts may not reach you if you use Office 365 for email delivery and it is not working, so you need to consider your notification strategy. eG Enterprise includes several resilience features that can be configured if you want to ensure redundancy. When choosing to rely on any cloud service, you should investigate whether similar resilience features are available to protect you from cloud outages.
For larger organizations and MSPs (Managed Service Providers), communication processes around cloud outages, both internally and to customers, are vital, especially if end users turn to help desk services.
Configuring your Monitoring Tool for Resiliency
Here are some key features of eG Enterprise that you can leverage or configure if you want to ensure that your monitoring is resilient:
- Use eG Manager Redundancy: eG Enterprise provides you with an option to install two eG managers in a redundant configuration. Each manager will have its own database instance for storage. Once you have configured the eG managers in a redundant configuration, the two managers constantly synchronize with each other. All the metrics available in one manager are also available in another, and if one manager goes down (e.g., because the cloud data center used is down), the other manager takes over all the functioning of the manager that has gone down. All the metrics collected during the time a manager is down are saved and transmitted to that manager when it comes back up later. With a redundant configuration of managers, you will never miss any alerts. The two managers can be located in geographically different locations, across cloud providers or between on-prem and cloud data centers. See Data Center and IT Infrastructure Redundancy Management (eginnovations.com) for details.
- Built-in Agent Resiliency – While the eG agent transmits metrics in real-time (as and when they are collected), if there is a communication problem with the eG manager, it stores data locally and resends the stored data when connectivity to the eG manager is restored. This ensures no data loss during communication outages. Details are given in Self-Monitoring and Recovery (eginnovations.com).
- Multiple Mail Servers – Alerting about potential and current problems is an important function of a monitoring tool. If the mail server used for communicating about alerts is slow or is down, email alerts will not go out on time and this could impact your IT operations. In eG Enterprise, remember to configure multiple email servers for mail notifications. For example, you may be using Azure-hosted Office 365 as your primary email server, but you may want to configure AWS SES (Simple Email Service) as your backup mail service. This way, even if O365 is down, the eG manager can send email alerts using AWS SES. For more details, see: Configuring a Backup Mail Server (eginnovations.com).
- Configure Email and SMS or WhatsApp notifications for each user: While email is often the primary mechanism for alert notification, eG Enterprise also supports short message service (SMS) and WhatsApp for notification. To cover for scenarios when users are not able to access their email, you may want to configure user accounts with phone numbers as well so SMS/WhatsApp messages can be sent to users instead. See: Configuring the Mail Alert Settings (eginnovations.com)
- Leverage ITSM integrations with ticketing and service desk tools: Another option is to have alert notifications generate tickets or incidents automatically in your ITSM tools such as ServiceNow. eG Enterprise’s ITSM integrations are full API integrations for tools such as ServiceNow, Freshdesk, Autotask and JIRA, and configuring the integration is very simple. You do not have to write any custom scripts or configure scripts for every metric. The integration is enabled in a few easy steps on the eG manager. See Service and Help Desk Automation Strategies | eG Innovations or Integration with multiple ITSM tools at the same time (eginnovations.com) for more details.
- Configure eG Enterprise to send Heartbeat emails: Some administrators rarely login to the eG Enterprise system, preferring instead to rely on email alerts to be notified of problems. In such situations, if the email system that is used by the eG Enterprise system fails, administrators will not be notified of problems. A “heartbeat” function is supported by eG Enterprise to let administrators know that the email alerting functionality is operational. If you do not receive an email from eG Enterprise at the configured frequency, you know that there is a problem sending emails from eG Enterprise. See: Configuring the Mail Alert Settings (eginnovations.com).
- Publish eG Enterprise dashboards for your IT Ops team: Consider publishing eG Enterprise dashboards for your employees and users to check key applications and service availability. In call center type scenarios, many of our customers use TV / Kiosk mode to publish service overviews on large screens for their employees to reference. Remember, as happened in the recent outage, it can be hours before Microsoft’s official Azure Service Status is updated with outage details.
- Proactively Monitor Business Critical Services and Applications: If O365 and other services or applications (MS Teams, Zoom) are critical to communication within your business, or to your monitoring or ITSM functionality, proactively monitor them. eG Enterprise also offers Enterprise Application modules to monitor O365 applications, Moodle, SAP, SharePoint, MS Teams, Salesforce and more.
- Ensure that you have Synthetic Monitoring configured for all services, especially cloud services: The value of synthetic monitoring cannot be over-emphasized. By checking service availability 24×7, you can be proactively alerted to issues with your key applications and IT services. Furthermore, eG Enterprise’s synthetic monitoring is fully integrated with the rest of its monitoring capabilities, so there is no need to use another web console, or configure another service for monitoring (see: What is Proactive Monitoring and Why it is Important (eginnovations.com)).
- Minimize your Reliance on Cloud Services for Monitoring: Some monitoring tools are little more than some automation and UI components built upon native cloud monitoring services such as Log Analytics and Azure Monitor. eG Enterprise minimizes use of Log Analytics Workspaces (partly to achieve cost savings) and leverages multiple data collection mechanisms to ensure visibility on Azure Outages. A similar strategy is used in the product’s approach to AWS monitoring and CloudWatch.
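To make the backup mail server, SMS fallback and heartbeat ideas above more concrete, here is a minimal, tool-agnostic sketch in Python. It is not eG Enterprise code or its API. The names `send_with_fallback`, `heartbeat_overdue` and the channel names are hypothetical, and the "send" functions are stand-ins for whatever mail or SMS client you actually use.

```python
import logging
from typing import Callable, Optional, Sequence, Tuple

log = logging.getLogger("alert_fallback")

# A channel is a (name, send_function) pair; send_function raises on failure.
# These are illustrative stand-ins, not calls into any real product.
Channel = Tuple[str, Callable[[str], None]]


def send_with_fallback(message: str, channels: Sequence[Channel]) -> Optional[str]:
    """Try each channel in priority order; return the name of the first that works."""
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:  # any failure moves us on to the next channel
            log.warning("channel %s failed: %s", name, exc)
    return None  # every channel failed


def heartbeat_overdue(last_seen: float, interval_s: float, now: float,
                      grace_s: float = 60.0) -> bool:
    """True if no heartbeat has arrived within one interval plus a grace period."""
    return (now - last_seen) > (interval_s + grace_s)


if __name__ == "__main__":
    def primary_mail(msg: str) -> None:
        # Simulates the primary mail service being caught up in an outage.
        raise ConnectionError("primary mail server unreachable")

    delivered = []  # stands in for a working backup channel
    used = send_with_fallback("Disk full on web01", [
        ("primary-mail", primary_mail),
        ("backup-mail", delivered.append),
        ("sms", delivered.append),
    ])
    print(used, delivered)
```

In this simulated run the primary channel fails, so the alert goes out through the backup mail channel and SMS is never tried. The heartbeat check is intended for the receiving side: if the expected email has not arrived within the configured interval plus a grace period, the alerting path itself should be treated as suspect.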
Have an Observability DR Strategy
Whilst many of these issues can be solved by continually forwarding data to a second system or running two different observability tools, this usually doubles the costs. In the cloud, pay-as-you-use pricing for storage, API usage, alerts and data export makes this extremely costly and leaves you with ongoing challenges of data management and data synchronization. Any second commercially supported system is also likely to incur licensing costs. It’s very rare for DR / BC strategies to adopt parallel duplicate systems. Even mission-critical “hot VM” VDI backup systems run on a failover basis, with no expectation that two systems run all of the time when there are no issues.
Good business continuity strategies around observability generally follow similar models to other IT DR and BC methodologies.
GDPR and Data Control Regulation Compliance
Many of our customers are subject to stringent data control regulations and export compliance laws. When using a SaaS monitoring platform, especially one hosted on a public cloud such as Azure or AWS, you should seek formal confirmation of where and how a failover will occur.
If the SaaS platform is hosted in AWS in Germany, you should know where the backup manager would be located in the event of a failure. A backup manager on Azure that is also in Germany would meet most customers’ data compliance needs; a backup manager on AWS in the US probably would not.
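The Germany example can be expressed as a simple check you might run against a vendor's stated failover locations. This is only an illustrative sketch: the `Site` type, the `compliant_failover_sites` function and the rule that a backup should sit with a different provider are assumptions made for the example, not a statement of what any regulation or product requires.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Site:
    provider: str  # e.g. "AWS", "Azure", "on-prem"
    country: str   # country code, e.g. "DE" or "US"


def compliant_failover_sites(primary: Site,
                             candidates: Iterable[Site],
                             allowed_countries: FrozenSet[str],
                             require_other_provider: bool = True) -> List[Site]:
    """Return the candidate backup sites that keep data in an allowed country.

    If require_other_provider is True, sites on the same provider as the
    primary are also excluded, so one provider outage cannot take out both.
    """
    result = []
    for site in candidates:
        if site.country not in allowed_countries:
            continue  # data would leave the permitted jurisdiction
        if require_other_provider and site.provider == primary.provider:
            continue  # shares the primary's provider-level failure domain
        result.append(site)
    return result


if __name__ == "__main__":
    primary = Site("AWS", "DE")
    candidates = [Site("Azure", "DE"), Site("AWS", "US")]
    print(compliant_failover_sites(primary, candidates, frozenset({"DE"})))
```

With the primary on AWS in Germany and only Germany permitted, the Azure site in Germany is the only candidate that passes and the AWS site in the US is rejected, which matches the scenario described above.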
We are committed to providing feature parity no matter how a customer deploys eG Enterprise – whether that is on Cloud, on-premises or via our managed SaaS service. This means we can offer a large range of redundancy options that are fully compliant with local data control regulations for customers.
- You can read my recent postmortem blogs on Azure outages: Is Azure Down? – Proactive Alerting for Azure Outages (Sep 2022 incident) and Is M365 Down? – Proactive Alerting of a Microsoft Azure Outage (eginnovations.com) (Jan 2023 incident).
- If stringent data control during failover is a concern you may find some of Peter Claridge’s recent articles of interest, particularly On-premises, Cloud First or Cloud Repatriation – What’s the Trend? Which is Best? and How MSPs can Capitalize on the Rush for Localization of IT Services
- Find out more about monitoring for Azure here: Azure Cloud Monitoring Tools for IaaS, PaaS, SaaS (eginnovations.com), with more details on AVD available here: Azure Virtual Desktop Monitoring | eG Innovations.
- If considering a multi-cloud strategy for resilience, you might like to read: Monitoring and Troubleshooting Multi-cloud Infrastructures (eginnovations.com).
- When assessing monitoring tools for cloud, the evaluation checklist provided in this article can be very useful: Top 10 Requirements of Cloud Monitoring Tools (eginnovations.com)