In this blog post, I follow on from my previous blog on AWS CloudWatch Part 1 of 2 to explore how you can go beyond basic agentless CloudWatch monitoring by deploying the CloudWatch agent and some of the key information and planning you need to do this. I’ll also cover how eG Enterprise offers out-of-the-box functionality to avoid complex JSON scripting or tooling to implement monitoring.

Installing a CloudWatch Agent for Better Insight

If you are deploying Linux or Windows OSs on EC2, you will probably want to deploy the CloudWatch agent on each instance to enable a wider range of in-guest metrics. If you install the agent on an Amazon EC2 instance, the metrics it collects are in addition to the agentless metrics enabled by default for Amazon EC2 instances. For an overview of Agentless vs Agent-based monitoring, please see: Agentless Monitoring vs. Agent Based.

There is a reasonable “Getting Started Guide” for the CloudWatch agent and it is worth reviewing the metrics the agent will enable on Windows Server instances and also upon Linux/MacOS instances. On a server running Microsoft Windows OS, installing the CloudWatch agent enables you to collect the metrics associated with any of the counters in Windows Performance Monitor. The CloudWatch metric names for these counters are created by putting a space between the object name and the counter name. For example, the % Interrupt Time counter of the Processor object is given the metric name “Processor % Interrupt Time” in CloudWatch. For more information about Windows Performance Monitor counters, see the Microsoft Windows Server documentation.

What AWS CloudWatch monitoring offers

Figure 1: How an AWS CloudWatch agent complements basic metrics regarding EC2 instance health

The steps involved in deploying CloudWatch’s agented monitoring have evolved but there is a fair amount of configuration of JSON files or python required even to define which perfmon counters you wish to collect.

Ease of Agent Deployment and Configuring Required Parameters on AWS CloudWatch

There is a degree of complexity involved in configuring an AWS CloudWatch agent. Manipulating the JSON configuration files is cumbersome. You need to:

  • Which qualifying metrics to monitor
  • Same metrics have different name and syntax across OS (Win, Linux, AIX, etc.)
  • Maintain and version control the config.json file
  • Handle lengthy Config files – With each increasing metric, the config file tend to become long and prone to mistakes.

For example, to monitor processes on a system:

  • Process name – Each process name must be defined
  • Metric name – Each metric for the process must be defined
  • Metric Collection Interval – Each metric section for metric Collection Interval must be defined if it is to be set other than 60 seconds

Configuring AWS EC2 detailed metrics in AWS CloudWatch

Figure 2: AWS config.json file for configuring process monitoring

Implementing Alerting, Reporting and Dashboards

There are lots of tools within AWS CloudWatch to enable you to configure a monitoring solution but very little ready-to-go functionality. If you are relying on native AWS tools, you may find it helpful to familiarize yourself with the numerous repositories of tutorials that exist to cover common requirements, such as Jackson Connell’s “50 Best AWS CloudWatch Tutorials”.

Out-of-the-box Dashboard

Out-of-the-box there are basic metric dashboards available, but the observability of configured parameters is a somewhat cumbersome task for administrators as they have to toggle between multiple screens and pages to monitor/eyeball the metrics. The basic view of the Default Metrics can be viewed under – EC2 Instances -> Monitoring (see below):

AWS CloudWatch monitoring screen

Figure 3: CloudWatch monitoring screen

The minimal dashboard functionality mean that many AWS users export the CloudWatch metrics stored in logs to other BI tools, such as ELK or Grafana which offer more advanced visualization and there are charges applicable for these tools.

Configuring Alert Thresholds and Alarms

The majority of administrators, especially with large deployments will want to configure their monitoring systems to proactively notify and alert them of problems. It is unviable for an administrator to physically watch dashboards all day or attempt to manually/visually correlate metrics of concern indicative of an issue. AWS does allow thresholds and alarms to be configured but this is a manual process requiring an admin to decide on the thresholds at which individual metrics should trigger alerts; some problems by their very nature will trigger alert storms.

CloudWatch monitoring conditions

The conditions based on which alerts can be triggered allow static and “Anomaly detection” thresholds (comparable to “Dynamic” thresholds in Microsoft Azure cloud) to be set. The limitations of these two types of thresholds are well known, and in practice, enterprise-class monitoring solutions, such as eG Enterprise, have integrated functionality to combine the two genres to reduce false positives; as of today, cloud native monitoring within AWS and Azure does not allow this.

Alert and Event Correlation and Root-Cause Analysis

In practice, a single IT issue can trigger numerous thresholds to be crossed generating alerts, where most alarms raised are secondary symptoms of the true fault or bottleneck. The user-implemented thresholding and alerting in cloud native tools as of the time of writing this blog post do not provide the correlation, event filtering, and AIOps root cause diagnosis capabilities available in dedicated enterprise monitoring products.

Licensing/Costs of AWS CloudWatch

There are a large number of parameters that affect the costs of CloudWatch and it can become extremely complex even with basic monitoring. Many users underestimate the costs as multiple cost factors must be considered for even single service or metric. Key considerations and factors you need to consider include:

  • Configuration and coverage:
    • For CPU utilization – Different config.json files must be maintained for same CPU utilization objective cross different OSs.
    • Process-wise CPU utilization – Each process must be distinctly added to config.json file.
    • If your eventual goal was to be able to routinely monitor the “Top 10 CPU consuming Processes” (a standard feature in monitoring products) – it is essentially impossible to add each process to config.json file. Even if added, the metric counts will run into 1000s.
  • Resolution-based licensing:
    • Multiple resolutions are available: (<1min, >1min, and 5 mins) – there is a different costing for different time resolutions.
  • Data retention:
    • Data points with a period of 60 seconds (1 minute) are available for 15 days.
    • Data points with a period of 300 seconds (5 minute) are available for 63 days.
    • Data points with a period of 3600 seconds (1 hour) are available for 455 days (15 months).
  • Dashboard views – CloudWatch implements per dashboard view based costing.
  • Log Monitoring:
    • Data volume – per GB of data file transferred
    • Storage – per GB of data stored on S3

My colleague wrote a recent blog post covering estimating data volumes when deploying Azure Log Analytics and Azure Monitor to monitor WVD/AVD (Windows/Azure Virtual Desktops) and the process of estimating data volumes is similar on AWS. Particularly when monitoring application processes, it is likely that you will need to accumulate some real-world data of your AWS deployment in production to gain realistic estimates of volumes and costs on-going.

AWS has done a respectable job of documenting the process of estimating costs via a number of sample calculations; if your deployment closely matches one of these examples (found at the bottom of this page), they may be very helpful. They give a good overview of many additional types of monitoring that can be added to a deployment and customizations and configurations an administrator is likely needed to make.

Summary of the Challenges of Using AWS CloudWatch

If using AWS CloudWatch alone, an administrator should consider and plan how to invest time to address inherent challenges, including:

  • Monitoring – Though you can monitor at very granular levels, it will become very complex to manage the config file defining what metrics to collects. Manual maintenance of config files is prone to errors.
  • Event correlation – A lot of effort and skills will be required to correlate events, suppress false positive, and help drill down to probable cause.
  • Licensing and Costs – This will remain an open-ended question. Budgeting will be difficult as you grow and may lead to unknown expenses and leakages. Additionally, one has to keep in mind the subsequent services like SNS, SSM, EBS, etc. that will be required for effective monitoring and reporting.
  • Ease of deployment – AWS Systems Manager may help in a few of the monitoring operations like config release management. However, config files still remain a manual and complicated task. Because of this, and limitations of AWS in consuming a number of metrics, many admins create their own scripts, do the processing needed in their script, and just use AWS as the display tool.
  • Observability – Monitoring just by looking at an alarm is not an effective method. Maneuvering across various dependent components and drilling down for root-cause analysis is a key to faster mean time to repair (MTTR).
  • Dashboards – They can be extremely handy but will require thoughtful and methodical mechanics in the prior steps of qualifying metrics and building strong correlations. Also, scripting and programming knowledge on LAMBDA, Python, etc. plays a major role in building an effective dashboard for respective teams.

How eG Enterprise Simplifies Monitoring of AWS EC2 Instances

For many of our customers, especially those choosing hybrid deployments, choosing the native AWS CloudWatch options and an AWS-only interface doesn’t make sense as it is an extra monitoring tool requiring staff to train and maintain and monitor costs. AWS themselves acknowledge that for many of their customers, CloudWatch alone is insufficient or simply too complex for some customers.

eG Enterprise makes it very simple to monitor your AWS EC2 instances. By integrating with AWS CloudWatch APIs, eG Enterprise provides an “outside view” of the performance of all the EC2 instances. Other than providing credentials to access AWS CloudWatch APIs, there is no configuration necessary. Pre-configured thresholds alert administrators to issues that affect the performance of EC2 instances and applications running on them. Figure 4 below shows some of the key metrics for EC2 instance performance on eG Enterprise.

Monitoring AWS EC2 instances with eG Enterprise

Figure 4: Monitoring of AWS EC2 instances in eG Enterprise

In-guest metrics can be obtained by simply installing an agent inside the EC2 instance. You need not manually install the agent each time. You can embed the agent in your Amazon Machine Images (AMI), or you can use an orchestration tool like Terraform or Cloud Formation to install the agents as your EC2 instances are provisioned. The default built-in profile of a Windows or Linux system on eG Enterprise ensures that admins don’t have to spend time and effort choosing what metrics to collect. Automatically, the agents collect the pre-defined metrics and you can visualize them in the eG monitoring dashboard (see Figure 5). Everything from CPU and memory utilization, disk IOPS, page file usage, network traffic and errors, TCP retransmissions, status of key services and events in the Windows event logs are monitored by default.

Monitoring of an EC2 instance running Microsoft Windows OS

Figure 5: Stack model of an EC2 instance running Microsoft Windows OS

Different visualizations are available. Besides the layer model view, there are other dashboards you can use to visualize the data. If you want, you can create your own custom dashboard as well.

Detailed diagnosis is also automated in eG Enterprise. When a metric changes state to an abnormal state, the agent automatically collects additional details to enable IT admins troubleshoot the problem quickly. For example, in Figure 6 below, you see the detailed diagnosis of a Linux instance that is seeing high CPU utilization. The detailed diagnosis shows the top 10 CPU consuming processes on this instance. Again, administrators do not have to pick and choose what they want to see. eG Enterprise’s detailed diagnosis is pre-tuned to highlight the key details needed to troubleshoot each threshold violation.

Detailed diagnosis of a Linux EC2 instance

Figure 6: Detailed diagnosis of a Linux EC2 instance showing the top processes taking up CPU

Single pane of glass monitoring with eG Enterprise

Here’s a comparison of the capabilities of eG Enterprise and AWS CloudWatch as it relates to monitoring EC2 instances.

AWS CloudWatch eG Enterprise
Ability to collect key performance metrics per instance including CPU credit balance Yes Yes, by integration with CloudWatch APIs
Simple agent deployment for in-guest metrics Yes Yes
Ability to collect system KPIs within the guest without much configuration No Yes
Ability to collect detailed diagnosis data for rapid troubleshooting No Yes
Cost depends on metric collection frequency and number of metrics collected Yes No
Extend the same agent to collect application-level performance metrics No Yes
Ability to collect configuration data about the OS (patches, hot fixes, etc.) and report on changes No Yes
Ability to automatically monitor key system logs Only by writing custom scripts Out of the box

More details of AWS monitoring with eG Enterprise is available here: eG Enterprise for AWS Monitoring. eG Enterprise is also available as SaaS in multiple AWS regions from AWS marketplace, here: AWS Marketplace: eG Enterprise Express Cloud (amazon.com).

Using eG Enterprise allows AWS users to go beyond CloudWatch to:

  • Monitor all their AWS components such as EKS, EC2, ELB, RDS, SNS, Amazon Lambda and more, within a single product as a “single pane of glass”, alongside other clouds, such as Google GCP or Microsoft Azure and alongside on-premises physical and virtualized infrastructure, such as Citrix/VMware/Nutanix.
  • Use detailed out-of-the-box or customizable reports for forecasting, capacity planning, and management insights based on data that can be retained on whatever timescales suit an organization.
  • Deploy metric collection and set up alert/alarm thresholds out-of-the-box with pre-configured industry standard values that can be modified via a GUI, avoiding the need for administrators to code or configure with python or JSON files.

Learn More