How a full stack monitoring solution helped our customer with Application Performance Troubleshooting on AWS Cloud
Summary
Here’s a myth that needs to be debunked – the cloud (e.g., AWS or Azure) will take care of my performance problems!
Our experience shows that cloud architecture usually introduces new layers of complexities that did not exist in the on-premises world. You need a modern AI-powered full stack monitoring solution to find the needle in the multi-layered haystack that is the cloud.
Sometimes, it’s the cloud vendor who has to fix the issue. An example could be a noisy or defective physical host OS that you have no access or visibility into. You need the right information, in the form of logs, metrics, traces and events, to back up your case with evidence in conversations with the cloud provider’s support team.
In this blog post, we describe a problem analysis and anomaly detection process for a cloud performance problem (high CPU in the JVM and SSL issues) that we encountered recently when working with a large customer, who had a significant footprint on AWS cloud.
The Cloud Does Not Make Application Performance Monitoring and Troubleshooting Simpler
Cloud migration and digital transformation of business-critical applications on the cloud are on the rise. At the same time, some IT executives have the misconception that when applications are migrated to the cloud, monitoring their performance becomes easier. The cloud service provider (e.g., Amazon) takes care of most of the key elements that support the application, and hence, they believe that there are fewer challenges when applications are on the cloud vs. being deployed on-premises.
This is often far from true.
This case study exposes the challenges that organizations deploying applications on the cloud face when performance issues happen and how difficult it can be to diagnose such problems. While migration to the cloud does simplify and automate several routine administration activities, performance monitoring is one area that is not necessarily simplified!
The scenario covered may be useful for those evaluating the level of insight and troubleshooting capabilities various cloud tools can offer.
Target Application and Infrastructure on AWS Cloud
Our customer is a mid-sized business that offers SaaS services to clients. The SaaS application was Java-based and was hosted on AWS cloud, and relied on AWS RDS for the backend database. eG Enterprise was used by the customer to monitor the SaaS application and the AWS services in use.
The application had been operational for several months and there had been no complaints about performance.
The Initial Alert
The CPU usage of the AWS EC2 VM hosting the key application had spiked all of a sudden around 5 am and remained high for several hours:
At this time, application performance also suffered. This was evident from eG Enterprise’s synthetic monitoring capability, which the customer had configured to proactively detect issues before real users are affected. Initially, when the problem started, application response time was poor. Over time, gaps appear in the response time graph in Figure 2 because the application became unavailable.
Below is the graph of application availability over the same time period. You can clearly see TCP connection availability dropping to zero several times, indicating that users were not able to connect to the application. This information was also highlighted in critical alerts sent to administrators and displayed on the overview dashboards in eG Enterprise to provide actionable notifications.
Analyzing Application Performance in Detail
To dig deeper, we examined the application’s performance in detail. Figure 4 below shows the CPU usage of the JVM (Java Virtual Machine) used by the application on the problematic AWS VM. The JVM’s CPU usage follows the same pattern as the VM’s CPU usage, indicating that the application’s JVM had been affected.
High CPU usage within the JVM can have numerous causes, and often it is a symptom of bugs within the application’s Java code. High CPU is a symptom, not a root cause. Reasons for high CPU range from host OS issues and poorly sized JVMs to memory leaks and code-level deadlocks. My colleague has written a blog covering some common problems, see: How to Troubleshoot Java CPU Usage Issues | JVM High CPU Threads (eginnovations.com).
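As an aside, if you want to confirm at the OS level that the JVM process itself is driving host CPU, a minimal sketch like the one below (our illustration, not part of the customer’s tooling) samples the JVM’s process CPU load alongside the host CPU load using the JDK’s com.sun.management.OperatingSystemMXBean:

```java
import java.lang.management.ManagementFactory;
import com.sun.management.OperatingSystemMXBean;

public class CpuSampler {
    public static void main(String[] args) throws InterruptedException {
        OperatingSystemMXBean os = (OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();

        // Sample every 5 seconds; if process CPU closely tracks host CPU,
        // this JVM is the main contributor to host CPU usage.
        while (true) {
            double processCpu = os.getProcessCpuLoad();  // 0.0 - 1.0, this JVM only
            double systemCpu  = os.getSystemCpuLoad();   // 0.0 - 1.0, whole host
                                                         // (on JDK 14+ use getCpuLoad())
            System.out.printf("JVM CPU: %.1f%%  Host CPU: %.1f%%%n",
                    processCpu * 100, systemCpu * 100);
            Thread.sleep(5_000);
        }
    }
}
```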
The impact of Java Garbage Collection on application performance
Java Garbage Collection (GC) is intrinsically a CPU-intensive operation.
Most Java GC algorithms (including G1GC) must halt all application threads for at least part of the collection, a process referred to as stopping-the-world (STW) or a GC pause.
During these pauses, the JVM can use all of the machine’s cores to perform GC and reclaim memory before the application threads are resumed.
Best Practice: Look for full stack monitoring tools that can help you diagnose the root cause of high CPU across the various layers and tiers (Operating System, JVM threads/heap, application code) and connect the dots to the affected business transactions so you can quantify business impact. Full stack tools can correlate GC activity at the JVM level, thread activity at the code level and CPU utilization even at a request-by-request level. This will help you triage faster and engage the right team to fix the issue.
The JVM Garbage Collector (GC) can also be a source of high CPU, especially when application memory leaks are at play. A JVM that knows it is running low on a key resource, such as heap memory, can get itself into a state where it keeps desperately trying to reclaim memory. A quick check eliminated the possibility of a GC issue in this case. See Figure 5, which shows the historical GC activity in the JVM. The percentage of time spent by the JVM on GC activities had not changed significantly during the problematic period. Hence, GC activity was not the reason why the application’s CPU usage had spiked.
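If you want to reproduce this kind of GC-time check without a monitoring product, a minimal sketch using the JDK’s GarbageCollectorMXBean is shown below; the sampling interval and output format are our own illustrative choices:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeMonitor {
    public static void main(String[] args) throws InterruptedException {
        long intervalMs = 60_000;           // sample once a minute
        long lastGcMillis = totalGcTime();

        while (true) {
            Thread.sleep(intervalMs);
            long gcMillis = totalGcTime();
            double gcPercent = 100.0 * (gcMillis - lastGcMillis) / intervalMs;
            System.out.printf("Time spent in GC over last minute: %.2f%%%n", gcPercent);
            lastGcMillis = gcMillis;
        }
    }

    // Sum of accumulated collection times across all collectors in this JVM
    private static long totalGcTime() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += gc.getCollectionTime();  // milliseconds since JVM start
        }
        return total;
    }
}
```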
Java threading issues are a common cause of application problems. So, the next step was to check whether common threading issues were present, and if so, whether they were root causes or simply manifestations and symptoms of issues elsewhere. Unfortunately, both are common possibilities!
Figure 6 shows the total number of threads in the JVM over time. While there had been an increase, it was not significant. This indicated that the volume of application processing in the JVM was not the issue.
3 Use Cases for which continuous thread analysis is key
Use case #1: Identify the dreaded Achilles heel of threaded applications – locks and deadlocks that impact your scalability.
Use case #2: Automatically identify CPU-hungry threads and thread groups. On the cloud, CPU is money.
Use case #3: Pinpoint the root-cause of the thread anomalies to the specific processes and microservices so the right team can be alerted to fix the issue.
The Java Thread Analysis modules within eG Enterprise further indicated that no single thread in the JVM was consuming a lot of CPU. There were many threads, each taking a small amount of CPU (1-2%), but together these threads drove the overall CPU usage high.
The auto-baselined thresholds for blocked threads had, however, triggered critical alerts when the problem happened. Reviewing the historical data, we could see significant thread blocking in the JVM, which was anomalous. A quick check in the detailed diagnosis tool in eG Enterprise (see Figure 8) revealed there were issues in SSL processing at the JVM level, not in the application code.
Clicking through to identify the blocking thread showed that SSL memory caching in the JVM seemed to have triggered the issue (see Figure 9). Many Java threads were stuck in SSL processing, and it was this that was consuming the excessive CPU.
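To illustrate how this kind of thread-level evidence can be gathered, here is a simplified sketch (not eG Enterprise’s detailed diagnosis) that uses the JDK’s ThreadMXBean to list per-thread CPU time and report BLOCKED threads along with the monitor and lock owner involved:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BlockedThreadReport {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            // Per-thread CPU time in nanoseconds (-1 if CPU time measurement is disabled)
            long cpuNanos = threads.getThreadCpuTime(info.getThreadId());
            System.out.printf("%-40s state=%-13s cpu=%dms%n",
                    info.getThreadName(), info.getThreadState(), cpuNanos / 1_000_000);

            // A BLOCKED thread is waiting on a monitor held by another thread;
            // the top stack frame shows where it is stuck.
            if (info.getThreadState() == Thread.State.BLOCKED) {
                System.out.println("  blocked on: " + info.getLockName()
                        + " held by: " + info.getLockOwnerName());
                if (info.getStackTrace().length > 0) {
                    System.out.println("  at: " + info.getStackTrace()[0]);
                }
            }
        }
    }
}
```

In this incident, a report of this kind would have shown many threads blocked with sun.security.util.MemoryCache.put at the top of their stacks.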
By simply googling the blocked class/method, i.e., “sun.security.util.MemoryCache.put”, one can find a number of links that point to known Java issues:
- https://bugs.java.com/bugdatabase/view_bug?bug_id=8259886
- https://access.redhat.com/solutions/4056181
- https://bugs.java.com/bugdatabase/view_bug?bug_id=8218415
Vendor-recommended changes, such as setting different cache sizes and timeouts, were attempted, but the problem was not resolved.
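For reference, tuning of this kind looks roughly like the sketch below. It uses the JDK’s standard SSL session cache settings; the cache size and timeout values are purely illustrative, and as noted above, changes along these lines did not resolve the underlying problem:

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSessionContext;

public class SslCacheTuning {
    public static void main(String[] args) throws Exception {
        // Option 1: cap the default SSL session cache size JVM-wide
        // (must be set before JSSE initializes the first SSLContext).
        System.setProperty("javax.net.ssl.sessionCacheSize", "10000");

        // Option 2: tune the cache on a specific SSLContext programmatically.
        SSLContext ctx = SSLContext.getDefault();
        SSLSessionContext serverSessions = ctx.getServerSessionContext();
        serverSessions.setSessionCacheSize(10_000);  // max cached sessions
        serverSessions.setSessionTimeout(300);       // seconds before a session expires

        System.out.println("Session cache size: " + serverSessions.getSessionCacheSize());
        System.out.println("Session timeout: " + serverSessions.getSessionTimeout() + "s");
    }
}
```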
Perplexed! Where is the Root-Cause?
At this point, the customer asked for our advice as they were stumped. We performed some really obvious sanity checks. Often, a customer says they “haven’t changed anything”, but when you check the eG Enterprise configuration database and historical data, it tells a very different story! In this case, though, absolutely nothing had changed in the application for several weeks. The application code had been working fine for months. No patches had been deployed at the application level or the OS level. It was all rather mysterious!
Could there have been an SSL attack? No. TCP connection activity didn’t change much (see Figure 10):
We checked what other alerts had been triggered around the time of the incident. BINGO! We got our clue – TCP retransmissions had increased significantly around the same time (see Figure 11).
1 to 5 segments retransmitted per second used to be normal for this application. Observe that the value had risen to over 200 segments/sec when the problem happened (see Figure 11). That’s a 40-fold increase! There was also a clear correlation in time: TCP retransmissions increased right around the time system CPU shot up. eG Enterprise uses historical data within an AIOps engine to set meaningful thresholds and enable anomaly detection. By learning what is normal for an application or infrastructure, anomalies within dynamic environments can often be detected early (read more).
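To convey the general idea of baselining (a deliberately naive sketch, not how eG Enterprise’s AIOps engine is actually implemented), the example below learns the mean and standard deviation of recent samples and flags values that deviate far from them:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Naive rolling-baseline anomaly detector for a single metric. */
public class BaselineDetector {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;
    private final double sigmaThreshold;

    public BaselineDetector(int windowSize, double sigmaThreshold) {
        this.windowSize = windowSize;
        this.sigmaThreshold = sigmaThreshold;
    }

    /** Returns true once the baseline is learned and the sample deviates far from it. */
    public boolean isAnomaly(double sample) {
        boolean anomalous = false;
        if (window.size() == windowSize) {
            double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
            double variance = window.stream()
                    .mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0);
            double stdDev = Math.sqrt(variance);
            anomalous = Math.abs(sample - mean) > sigmaThreshold * Math.max(stdDev, 1e-9);
            window.removeFirst();
        }
        window.addLast(sample);
        return anomalous;
    }

    public static void main(String[] args) {
        // TCP retransmissions/sec: ~1-5 is normal for this app, 200+ is anomalous.
        BaselineDetector detector = new BaselineDetector(10, 3.0);
        double[] samples = {2, 3, 1, 4, 2, 3, 2, 5, 3, 2, 210, 230};
        for (double s : samples) {
            System.out.printf("%.0f -> %s%n", s, detector.isAnomaly(s) ? "ANOMALY" : "ok");
        }
    }
}
```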
A primer on TCP packet retransmissions and how they impact system performance
What are TCP Retransmissions?
TCP enables two hosts to establish a connection and exchange streams of data. TCP guarantees delivery of data and also guarantees that packets will be delivered in the same order in which they were sent.
TCP retransmission refers to the process of resending packets over the network that have been either lost or damaged. Retransmission is a mechanism used by TCP to provide reliable communication.
When does retransmission happen?
Retransmission happens when the sender does not receive an “ACK” (acknowledgment) for a segment within the expected time, typically because the segment was lost or damaged in transit, or the ACK itself never arrived.
The sender then retransmits the lost or damaged segment. Once the receiver determines that it has received a segment successfully, an “ACK” is sent back to the sender.
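If you want to measure this yourself on a Linux VM, the kernel exposes cumulative TCP counters; the sketch below (which assumes a Linux host and the standard /proc/net/snmp layout) converts the RetransSegs counter into a retransmitted-segments-per-second rate, similar to the metric shown in Figure 11:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class RetransMonitor {
    public static void main(String[] args) throws IOException, InterruptedException {
        long intervalMs = 10_000;
        long last = readRetransSegs();
        while (true) {
            Thread.sleep(intervalMs);
            long now = readRetransSegs();
            double perSecond = (now - last) * 1000.0 / intervalMs;
            System.out.printf("TCP retransmitted segments/sec: %.1f%n", perSecond);
            last = now;
        }
    }

    /** Reads the cumulative RetransSegs counter from /proc/net/snmp (Linux only). */
    private static long readRetransSegs() throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("/proc/net/snmp"));
        String header = null;
        for (String line : lines) {
            if (!line.startsWith("Tcp:")) continue;
            if (header == null) {
                header = line;                      // first Tcp: line holds column names
            } else {
                List<String> names  = Arrays.asList(header.split("\\s+"));
                List<String> values = Arrays.asList(line.split("\\s+"));
                return Long.parseLong(values.get(names.indexOf("RetransSegs")));
            }
        }
        throw new IOException("RetransSegs counter not found in /proc/net/snmp");
    }
}
```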
Why do TCP retransmissions happen?
TCP retransmissions can be caused by a number of networking issues:
- A poor or lossy network connection is a common cause.
- A faulty NIC or driver on the sender or receiver OS can result in packet loss.
- Issues with firewalls and proxies that lie on the path between the sender and the receiver can cause retransmissions.
- When a router on the intervening network path is heavily loaded, it might have buffer overruns leading to lost packets.
- Network congestion in a LAN can also cause network packet loss.
- Different TCP segments from the sender can take different routes to reach the receiver, and the delay differences between routes can be so significant that the receiver cannot handle the resulting volume of out-of-order segments, leading to drops and retransmissions.
- In a virtual environment, hypervisor issues can also lead to packet loss during VM-to-VM communication.
How do they impact system performance?
When packet loss over the network is significant, it can result in several abnormalities:
- Existing SSL connections could be dropped, and new connections must be established. This adds latency as well as processing overhead on the endpoints. SSL/TLS handshakes, which have to happen for each new connection, involve certificate exchanges between the endpoints and take up CPU resources.
- Applications on the endpoints typically maintain a cache of currently established SSL connections. When network issues happen, connections linger for longer and can also stay in the negotiation phase for longer. This causes the SSL connection cache on the endpoints to grow larger than usual, increasing memory requirements. The increased memory pressure can in turn trigger garbage collection, which consumes CPU. In addition, the larger the cache, the longer concurrent accesses to it take and the more work is needed to reorganize it when a connection starts or ends. This also adds to the CPU usage on the endpoints.
Best practice: Track logs, metrics, traces and events in a unified manner, so you get the complete picture of performance anomalies in a system.
The customer had a premier support contract with the cloud service provider. When they contacted the service provider’s support desk, they were told that there were no issues at their end. When the customer provided the data collected to show the excessive TCP retransmissions, the support desk suggested a system reboot to force the VM to move from one physical host to another.
And just like that, the problem went away! Immediately, TCP retransmissions dropped and CPU usage went down (see Figure 12).
Just to check, when the VM was moved back to the original host, the problem returned (see Figure 13). Based on this behavior, we suspect that it may have been a malfunctioning NIC on the physical server, or a driver issue on that server, rather than a wider networking issue in the cloud provider’s data center.
- Most on-prem admins will have experience with weird/challenging VMware ESXi, Citrix Hypervisor, or Microsoft Hyper-V bugs with intermittent or perplexing symptoms that impact application performance. Imagine trying to debug such an issue with no access to the hypervisor – that’s what you may have to do on public cloud!
Based on the analysis of this problem, the customer submitted a helpdesk report to the cloud provider with evidence from eG Enterprise to get a credit for the hours when the application performance issue had occurred.
Incident postmortem of the application performance anomaly
- There was a network issue on the physical host used by the cloud service provider. This possibly affected TCP connection handling on the VMs running on that host.
- The network issue caused retransmissions at the TCP level.
- Retransmission during SSL handshakes meant there were more SSL connections waiting to be processed in the application stack.
- When the SSL connection cache in the JVM grew too large, it caused synchronization issues in the JVM. This caused CPU usage to spike, and the application became slow and, at times, unavailable.
- When the network issue was circumvented by moving to another host, application performance was back to normal.
Application Performance Troubleshooting: Key Takeaways
The real-life story we have described here highlights the challenges faced by organizations migrating applications to the cloud. Here are my five key takeaways regarding application performance on the cloud:
- Operating applications in a cloud environment is challenging: you no longer have complete visibility. When there is an issue, you often hear “it’s not us” from your cloud service provider. If you are thinking “I will go to the cloud and won’t have any performance issues anymore”, think again.
- You must have full stack visibility. You can’t just monitor the application alone. In a cloud environment, you need as much proof as possible when you speak to your cloud service provider.
- Monitor as many parameters as possible. You never know where you will get the clue that helps diagnose a problem. We knew TCP retransmissions tend to affect application performance; however, we didn’t necessarily anticipate such a huge CPU impact because of them. Tools that selectively pick a handful of KPIs and report on them will only catch the obvious issues or the more common root causes that you could find yourself within a few minutes. You need as much visibility as possible, so that you can provide proof when you contact your cloud service provider.
- Historical insights are extremely important. You are often asked “what changed”, so it is important to track config changes and to know what was updated. Your application needs audit logging so that you know what changed within the application. You need to monitor the application config and the OS config so that you know what patches, hotfixes, or config changes were made and can correlate any performance issues with them. At the same time, you also need usage and performance baselines for your infrastructure so you know what normal looks like. Several times during our analysis, we checked these statistics and used them for anomaly detection.
- And YES sometimes – It’s not you, it’s the cloud!
eG Enterprise is an Observability solution for Modern IT. Monitor digital workspaces, web applications, SaaS services, cloud and containers from a single pane of glass.
Further reading
- If you enjoyed this postmortem blog post, you may enjoy this similar one: Troubleshooting Web Application Performance & SSL Issues
- An overview of Java Performance Monitoring Tools, which enable you to prioritize problems automatically and provide actionable notifications
- Top 10 Java Performance Problems – An in-depth guide to the most common Java issues and identifying them
- Monitoring SSL Certificates in Business-Critical Applications (eginnovations.com)
- Section 9 “Monitoring TCP Activity” in our troubleshooting guide details debugging and understanding TCP retransmission issues and their causes, see: Server Performance Monitoring – KPIs & Metrics
- My previous deep-dive post-mortem blog post – debugging slow performance on AWS public cloud burstable instances on EC2, see: AWS EC2 Monitoring Tools | eG Innovations
- More on how eG Enterprise leverages AIOps technologies for anomaly detection: AIOps Tools – 8 Proactive Monitoring Tips | eG Innovations