How a full stack monitoring solution helped our enterprise customer pinpoint the cause of slowness in AWS Cloud
The Cloud Does Not Make Performance Monitoring Simpler
Cloud migration and digital transformation of business-critical applications on the cloud are on the rise. At the same time, some IT executives have the misconception that when applications are migrated to the cloud, monitoring their performance becomes easier. The cloud service provider takes care of most of the key elements that support the application, and hence, they believe that there are fewer challenges when applications are on the cloud vs. being deployed on-premises.
This is often far from true.
This case study exposes the challenges that organizations deploying applications on the cloud face when performance issues happen and how difficult it can be to diagnose such problems. While migration to the cloud does simplify and automate several routine administration activities, performance monitoring is one area that is not necessarily simplified!
The scenario covered may be useful for those evaluating the level of insight and troubleshooting capabilities various cloud tools can offer.
Target Application and Infrastructure in AWS Cloud
Our customer is a mid-sized business that offers SaaS services to clients. Its Java-based SaaS application was hosted on the AWS cloud and relied on AWS RDS for the backend database. The customer used eG Enterprise to monitor the SaaS application and infrastructure.
The application had been operational for several months and there had been no complaints about performance.
The Initial Alert
The CPU usage of the VM hosting the key application had spiked all of a sudden around 5 am and remained high for several hours:
Figure 1: CPU usage of the VM had spiked suddenly
At this time, application performance also suffered. This was evident from eG Enterprise’s synthetic monitoring capability, which the customer had configured to detect issues proactively, before real users were affected. When the problem started, application response time was poor. Later, gaps appear in the response time graph in Figure 2 because the application had become unavailable.
Figure 2: Response time of the application, measured using synthetic monitoring
Below is the graph of application availability over the same period. You can clearly see TCP connection availability dropping to zero several times, indicating that users were unable to connect to the application. This information was also highlighted in critical alerts sent to administrators and displayed on the overview dashboards in eG Enterprise, so that administrators could act on it.
Figure 3: Application availability measured using synthetic monitoring
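To make the idea of "TCP connection availability" concrete, here is a minimal sketch of a synthetic TCP check in Java. It is not how eG Enterprise implements its synthetic monitoring; the class name `TcpProbe`, the method `connectMillis`, and the timeout values are assumptions made up for illustration. The probe simply tries to open a socket and reports either the connect time or a failure.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Minimal TCP availability probe (illustrative only; names are hypothetical). */
class TcpProbe {

    /**
     * Tries to open a TCP connection and returns the connect time in milliseconds,
     * or -1 if the connection could not be established within the timeout.
     */
    static long connectMillis(String host, int port, int timeoutMs) {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return (System.nanoTime() - start) / 1_000_000;
        } catch (IOException e) {
            return -1; // refused, unreachable, or timed out: count as unavailable
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo against a throwaway local listener so the example runs anywhere.
        try (ServerSocket server = new ServerSocket(0)) {
            long ms = connectMillis("127.0.0.1", server.getLocalPort(), 2_000);
            System.out.println(ms >= 0
                    ? "available, connect took " + ms + " ms"
                    : "unavailable");
        }
    }
}
```

Running such a probe on a schedule and recording the -1 results over time would produce the kind of availability drops to zero visible in Figure 3.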
Analyzing Application Performance in Detail
To investigate further, we examined the application’s performance in detail. Figure 4 below shows the CPU usage of the JVM (Java Virtual Machine) used by the application on the problematic VM. The JVM CPU usage follows the same pattern as the VM’s CPU usage, indicating that the application’s JVM had been affected.
Figure 4: CPU usage of the application’s JVM
High CPU usage within the JVM can have numerous causes, and it is often a symptom of bugs in the application’s Java code. High CPU is a symptom – not a root cause. The reasons can range from host OS issues and poorly sized JVMs to memory leaks and code-level deadlocks. My colleague has written a blog covering some common problems; see: How to Troubleshoot Java CPU Usage Issues | JVM High CPU Threads (eginnovations.com).
The JVM Garbage Collector (GC) can also be a source of high CPU, especially if application memory leaks are at play. A JVM that is running low on key resources, such as heap memory, can get into a state where it keeps trying desperately to reclaim memory. A quick check eliminated the possibility of a GC issue in this case. See Figure 5, which shows the historical GC activity in the JVM. The percentage of time spent by the JVM on GC activities had not changed significantly during the problematic period. Hence, GC activity was not the reason the application’s CPU usage had spiked.
Figure 5: GC activity of the Java application indicated no anomalies
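The "percentage of time spent on GC" metric can be approximated from inside any JVM with the standard `java.lang.management` API. The sketch below is an illustration only, not the way eG Enterprise collects the metric; the class name `GcOverhead` and the 5-second sampling interval are assumptions. Note that what `getCollectionTime()` counts depends on the collector in use, so treat the result as an approximation.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Rough "percentage of time in GC" sampler (illustrative; names are hypothetical). */
class GcOverhead {

    /** Total milliseconds spent in GC across all collectors since JVM start. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Percentage of a wall-clock interval that was spent in GC. */
    static double gcPercent(long gcMillisBefore, long gcMillisAfter, long intervalMillis) {
        if (intervalMillis <= 0) {
            return 0.0;
        }
        return 100.0 * (gcMillisAfter - gcMillisBefore) / intervalMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        long before = totalGcMillis();
        long start = System.currentTimeMillis();
        Thread.sleep(5_000); // sampling interval; a real monitor would loop
        long elapsed = System.currentTimeMillis() - start;
        System.out.printf("GC time: %.2f%% of the last %d ms%n",
                gcPercent(before, totalGcMillis(), elapsed), elapsed);
    }
}
```

A value that stays flat while CPU climbs, as in Figure 5, is what lets you rule out GC as the cause.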
Java threading issues are a common cause of application problems. So, the next step was to check whether any common threading issues were present and, if so, whether they were the root cause or simply symptoms of issues elsewhere. Unfortunately, both are common possibilities!
Figure 6 shows the total number of threads in the JVM over time. While there had been an increase, the increase was not too significant. This indicated that application processing in the JVM was not the issue.
Figure 6: Tracking the number of threads in the application’s JVM
The Java Thread Analysis modules within eG Enterprise further indicated that no specific thread in the JVM was taking a lot of CPU. There were many threads, each taking a small amount of CPU – 1-2%, but a number of these threads caused the overall CPU to be high.
Figure 7: Details of threads in the JVM
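The observation above, many threads each using 1-2% CPU but adding up to a high total, can be reproduced with the JDK's `ThreadMXBean`. The following is a minimal sketch under stated assumptions: the class name `ThreadCpuSampler`, its methods, and the 2-second sampling window are hypothetical, and this is not the eG Enterprise thread analysis module. It samples cumulative CPU time per thread twice and converts the difference into a percentage of one core.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

/** Per-thread CPU sampler (illustrative; names are hypothetical). */
class ThreadCpuSampler {

    /** Snapshot of cumulative CPU time (nanoseconds) per live thread ID. */
    static Map<Long, Long> snapshot(ThreadMXBean threads) {
        Map<Long, Long> cpu = new HashMap<>();
        for (long id : threads.getAllThreadIds()) {
            long nanos = threads.getThreadCpuTime(id); // -1 if the thread has died
            if (nanos >= 0) {
                cpu.put(id, nanos);
            }
        }
        return cpu;
    }

    /** CPU use of one thread over an interval, as a percentage of one core. */
    static double percentOfOneCore(long cpuNanosBefore, long cpuNanosAfter, long intervalNanos) {
        if (intervalNanos <= 0) {
            return 0.0;
        }
        return 100.0 * (cpuNanosAfter - cpuNanosBefore) / intervalNanos;
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isThreadCpuTimeSupported()) {
            System.out.println("Per-thread CPU time is not supported on this JVM");
            return;
        }
        Map<Long, Long> before = snapshot(threads);
        long start = System.nanoTime();
        Thread.sleep(2_000); // sampling interval
        long interval = System.nanoTime() - start;

        double total = 0.0;
        for (Map.Entry<Long, Long> e : snapshot(threads).entrySet()) {
            Long earlier = before.get(e.getKey());
            if (earlier == null) {
                continue; // thread started during the interval
            }
            double pct = percentOfOneCore(earlier, e.getValue(), interval);
            total += pct;
            ThreadInfo info = threads.getThreadInfo(e.getKey());
            String name = (info != null) ? info.getThreadName() : "<exited>";
            System.out.printf("%-40s %6.2f%%%n", name, pct);
        }
        System.out.printf("Sum over all threads: %.2f%% of one core%n", total);
    }
}
```

Looking at the sum as well as the individual rows matters here: no single thread stands out, yet the total can still saturate the CPU.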
However, the auto-baselined thresholds for blocked threads had triggered critical alerts when the problem happened. Reviewing the historical data, we could see significant thread blocking in the JVM, which was abnormal for this application. A quick check in the detailed diagnosis tool in eG Enterprise (see Figure 8) revealed that there were issues in SSL processing at the JVM level – not in the application code.
Figure 8: Detailed diagnosis shows that thread blocking happened in the JVM
Clicking through to identify the blocking thread showed that SSL Memory Caching in the JVM seemed to have triggered the issue (see Figure 9). Many Java threads were stuck in SSL processing – and it was this that was consuming excessive CPU.
Figure 9: Thread blocking caused by synchronized access to the SSL Memory Cache in the JVM
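For readers without a monitoring tool to hand, a similar view of blocked threads can be obtained from a thread dump. The sketch below uses the JDK's `ThreadMXBean.dumpAllThreads` and groups BLOCKED threads by the top stack frame at which they are waiting. The class name `BlockedThreadReport` and the demo workers are hypothetical; the shared lock merely stands in for a contended resource such as a synchronized cache. In the incident described here, the frame that showed up was `sun.security.util.MemoryCache.put`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Map;
import java.util.TreeMap;

/** Groups BLOCKED threads by where they are stuck (illustrative; names are hypothetical). */
class BlockedThreadReport {

    /** Counts BLOCKED threads, keyed by the class and method of their top stack frame. */
    static Map<String, Integer> blockedByFrame(ThreadInfo[] infos) {
        Map<String, Integer> counts = new TreeMap<>();
        for (ThreadInfo info : infos) {
            if (info == null || info.getThreadState() != Thread.State.BLOCKED) {
                continue;
            }
            StackTraceElement[] stack = info.getStackTrace();
            String frame = stack.length > 0
                    ? stack[0].getClassName() + "." + stack[0].getMethodName()
                    : "<no stack>";
            counts.merge(frame, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        final Object sharedLock = new Object(); // stands in for a contended cache
        Runnable worker = () -> {
            synchronized (sharedLock) {
                try {
                    Thread.sleep(1_000); // hold the lock so the others block
                } catch (InterruptedException ignored) {
                    Thread.currentThread().interrupt();
                }
            }
        };
        for (int i = 0; i < 4; i++) {
            Thread t = new Thread(worker, "worker-" + i);
            t.setDaemon(true);
            t.start();
        }
        Thread.sleep(200); // let the workers pile up on the lock

        ThreadInfo[] dump = ManagementFactory.getThreadMXBean().dumpAllThreads(true, true);
        blockedByFrame(dump).forEach(
                (frame, n) -> System.out.println(n + " thread(s) blocked at " + frame));
    }
}
```

Many threads reported against the same frame is the pattern to look for: it points at one synchronized section rather than at scattered application code.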
By simply googling the blocked class/method, i.e., “sun.security.util.MemoryCache.put”, one can find a number of links that point to known Java issues.
Vendor-recommended changes, such as setting different cache sizes and timeouts, were attempted, but they did not resolve the problem.
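The case study does not record exactly which settings were changed or to what values. As an illustration only, one way to adjust the SSL session cache size and timeout in Java is through the standard `SSLSessionContext` API. The class name `SslSessionCacheTuning` and the values 5,000 entries and 3,600 seconds below are assumptions chosen for the example, not the customer's configuration.

```java
import java.security.GeneralSecurityException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSessionContext;

/** Adjusts the SSL session cache of a context (illustrative; names and values are hypothetical). */
class SslSessionCacheTuning {

    /** Applies a cache size (entries) and timeout (seconds) to the server-side session cache. */
    static SSLSessionContext tune(SSLContext context, int cacheSize, int timeoutSeconds) {
        SSLSessionContext sessions = context.getServerSessionContext();
        sessions.setSessionCacheSize(cacheSize);    // 0 means no limit
        sessions.setSessionTimeout(timeoutSeconds); // 0 means no limit
        return sessions;
    }

    public static void main(String[] args) throws GeneralSecurityException {
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, null, null); // default key managers, trust managers, and RNG

        SSLSessionContext sessions = tune(context, 5_000, 3_600); // made-up example values
        System.out.println("cache size = " + sessions.getSessionCacheSize()
                + ", timeout = " + sessions.getSessionTimeout() + " s");
    }
}
```

As the rest of this story shows, tuning of this kind treats the symptom; it could not help here because the trigger lay outside the JVM.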
Perplexed! Where is the Root-Cause?
At this point, the customer was stumped and asked for our advice. We started with some obvious sanity checks. Often, a customer says they “haven’t changed anything”, but the eG Enterprise configuration database and historical data tell a very different story! In this case, though, absolutely nothing had changed in the application for several weeks. The application code had been working fine for months. No patches had been deployed at the application level or the OS level. It was all rather mysterious!
Could there have been an SSL attack? No – TCP connection activity had not changed much (see Figure 10):
Figure 10: TCP connection activity to the application
We checked what other alerts had been triggered around the time of the incident. BINGO! We got our clue – TCP retransmissions had increased significantly around the same time (see Figure 11).
Figure 11: TCP retransmissions from the application server – culprit found!
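If you want to check retransmissions yourself, the sketch below shows one way to derive a "segments retransmitted per second" figure. It assumes a Linux guest, where the kernel exposes a cumulative `RetransSegs` counter in `/proc/net/snmp`; the source does not state the VM's operating system, and this is not how eG Enterprise collects the metric. The class name `TcpRetransmits` and the 10-second interval are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Reads the TCP retransmission counter on Linux (illustrative; names are hypothetical). */
class TcpRetransmits {

    /** Extracts the cumulative RetransSegs counter from the lines of /proc/net/snmp. */
    static long retransSegs(List<String> snmpLines) {
        String[] header = null;
        for (String line : snmpLines) {
            if (!line.startsWith("Tcp:")) {
                continue;
            }
            String[] fields = line.trim().split("\\s+");
            if (header == null) {
                header = fields; // the first "Tcp:" line names the columns
                continue;
            }
            for (int i = 1; i < header.length && i < fields.length; i++) {
                if (header[i].equals("RetransSegs")) {
                    return Long.parseLong(fields[i]);
                }
            }
        }
        throw new IllegalArgumentException("RetransSegs not found");
    }

    /** Converts two counter readings into segments retransmitted per second. */
    static double perSecond(long before, long after, double intervalSeconds) {
        return (after - before) / intervalSeconds;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        long before = retransSegs(Files.readAllLines(Paths.get("/proc/net/snmp")));
        Thread.sleep(10_000); // sampling interval
        long after = retransSegs(Files.readAllLines(Paths.get("/proc/net/snmp")));
        System.out.printf("%.1f segments retransmitted/sec%n", perSecond(before, after, 10.0));
    }
}
```

Because the counter is cumulative, it is the rate between two readings that matters, not the absolute value.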
Between 1 and 5 segments retransmitted per second used to be normal for this application. When the problem happened, the value rose to over 200 segments/sec (see Figure 11). That’s at least a 40-fold increase! The rise in TCP retransmissions also coincided closely with the time at which system CPU shot up. eG Enterprise uses historical data within an AIOps engine to set meaningful thresholds for anomaly detection. By learning what is normal for an application or infrastructure, it can often detect anomalies in dynamic environments early.
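To show the principle behind a learned baseline, here is a deliberately simple sketch: it derives an upper bound for "normal" from historical samples (mean plus k standard deviations) and flags values above it. This is a textbook technique swapped in for illustration, not the algorithm used by the eG Enterprise AIOps engine; the class name `BaselineCheck`, the sample values, and k = 3 are all assumptions.

```java
import java.util.Arrays;

/** Simple mean-plus-k-sigma baseline check (illustrative; names and data are hypothetical). */
class BaselineCheck {

    /** Upper bound of "normal": mean plus k population standard deviations of the history. */
    static double upperBound(double[] history, double k) {
        double mean = Arrays.stream(history).average().orElse(0.0);
        double variance = Arrays.stream(history)
                .map(v -> (v - mean) * (v - mean))
                .average()
                .orElse(0.0);
        return mean + k * Math.sqrt(variance);
    }

    /** True if the value lies above the learned upper bound. */
    static boolean isAnomalous(double value, double[] history, double k) {
        return value > upperBound(history, k);
    }

    public static void main(String[] args) {
        // Made-up history in the 1-5 segments/sec range described in the text.
        double[] normal = {1, 3, 2, 5, 4, 2, 3, 1, 4, 5};
        double observed = 200; // segments/sec during the incident

        System.out.printf("upper bound = %.2f, observed = %.0f, anomalous = %b%n",
                upperBound(normal, 3), observed, isAnomalous(observed, normal, 3));
        // Fold increase relative to the top of the normal range (5 segments/sec).
        System.out.printf("fold increase vs. top of normal range: %.0fx%n", observed / 5);
    }
}
```

A static threshold would need to be hand-tuned per metric; deriving the bound from history is what lets the same check work across metrics with very different normal ranges.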
The customer had a premier support contract with the cloud service provider. When they contacted the service provider’s support desk, they were told that there were no issues at their end. When the customer provided the data collected to show the excessive TCP retransmissions, the support desk suggested a system reboot to force the VM to move from one physical host to another.
And just like that, the problem went away! Immediately, TCP retransmissions dropped and CPU usage went down (see Figure 12).
Figure 12: CPU usage of the application dropped immediately after the VM was moved to a new host
As a check, the VM was moved back to the original host, and the problem returned (see Figure 13). Based on this behavior, we suspect that the cause was a malfunctioning NIC on the physical server, or a driver issue on that server, rather than a wider networking issue in the cloud provider’s data center.
Figure 13: TCP retransmissions dropped after the VM was moved from one host to another
Based on the analysis of this problem, the customer submitted a helpdesk report to the cloud provider with evidence from eG Enterprise to get a credit for the hours when the application performance issue had occurred.
The real-life story we have described here highlights the challenges faced by organizations migrating applications to the cloud. Here are my five key takeaways regarding application performance on the cloud:
- Operating applications in a cloud environment is challenging – you no longer have complete visibility. When there is an issue, you often hear “it’s not us” from your cloud service provider. If you are thinking “I will go to the cloud and won’t have any performance issues anymore”, think again.
- You must have full stack visibility. You can’t just monitor the application alone. In a cloud environment, you need as much proof as possible when you speak to your cloud service provider.
- Monitor as many parameters as possible. You never know where you will get a clue to help diagnose a problem. We knew that TCP retransmissions tend to affect application performance; however, we did not anticipate such a huge CPU impact from them. Tools that selectively pick a handful of KPIs and report on them will catch only the obvious issues or the more common root causes, which you might find yourself in a few minutes. You need as much visibility as possible, so that you can provide proof when you contact your cloud service provider.
- Historical insights are extremely important. You are often asked “what changed” – it is important to track config changes and to know what was updated. Your application needs audit logging so that you know what changed within the application. You need to monitor the application config and OS config so that you know what patches, hot fixes, or config changes were made and can correlate any performance issues with them. At the same time, you also need usage and performance baselines for your infrastructure, so that you know what normal usage and performance look like. Several times during our analysis, we checked these statistics and used them for anomaly detection.
- And YES sometimes – It’s not you, it’s the cloud!
- If you enjoyed this Postmortem blog post – you may enjoy this similar one, Troubleshooting Web Application Performance & SSL Issues
- An overview of Java Performance Monitoring Tools, which enable you to prioritize problems automatically and provide actionable notifications
- Top 10 Java Performance Problems – An in-depth guide to the most common Java issues and identifying them
- Monitoring SSL Certificates in Business-Critical Applications (eginnovations.com)
- Section 9 “Monitoring TCP Activity” in our troubleshooting guide details debugging and understanding TCP retransmission issues and their causes, see: Server Performance Monitoring – KPIs & Metrics
- My previous deep-dive post-mortem blog post – debugging slow performance on AWS public cloud burstable instances on EC2, see: AWS EC2 Monitoring Tools | eG Innovations
- More on how eG Enterprise leverages AIOps technologies for anomaly detection: AIOps Tools – 8 Proactive Monitoring Tips | eG Innovations