How a full stack monitoring solution helped our enterprise customer pinpoint the cause of slowness in AWS Cloud

The Cloud Does Not Make Performance Monitoring Simpler

Cloud migration and digital transformation of business-critical applications are on the rise. At the same time, some IT executives hold the misconception that monitoring application performance becomes easier once applications are migrated to the cloud. Because the cloud service provider takes care of most of the key elements that support the application, they believe there are fewer challenges when applications run on the cloud than when they are deployed on-premises.

This is often far from true.

This case study exposes the challenges that organizations deploying applications on the cloud face when performance issues happen and how difficult it can be to diagnose such problems. While migration to the cloud does simplify and automate several routine administration activities, performance monitoring is one area that is not necessarily simplified!

The scenario covered may be useful for those evaluating the level of insight and troubleshooting capabilities various cloud tools can offer.

Target Application and Infrastructure in AWS Cloud

AWS, Java, and SQL Server logos

Our customer is a mid-sized business that offers SaaS services to clients. The SaaS application was Java-based, hosted on the AWS cloud, and relied on AWS RDS for its backend database. The customer used eG Enterprise to monitor the SaaS application and infrastructure.

The application had been operational for several months and there had been no complaints about performance.

The Initial Alert

The CPU usage of the VM hosting the key application spiked suddenly at around 5 am and remained high for several hours:

Figure 1: CPU usage of the VM had spiked suddenly

At this time, application performance also suffered. This was evident from eG Enterprise’s synthetic monitoring capability, which the customer had configured to detect issues proactively, before real users were affected. When the problem started, application response time was poor. Later, gaps appear in the response time graph in Figure 2 because the application became unavailable.

Figure 2: Response time of the application, measured using synthetic monitoring

Below is the graph of application availability over the same period. You can clearly see TCP connection availability dropping to zero several times, indicating that users were unable to connect to the application. This information was also highlighted in critical alerts sent to administrators and displayed on the overview dashboards in eG Enterprise, so that administrators could act on it.

Figure 3: Application availability measured using synthetic monitoring
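
To make the two synthetic measurements above concrete, here is a minimal sketch of a TCP availability probe. It is not how eG Enterprise implements synthetic monitoring; the function names `probe_tcp` and `availability_percent` and the default timeout are assumptions made for illustration. Each probe attempts one TCP connection and records whether it succeeded and how long the connection took.

```python
import socket
import time
from typing import List, Optional, Tuple


def probe_tcp(host: str, port: int, timeout_s: float = 3.0) -> Tuple[bool, Optional[float]]:
    """Attempt one TCP connection and return (available, connect_time_ms)."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            return True, elapsed_ms
    except OSError:
        # Refused, timed out, or unreachable: all count as "unavailable".
        return False, None


def availability_percent(results: List[Tuple[bool, Optional[float]]]) -> float:
    """Percentage of probes in which the connection succeeded."""
    if not results:
        raise ValueError("no probe results")
    succeeded = sum(1 for available, _ in results if available)
    return 100.0 * succeeded / len(results)
```

Run on a schedule, a probe like this yields gaps in the response time series (no time is recorded when the connection fails) and an availability value that drops to zero, which is the pattern visible in Figures 2 and 3.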

Analyzing Application Performance in Detail

To investigate further, we analyzed application performance in detail. Figure 4 below shows the CPU usage of the JVM (Java Virtual Machine) used by the application on the problematic VM. The JVM CPU usage follows the same pattern as the VM’s CPU usage, indicating that the application’s JVM had been affected.

Figure 4: CPU usage of the application’s JVM

High CPU usage within the JVM can have numerous causes, and it is often a symptom of bugs in the application’s Java code. High CPU is a symptom, not a root cause. Causes range from host OS issues and poorly sized JVMs to memory leaks and code-level deadlocks. My colleague has written a blog covering some common problems; see: How to Troubleshoot Java CPU Usage Issues | JVM High CPU Threads (eginnovations.com).

The JVM garbage collector (GC) can also be a source of high CPU, especially if application memory leaks are at play. A JVM that knows it is running low on key resources, such as heap memory, can get into a state where it keeps desperately trying to reclaim memory. A quick check eliminated the possibility of a GC issue in this case. Figure 5 shows the historical GC activity in the JVM: the percentage of time the JVM spent on GC had not changed significantly during the problematic period. Hence, GC activity was not the reason the application’s CPU usage had spiked.

Figure 5: GC activity of the Java application indicated no anomalies
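
The "percentage of time spent on GC" metric can be derived from a cumulative GC time counter, such as the accumulated collection time the JVM reports through its `GarbageCollectorMXBean`. The sketch below assumes samples have already been collected as `(wall_clock_ms, cumulative_gc_ms)` pairs; the function name and the sample layout are illustrative assumptions, not eG Enterprise's implementation.

```python
from typing import List, Tuple


def gc_time_percent(samples: List[Tuple[float, float]]) -> List[float]:
    """Percentage of each sampling interval that was spent in GC.

    samples: (wall_clock_ms, cumulative_gc_ms) pairs in time order.
    """
    percents = []
    for (t0, gc0), (t1, gc1) in zip(samples, samples[1:]):
        interval_ms = t1 - t0
        if interval_ms <= 0:
            raise ValueError("timestamps must be strictly increasing")
        # The counter is cumulative, so the delta is the GC time in this interval.
        percents.append(100.0 * (gc1 - gc0) / interval_ms)
    return percents
```

For example, 600 ms of GC in a 60-second interval is 1% of the time. A flat series like this during a CPU spike is what rules GC out as the cause.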

Java threading issues are a common cause of application problems. So, the next step was to check whether any common threading issues were present and, if so, whether they were root causes or simply symptoms of issues elsewhere. Unfortunately, both are common possibilities!

Figure 6 shows the total number of threads in the JVM over time. There had been an increase, but not a significant one. This indicated that application processing in the JVM was not the issue.

Figure 6: Tracking the number of threads in the application’s JVM

The Java Thread Analysis modules within eG Enterprise further indicated that no single thread in the JVM was consuming a lot of CPU. Instead, there were many threads, each taking a small amount of CPU (1-2%), and together they drove the overall CPU usage up.

Figure 7: Details of threads in the JVM
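
The distinction between "one hot thread" and "many small consumers" is simple arithmetic, sketched below. The function name, the `hot_threshold` default, and the input shape (a mapping of thread name to CPU percentage) are assumptions chosen for illustration.

```python
from typing import Dict


def summarize_thread_cpu(cpu_by_thread: Dict[str, float], hot_threshold: float = 25.0) -> dict:
    """Summarize per-thread CPU usage (values are percentages)."""
    if not cpu_by_thread:
        raise ValueError("no threads supplied")
    busiest_name, busiest_cpu = max(cpu_by_thread.items(), key=lambda kv: kv[1])
    return {
        "total_cpu_percent": sum(cpu_by_thread.values()),
        "busiest_thread": busiest_name,
        "busiest_cpu_percent": busiest_cpu,
        # True only when a single thread by itself crosses the threshold.
        "single_hot_thread": busiest_cpu >= hot_threshold,
    }
```

With 60 threads at 1.5% each, the total is 90% even though no individual thread stands out, so a "top thread" view alone would not explain the spike.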

The auto-baselined thresholds for blocked threads had, however, triggered critical alerts when the problem occurred. Reviewing the historical data, we could see significant thread blocking in the JVM, which was abnormal for this application. A quick check in the detailed diagnosis tool in eG Enterprise (see Figure 8) revealed issues in SSL processing at the JVM level, not in the application code.

Figure 8: Detailed diagnosis shows that thread blocking happened in the JVM

Clicking through to identify the blocking thread showed that SSL memory caching in the JVM seemed to have triggered the issue (see Figure 9). Many Java threads were stuck in SSL processing, and it was this that was consuming excessive CPU.

Figure 9: Thread blocking caused by synchronized access to the SSL Memory Cache in the JVM
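
If you only have a plain-text thread dump rather than a monitoring tool, the same question ("where are the blocked threads stuck?") can be approximated by grouping BLOCKED threads by their top stack frame. The sketch below assumes the common HotSpot text layout (a quoted thread name, a `java.lang.Thread.State:` line, then `at ...` frames). The function name is hypothetical, and it does not reproduce eG Enterprise's detailed diagnosis.

```python
import re
from collections import Counter

THREAD_HEADER = re.compile(r'^"[^"]+"')
STATE_LINE = re.compile(r"java\.lang\.Thread\.State:\s+(\w+)")
FRAME_LINE = re.compile(r"^\s+at\s+([\w.$<>]+)\(")


def blocked_top_frames(dump: str) -> Counter:
    """Count BLOCKED threads, keyed by the method at the top of their stack."""
    counts: Counter = Counter()
    want_frame = False
    for line in dump.splitlines():
        if THREAD_HEADER.match(line):
            want_frame = False  # a new thread section starts
            continue
        state = STATE_LINE.search(line)
        if state:
            want_frame = state.group(1) == "BLOCKED"
            continue
        frame = FRAME_LINE.match(line)
        if frame and want_frame:
            counts[frame.group(1)] += 1
            want_frame = False  # only the top frame is counted
    return counts
```

In a dump taken during an incident like this one, `blocked_top_frames(dump).most_common(1)` would be expected to surface `sun.security.util.MemoryCache.put`, the method named in the text.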

Simply googling the blocked class and method, “sun.security.util.MemoryCache.put”, turns up a number of links that point to known Java issues.

Vendor-recommended changes, such as setting different cache sizes and timeouts, were attempted, but they did not resolve the problem.

Perplexed! Where is the Root-Cause?

At this point, the customer was stumped and asked for our advice. We performed some obvious sanity checks. Often, a customer says they “haven’t changed anything”, but the eG Enterprise configuration database and historical data tell a very different story! In this case, though, absolutely nothing had changed in the application for several weeks. The application code had been working fine for months. No patches had been deployed at the application level or the OS level. It was all rather mysterious!

Could there have been an SSL attack? No: TCP connection activity had not changed much (see Figure 10):

Figure 10: TCP connection activity to the application

We checked what other alerts had been triggered around the time of the incident. BINGO! We got our clue – TCP retransmissions had increased significantly around the same time (see Figure 11).

Figure 11: TCP retransmissions from the application server – culprit found!
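
For readers who want to check this metric by hand, the sketch below shows one way to derive a retransmission rate on a Linux host, where `/proc/net/snmp` exposes a cumulative `RetransSegs` counter in its `Tcp:` section. The article does not state the VM's operating system, so Linux is an assumption here, and the function names are hypothetical. The parser takes the file contents as a string so that it can be used on saved samples.

```python
def parse_tcp_counters(snmp_text: str) -> dict:
    """Parse the 'Tcp:' header and value lines of /proc/net/snmp into a dict."""
    tcp_lines = [ln.split() for ln in snmp_text.splitlines() if ln.startswith("Tcp:")]
    if len(tcp_lines) != 2:
        raise ValueError("expected one header line and one value line for Tcp")
    names, values = tcp_lines
    # Skip the leading 'Tcp:' token on both lines.
    return {name: int(value) for name, value in zip(names[1:], values[1:])}


def retrans_per_second(before: dict, after: dict, interval_s: float) -> float:
    """Retransmitted segments per second between two counter samples."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return (after["RetransSegs"] - before["RetransSegs"]) / interval_s
```

Two samples taken 10 seconds apart whose `RetransSegs` values differ by 2,000 correspond to 200 segments/sec, the order of magnitude seen during this incident.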

Normally, 1 to 5 segments were retransmitted per second for this application. When the problem happened, the value rose to over 200 segments/sec (see Figure 11). That is at least a 40-fold increase over the upper end of the normal range! The rise in TCP retransmissions also coincided closely with the time at which system CPU shot up. eG Enterprise uses historical data within an AIOps engine to set meaningful thresholds for anomaly detection. By learning what is normal for an application or infrastructure, it can often detect anomalies within dynamic environments early.
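
The idea of learning "normal" from history and flagging departures from it can be illustrated with a deliberately simple baseline: the historical mean plus a multiple of the standard deviation. This is a stand-in technique chosen for illustration; the article does not describe the algorithm inside eG Enterprise's AIOps engine, and the function names and the factor `k` are assumptions.

```python
from statistics import mean, pstdev
from typing import Sequence


def learn_baseline(history: Sequence[float], k: float = 3.0) -> dict:
    """Derive an upper threshold from historical values: mean + k * std dev."""
    mu = mean(history)
    sigma = pstdev(history)
    return {"mean": mu, "upper": mu + k * sigma}


def is_anomalous(value: float, baseline: dict) -> bool:
    """True when the value exceeds the learned upper threshold."""
    return value > baseline["upper"]


def fold_increase(observed: float, normal: float) -> float:
    """How many times larger the observed value is than the normal one."""
    if normal <= 0:
        raise ValueError("normal value must be positive")
    return observed / normal
```

With a history of 1 to 5 segments/sec, the learned upper threshold sits a little above 7, so a reading of 200 is flagged while a reading of 4 is not. `fold_increase(200, 5)` gives the 40-fold figure relative to the top of the normal range.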

The customer had a premier support contract with the cloud service provider. When they contacted the service provider’s support desk, they were told that there were no issues at their end. When the customer provided the data collected to show the excessive TCP retransmissions, the support desk suggested a system reboot to force the VM to move from one physical host to another.

And just like that, the problem went away! Immediately, TCP retransmissions dropped and CPU usage went down (see Figure 12).

Figure 12: CPU usage of the application dropped immediately after the VM was moved to a new host

Just to check, the VM was then moved back to the original host, and the problem returned (see Figure 13). Based on this behavior, we suspect a malfunctioning NIC on the physical server, or a driver issue on that server, rather than a wider networking issue in the cloud provider’s data center.

Figure 13: TCP retransmissions dropped after the VM was moved from one host to another
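
The way CPU usage tracked TCP retransmissions throughout this incident (both rising together, dropping together, and returning together) can be quantified with a Pearson correlation coefficient over the two time series. The sketch below is a generic implementation, not something taken from eG Enterprise. A value near 1 only shows that the metrics move together; it does not prove which one causes the other.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

In this case, moving the VM between hosts provided the stronger evidence: it changed one factor and both metrics followed.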

Based on the analysis of this problem, the customer submitted a helpdesk report to the cloud provider with evidence from eG Enterprise to get a credit for the hours when the application performance issue had occurred.

Conclusion

The real-life story described here highlights the challenges faced by organizations migrating applications to the cloud. Here are my five key takeaways regarding application performance on the cloud:

  1. Operating applications in a cloud environment is challenging because you no longer have complete visibility. When there is an issue, you often hear “it’s not us” from your cloud service provider. If you are thinking “I will go to the cloud and won’t have any performance issues anymore”, think again.
  2. You must have full stack visibility. You can’t just monitor the application alone. In a cloud environment, you need as much proof as possible when you speak to your cloud service provider.
  3. Monitor as many parameters as possible. You never know where you will get a clue to help diagnose a problem. We knew TCP retransmissions tend to affect application performance; however, we did not anticipate such a huge CPU impact from them. Tools that selectively pick a handful of KPIs and report on them will only catch the obvious issues or the more common root causes, which you might find yourself in a few minutes. You need as much visibility as possible, so that you can provide proof when you contact your cloud service provider.
  4. Historical insights are extremely important. You are often asked “what changed”, so it is important to track config changes and to know what was updated. Your application needs audit logging so that you know what changed within the application. You need to monitor the application config and OS config so that you know what patches, hot fixes, or config changes were made and can correlate any performance issues with them. At the same time, you also need usage and performance baselines for your infrastructure so that you know what normal usage and performance look like. Several times during our analysis, we checked these statistics and used them for anomaly detection.
  5. And YES sometimes – It’s not you, it’s the cloud!
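
As an illustration of the fourth takeaway, answering "what changed?" comes down to comparing configuration snapshots taken at different times. The sketch below assumes each snapshot is a flat dictionary of setting name to value. The function name and the keys in the example are hypothetical, and the sketch does not reflect how eG Enterprise stores configuration history.

```python
def diff_snapshots(old: dict, new: dict) -> dict:
    """Report settings that were added, removed, or changed between snapshots."""
    added = {key: new[key] for key in new.keys() - old.keys()}
    removed = {key: old[key] for key in old.keys() - new.keys()}
    changed = {
        key: (old[key], new[key])
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    }
    return {"added": added, "removed": removed, "changed": changed}
```

For example, `diff_snapshots({"os.patch_level": "A"}, {"os.patch_level": "B"})` reports the patch level under `"changed"`. An empty result for the weeks before an incident is what supports the claim that nothing had changed.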

Further reading