How to Troubleshoot Java Application Slowness: A Real-World Example

In this blog, I’ll cover a real-world example of performance troubleshooting for a Java web application hosted on JBoss WildFly with Microsoft SQL Server as the backend database, including the analysis and diagnosis we had to perform in order to identify and resolve the root cause of the performance issue. The symptoms included high CPU usage in the JBoss JVM, elevated garbage collection activity, and database connection problems including timeouts. The issue was ultimately isolated to the JDBC database driver used by the application server, not the database server itself. The full analysis below should give the reader insight into how to identify and eliminate similar issues, a deeper understanding of the full stack between the JVM and the database, and an appreciation of how even a small component of an application delivery chain, such as a driver, can significantly affect application performance.

Application Performance was Poor: Where was the Bottleneck?

The application in question was a Java-based web application, deployed on JBoss WildFly as the web application server and using Microsoft SQL Server as the backend database. The database had been configured in an active/passive cluster for high availability. The application ran on Java 12 and used jTDS as the JDBC driver to access the database.

The application had been newly deployed in production and had worked acceptably until user load increased. As load grew, the application became slower and slower. The customer’s IT team did not have the expertise to diagnose the problem, so our professional services team was engaged to help identify the cause of the slowdown.

Synthetic monitoring of the web application using eG Enterprise indicated that there was a problem. While web response times were acceptable, many connection timeouts were seen, and as a result the JBoss web server availability metrics indicated problems occurring multiple times a day (see Figure 1).


Figure 1: Web front-end performance, as measured by synthetic monitoring over 4 days

Troubleshooting Whether the Application Tiers were Slow

Agents were deployed on all the key tiers of the application to get an inside view of performance. JMX (Java Management Extensions) was used to track Java virtual machine performance. CPU usage of the JBoss JVM spiked significantly as load increased (see Figure 2).


Figure 2: Measurement of the CPU usage of the Java virtual machine of the JBoss application server

Analysis of the JVM’s garbage collectors indicated that garbage collection was taking more time than before. Garbage collection activity can also increase if there is a memory leak in the application; however, our analysis found no evidence of a memory leak. Excessive garbage collection can drive up a JVM’s CPU usage, so the high CPU usage could well have been an effect, with garbage collection as the trigger. The next question was why GC activity had gone up.
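For readers who want to reproduce this kind of check, the same GC counters exposed over JMX can be sampled in-process with the standard GarbageCollectorMXBean API. This is a minimal sketch, not the monitoring agent’s actual code; the 10-second interval is an arbitrary choice:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Minimal sketch: periodically sample each garbage collector's cumulative
// collection count and time. A steady rise in collection time relative to
// wall-clock time is the kind of trend shown in Figure 3.
public class GcProbe {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: %d collections, %d ms total%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(10_000); // sampling interval is an arbitrary choice
        }
    }
}
```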


Figure 3: Garbage collection activity inside the JVM

Around the time CPU usage and GC activity increased, we also observed a significant rise in the number of connections from the application server to the database server. During normal operation, only a small number of connections (10 or so) were used. But when the application slowdown happened, almost a thousand connections to the database were in use. This seemed to indicate a problem in either the database tier or the communication with that tier.


Figure 4: Database connections from the JBoss tier to the SQL database server


Is the Database Tier Slow?

When this finding was reported, the database team was adamant that the database tier itself was functioning correctly. None of the monitoring and diagnostic tools they were using indicated any slowness on their database servers.

The SQL Server database had been configured in an active/passive cluster and SSL was enabled for all connections. Domain authentication was configured so that only domain-authenticated users could access the database server.


Figure 5: CPU utilization of the database cluster shows there is no resource bottleneck in the database tier

Monitoring of resource utilization on the database servers confirmed that the database team was correct and provided data to validate their conclusions. CPU utilization was well below 1% (see Figure 5). There was also no evidence of long-running queries, query blocking, or high I/O activity such as full table scans. So where was the issue? This was the perplexing question.

External monitoring of the database server gave a clue (see Figure 6). It showed that when a connection to the database server was initiated, a query issued, and the response processed, over 99% of the total time was spent in connection establishment. The query itself executed very quickly. This came as a surprise to all the teams involved.


Figure 6: Breakdown of database response time as shown in the eG Enterprise console. You can see that the connection establishment time is several times higher than the query processing time.
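The breakdown in Figure 6 can be approximated from the client side by timing connection establishment and query execution separately. Below is a minimal sketch assuming the jTDS driver is on the classpath; the host, database, credentials, and query are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Minimal sketch: time how long it takes to open a connection versus how
// long the query itself takes. In our case, the connect phase dominated.
public class ConnectVsQueryTimer {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:jtds:sqlserver://dbhost:1433/AppDB"; // placeholder host/database
        long t0 = System.nanoTime();
        try (Connection con = DriverManager.getConnection(url, "user", "password")) {
            long t1 = System.nanoTime();
            try (Statement st = con.createStatement()) {
                st.executeQuery("SELECT 1").close(); // placeholder query
            }
            long t2 = System.nanoTime();
            System.out.printf("connect: %d ms, query: %d ms%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        }
    }
}
```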

Where was the Problem Then?

Although the database tier was working well, the application was using more than a thousand connections to access it. To investigate further, we checked the status of the Java threads in the application server’s JVM. Several hundred threads were stuck waiting on the jTDS JDBC driver (see Figure 7). Reading the stack traces bottom-up, we could see that when the application used the JDBC driver to connect to the database, the driver issued a call to the getMAC function, which in turn used Java’s NetworkInterface class. All of these Java threads on the application server were waiting on the Java NetworkInterface class.


Figure 7: Several hundred threads in the application server were waiting to establish connections to the database tier

The Java thread analysis corroborated our other evidence: for some reason, connecting to the database server was taking far more time than executing a query on the database tier. The fact that the Java threads were waiting on a local method call (java.net.NetworkInterface.getByInetAddress0) and not on the database server indicated that the performance bottleneck was at the application server end, i.e., within the JDBC driver used by the application server and not in the database server itself.
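The same evidence can be gathered with a thread dump (for example, via jstack) or programmatically. The sketch below is illustrative rather than the tooling we used; it counts the threads that currently have java.net.NetworkInterface on their stack:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Minimal sketch: dump all live threads in this JVM and count how many are
// currently executing inside java.net.NetworkInterface - the frame that
// hundreds of our application server threads were stuck in.
public class StuckThreadScanner {
    public static void main(String[] args) {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        int stuck = 0;
        for (ThreadInfo info : threadBean.dumpAllThreads(false, false)) {
            for (StackTraceElement frame : info.getStackTrace()) {
                if ("java.net.NetworkInterface".equals(frame.getClassName())) {
                    stuck++;
                    break;
                }
            }
        }
        System.out.println(stuck + " thread(s) inside java.net.NetworkInterface");
    }
}
```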

Diagnosing and Resolving the Application Performance Issue

We were able to identify the issue by analyzing the JDBC driver’s code. Every time a connection to the database was required, the socket’s local address was obtained, and through it the local network interface (see line 296 of Figure 8). The hardware address of this network interface was then determined and communicated back to the database server (see line 297 of Figure 8). This was done because SQL Server is licensed by the unique clients accessing it, and the database driver identifies the client using the MAC address of the network interface it is using.


Figure 8: The jTDS database driver code that caused the performance bottleneck
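In essence, the driver’s connect path followed the pattern below. This is a simplified sketch of what Figure 8 shows, not the literal jTDS source:

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.Socket;
import java.net.SocketException;

// Simplified sketch of the driver's connect-time pattern: resolve the
// socket's local address to a network interface, then read its hardware
// (MAC) address to send to SQL Server during login.
public class MacLookup {
    static byte[] getMac(Socket socket) throws SocketException {
        InetAddress local = socket.getLocalAddress();
        // Scans the host's network interfaces; the cost of this lookup
        // grows with the number of interfaces on the client machine.
        NetworkInterface nic = NetworkInterface.getByInetAddress(local);
        return nic == null ? null : nic.getHardwareAddress();
    }
}
```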

What we discovered was that if the system on which the jTDS driver was executing (i.e., the client system accessing the database server) had only a few network interfaces, finding the MAC address took just a few milliseconds. However, as the number of network interfaces on the client system increased, so did the time taken to find the MAC address!
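This effect is easy to measure. Here is a minimal, illustrative timing harness for the lookup in question, using the local host address as a stand-in for the socket’s local address:

```java
import java.net.InetAddress;
import java.net.NetworkInterface;

// Minimal sketch: time a single NetworkInterface.getByInetAddress() call.
// On hosts with many network interfaces this lookup slows down markedly,
// which is the effect we hit on every new database connection.
public class NicLookupTimer {
    public static void main(String[] args) throws Exception {
        InetAddress local = InetAddress.getLocalHost(); // stand-in for socket.getLocalAddress()
        long t0 = System.nanoTime();
        NetworkInterface nic = NetworkInterface.getByInetAddress(local);
        long t1 = System.nanoTime();
        System.out.printf("getByInetAddress took %.2f ms (interface: %s)%n",
                (t1 - t0) / 1_000_000.0, nic == null ? "none" : nic.getName());
    }
}
```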

So the performance bottleneck was in the database driver used by the application server and not in the database server itself!

A closer look at the code revealed that the IP-address-to-hardware-address resolution was done every time a new connection was established. The application server used database connection pooling. Under normal circumstances, connections were pre-loaded into the pool and application performance was good, because connection establishment happened when the application started, not when users accessed the application. When the application server came under high load, additional connections were needed, and it was at this point that new connections to the database server were attempted. It was only when these additional requests had to be processed that database connection establishment became a problem, and this was what caused the slowness in application access.

To resolve the problem, the database driver code was modified: a mapping of local IP address to hardware address was maintained in an in-memory cache in the driver. Since the application used the same local address each time, the hardware address was served from the cache after the first lookup. This saved significant time during database connection establishment.
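Conceptually, the fix looked like the sketch below. This is our illustration of the approach, not the patched driver code itself; an empty array serves as a sentinel because ConcurrentHashMap cannot store null values:

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the fix: cache the local-address-to-MAC mapping so the
// expensive interface scan runs at most once per local address instead of
// once per database connection.
public class MacCache {
    private static final byte[] UNKNOWN = new byte[0]; // sentinel: lookup failed or no MAC
    private static final Map<InetAddress, byte[]> CACHE = new ConcurrentHashMap<>();

    static byte[] getMac(InetAddress localAddress) {
        return CACHE.computeIfAbsent(localAddress, addr -> {
            try {
                NetworkInterface nic = NetworkInterface.getByInetAddress(addr);
                byte[] mac = (nic == null) ? null : nic.getHardwareAddress();
                return (mac == null) ? UNKNOWN : mac;
            } catch (SocketException e) {
                return UNKNOWN;
            }
        });
    }
}
```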

Another option to address the same problem would have been to determine the MAC address manually and specify it in the jTDS connection URL. The jTDS URL format, and how the MAC address can be provided in the URL itself, is shown below.

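Per the jTDS documentation, the general URL format is shown below, and the driver accepts a macAddress property (12 hexadecimal digits) so the value can be supplied explicitly instead of being looked up at connect time. The server, port, database, and MAC value here are placeholders:

```
jdbc:jtds:<server_type>://<server>[:<port>][/<database>][;<property>=<value>[;...]]

jdbc:jtds:sqlserver://dbhost:1433/AppDB;ssl=require;macAddress=00A0C9123456
```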

After our change, you can see from Figure 9 that response times for accessing the SQL database cluster decreased significantly. There were also fewer SQL connection problems, as shown by SQL availability remaining at 100%.


Figure 9: Performance of the SQL database cluster after the JDBC driver change. Notice the significant reduction in availability issues and in database response time

Improvements were seen in the web application’s availability as well. Notice from Figure 10 that there were no application availability drops after the JDBC driver modification discussed above. The number of simultaneous database connections in use also dropped significantly, and this directly resulted in a drop in CPU utilization of the application’s JVM (see Figure 11).


Figure 10: Web application access availability also improved after the JDBC driver change


Figure 11: Significant reduction in CPU time used by the application server’s JVM


End-to-End Monitoring: A Must for Effectively Resolving Performance Issues

The real-world example I have covered in this blog highlights the challenges in diagnosing and resolving performance issues with multi-tier applications.

The slowdown in this example resulted from an inefficiency in the JDBC database driver, an inconspicuous component in the application architecture. Yet, the impact of this inconspicuous component on application performance was huge.

Modern applications make use of many third-party libraries, drivers, modules, public domain components, and so on. As we saw in our example, inefficiencies in any of these components can greatly impact application performance. Pinpointing the application component responsible for a problem is key to addressing it. To effectively monitor and troubleshoot application performance issues, every component of the application delivery chain must be monitored and performance anomalies highlighted.

In this example, we have shown how monitoring several key metrics, together with thread-level insights, was important in identifying the problem. Application transaction tracing for Java, Microsoft .NET, and PHP applications can also provide great insight into any code-level, database, or third-party component bottlenecks that impact application performance.

eG Enterprise includes extensive monitoring capabilities for more than 200 application technologies and a large number of underlying infrastructure platforms. To view a complete list of supported platforms, see https://www.eginnovations.com/it-monitoring/technologies


Figure 12: How eG Enterprise works

Figure 12 highlights how eG Enterprise works. Performance metrics are collected by different means: from logs, WMI, performance counters, APIs, and so on. These metrics are analyzed and any performance anomalies are flagged. An AIOps engine analyzes the anomalies from each tier of the application delivery chain and correlates them to determine where the root cause of the problem lies. Intelligent, automated diagnosis makes it possible for even help desk staff to effectively triage problems and get the right domain experts involved. Doing so reduces the mean time to repair (MTTR) and improves application availability and user satisfaction.

eG Enterprise is an Observability solution for Modern IT. Monitor digital workspaces, web applications, SaaS services, cloud and containers from a single pane of glass.

For More Information: