Troubleshooting Java Application Deadlocks – Diagnosing ‘Application Hang’ situations

Users expect applications to respond instantly. Deadlocks in Java applications cause ‘application hang’ situations that lead to unresponsive systems and a poor user experience.

This blog post explains what deadlocks are, the consequences they have, and the options available to diagnose them.

In a subsequent blog post, we’ll explore how the eG Java Monitor helps in pinpointing deadlock root causes down to the code level.

 

A typical production scenario

It is 2 am and you are woken up by a phone call from the helpdesk team. The helpdesk is receiving a flood of calls from application users. The application is reported to be slow and sluggish. Users are complaining that the browser keeps spinning and eventually all they see is a ‘white page’.

Still somewhat heavy-eyed, you go through the ‘standard operating procedure’. You notice that no TCP traffic is flowing to or from the app server cluster. The application logs aren’t moving either.

You are wondering what could be wrong when the VP (Operations) pings you over Instant Messenger asking you to join a war room conference call. You will be asked to provide answers and pinpoint the root cause – fast.

What are Java application deadlocks?

A deadlock occurs when two or more threads form a cyclic dependency on each other as shown below.

In this illustration, ‘thread 2’ is in a wait state, waiting on Resource A owned by ‘thread 1’, while ‘thread 1’ is in a wait state, waiting on Resource B owned by ‘thread 2’.

In such a condition, these two threads are ‘hanging’ indefinitely without making any further progress.

This results in an “application hang” where the process is still present but the system does not respond to user requests.
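To make this concrete, here is a minimal, self-contained Java sketch of the scenario described above: two threads acquire two locks in opposite order and end up waiting on each other forever. The class, thread, and lock names are purely illustrative.

```java
public class DeadlockDemo {
    private static final Object resourceA = new Object();
    private static final Object resourceB = new Object();

    public static void main(String[] args) {
        // Thread 1 locks Resource A first, then tries to lock Resource B.
        Thread thread1 = new Thread(() -> {
            synchronized (resourceA) {
                sleepQuietly(100); // give thread 2 time to grab Resource B
                synchronized (resourceB) {
                    System.out.println("thread 1 acquired both locks");
                }
            }
        }, "thread-1");

        // Thread 2 locks Resource B first, then tries to lock Resource A.
        Thread thread2 = new Thread(() -> {
            synchronized (resourceB) {
                sleepQuietly(100); // give thread 1 time to grab Resource A
                synchronized (resourceA) {
                    System.out.println("thread 2 acquired both locks");
                }
            }
        }, "thread-2");

        thread1.start();
        thread2.start();
        // Neither println is ever reached: each thread holds one lock
        // and waits indefinitely for the lock held by the other.
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The fix in cases like this is usually to make all threads acquire the locks in a consistent order, but in a large codebase the inconsistent ordering is rarely this obvious.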

“The JVM is not nearly as helpful in resolving deadlocks as database servers are. When a set of Java threads deadlock, that’s the end of the game. Depending on what those threads do, the application may stall completely.”

– Brian Goetz et al., authors of “Java Concurrency in Practice”

Consequences of deadlocks

1. Poor user experience

When a deadlock happens, the application may stall. Typical symptoms are ‘white pages’ in web applications: the browser continues to spin until the request eventually times out.

Often, users retry their request by clicking refresh or re-submitting a form, which compounds the problem further.

2. System undergoes exponential degradation

When threads go into a deadlock, the requests they are serving stop making progress. In the meantime, fresh requests keep arriving into the system.

When deadlocks manifest in app servers, fresh requests get backed up in the ‘execution queue’. The thread pool hits maximum utilization, so new requests cannot be served, and the degradation of the system compounds.

 3. Cascading impact on the entire app server cluster

In multi-tier applications, Web Servers (such as Apache or IBM HTTP Server) receive requests and forward them to Application Servers (such as WebLogic, WebSphere or JBoss) via a ‘plug-in’.

If the plug-in detects that an Application Server is unhealthy, it will “fail over” to another healthy application server, which then accepts a heavier load than usual, resulting in further slowness.

This may cause a cascading slowdown effect on the entire cluster.

Why are deadlocks difficult to troubleshoot in a clustered, multi-tier environment?

Application support teams are usually caught off-guard when faced with deadlocks in production environments.

1. Deadlocks typically do not exhibit obvious symptoms such as a spike in CPU or memory usage. This makes them hard to track and diagnose based on basic operating system metrics.

2. Deadlocks may not show up until the application is moved to a clustered environment. Single application server environments may not manifest latent defects in the code.

3. Deadlocks usually manifest at the worst possible time: under heavy production load. They may not show up under low or moderate load, and for the same reason they are difficult to replicate in a testing environment.

Options to diagnose deadlocks

There are various options available to troubleshoot deadlock situations.

1. The naïve way: Kill the process and cross your fingers

You could kill the application server process and hope that when the application server starts again, the problem will go away.

However, restarting the app server is a temporary fix that does not resolve the root cause. Deadlocks will get triggered again once the app server comes back.

 

2. The laborious way: Take thread dumps in your cluster of JVMs

You could take thread dumps. To trigger a thread dump, you send the JVM a SIGQUIT signal (on UNIX, that is a “kill -3” command; on Windows, a “Ctrl-Break” in the console).

Typically, you would need to capture a series of thread dumps (for example, 6 thread dumps spaced 20 seconds apart) to infer any thread patterns – a single static thread dump snapshot may not suffice.

If you are running the application server as a Windows service (which is usually the case), it is a little more complicated. If you are running the HotSpot JVM, you can use the jps utility to find the process id and then use the jstack utility to take thread dumps. You can also use the jconsole utility to connect to the process in question.
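As a complement to jstack and jconsole, the JVM’s standard java.lang.management API exposes the same deadlock detection programmatically. The sketch below assumes it runs inside the affected JVM (for example, from a health-check endpoint or a scheduled task); the class name and output format are illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockDetector {
    public static void main(String[] args) {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();

        // findDeadlockedThreads() returns the ids of threads that are
        // deadlocked on object monitors or java.util.concurrent locks,
        // or null if no deadlock is currently detected.
        long[] deadlockedIds = threadMXBean.findDeadlockedThreads();

        if (deadlockedIds == null) {
            System.out.println("No deadlocked threads detected.");
            return;
        }

        // Fetch details for each deadlocked thread (including held monitors
        // and synchronizers) and report who is blocking whom.
        ThreadInfo[] infos = threadMXBean.getThreadInfo(deadlockedIds, true, true);
        for (ThreadInfo info : infos) {
            System.out.printf("Thread '%s' is waiting on %s held by '%s'%n",
                    info.getThreadName(),
                    info.getLockName(),
                    info.getLockOwnerName());
        }
    }
}
```

Because this runs in-process, it only reports on the JVM it runs in; in a cluster you would still have to deploy or trigger it on every node.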

You would have to forward the thread dumps to the development team and wait for them to analyze the dumps and get back to you. Depending on the size of the cluster, there could be multiple files to trawl through, and this might entail significant time.

This is not a situation you want to be in at 2 am when the business team is waiting on a quick resolution.

Caution: Manual processes to troubleshoot deadlocks can be time-consuming

The manual approach of taking thread dumps assumes that you know which JVM(s) are suffering from deadlocks.

Chances are that the application is hosted in a high-availability, clustered Application Server farm with tens (if not hundreds) of servers.

If only a subset of the JVMs is undergoing the deadlock problem, you may not know precisely which JVM is experiencing thread contention or deadlocks. You would have to resort to taking thread dumps across all of your JVMs.

This becomes a trial-and-error approach that is laborious and time-consuming. While it may be viable in a development or staging environment, it is not viable for a business-critical production environment where ‘Mean Time To Repair’ (MTTR) is key.

 

3. The smart way: Leverage an APM

While an APM (Application Performance Management) product cannot prevent deadlocks from happening, it can certainly provide deeper visibility into the root cause, down to the code level, when they do happen.

In the next blog post, we’ll explore how the eG Java Monitor can help provide an end-to-end perspective of the system in addition to pinpointing the root cause for deadlocks.

Stay tuned!

 

About the Author

Arun Aravamudhan has more than 15 years of experience in the Java enterprise applications space. In his current role with eG Innovations, Arun leads the Java Application Performance Management (APM) product line.