Java – How to avoid halting the JVM due to a deadlock in Java

concurrency java

I have read many websites that discuss how to avoid deadlocks and how to design around them. I completely understand those strategies.

My question is based on the following preconditions:

You have a company with thousands of developers.
Different teams work on the same product, each on its own module.
New developers write new code without knowing the overall system (think of an enterprise application).
The software must be highly available: 15 minutes of downtime counts as an SLA violation.

I could write a few more preconditions, but I think these are strong enough to support my question about why I might need a recovery strategy for a deadlock in software.

Please note that re-designing the modules whenever we find a deadlock is not realistic.

That being said:

Can someone take some time to provide input, or brainstorm an idea, on how to resolve a deadlock if one does happen, so that we can report it and move forward instead of halting completely?

1. Run a deadlock detector periodically to look for deadlocks in the system.
2. If a deadlock is detected, raise an event so that it can be resolved.
3. The deadlock event listener then kicks in and acts upon the deadlocked threads.
4. For each thread, identify the contention.
5. Write an intelligent algorithm that either releases the locks and kills the thread, or releases the locks and re-evaluates the thread.
6. The notification from step 2 is handled by multiple listeners, one of which does the logging.

I know how to go about steps 1, 2 and 6. I need help with steps 3, 4 and 5.
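For reference, the detection and "identify the contention" parts (steps 1 and 4) can be sketched with the JDK's own `java.lang.management` API. This is only a minimal sketch: the class name `DeadlockReporter` and its helper methods are hypothetical, and it assumes a JVM that supports monitoring of ownable synchronizers (otherwise `findDeadlockedThreads()` throws `UnsupportedOperationException`). It only reports which thread waits for which lock and who holds it; it does not attempt steps 3 or 5.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

final class DeadlockReporter {
    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    /** Steps 1 and 4: one line per deadlocked thread, empty if none are found. */
    static List<String> describeDeadlocks() {
        List<String> report = new ArrayList<>();
        long[] ids = THREADS.findDeadlockedThreads(); // null when no deadlock exists
        if (ids == null) {
            return report;
        }
        for (ThreadInfo info : THREADS.getThreadInfo(ids, true, true)) {
            if (info == null) {
                continue; // the thread terminated between the two calls
            }
            report.add(info.getThreadName() + " waits for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
        return report;
    }

    /** Polls describeDeadlocks() until it reports something or the timeout passes. */
    static List<String> pollForDeadlocks(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        List<String> report = describeDeadlocks();
        while (report.isEmpty() && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            report = describeDeadlocks();
        }
        return report;
    }

    /** Demo only: two daemon threads that take the same two monitors in opposite order. */
    static void startDeadlockedPair() {
        Object a = new Object();
        Object b = new Object();
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        startDaemon("demo-1", a, b, bothHoldFirstLock);
        startDaemon("demo-2", b, a, bothHoldFirstLock);
    }

    private static void startDaemon(String name, Object first, Object second,
                                    CountDownLatch bothHoldFirstLock) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHoldFirstLock.countDown();
                try {
                    bothHoldFirstLock.await(); // wait until the other thread holds its lock
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // never reached: the other thread holds this monitor
                }
            }
        }, name);
        t.setDaemon(true); // lets the JVM exit even though the threads stay stuck
        t.start();
    }

    public static void main(String[] args) {
        startDeadlockedPair();
        pollForDeadlocks(5_000).forEach(System.out::println);
    }
}
```

In a real system `describeDeadlocks()` would presumably be driven by a scheduled task and feed the listeners from step 2; the deliberately deadlocked daemon threads exist only to give the demo something to find.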

I know that Oracle RDBMS already has a deadlock detection and resolution strategy in place. I wonder if they would ever share their strategies in this thread 🙂

Can't add my comment as an answer so adding it as a comment here.

=================================================================

I completely understand the risk of killing the threads. I was 100% certain that I would get answers like this, but I was also hoping that someone would suggest something new. I'll keep the thread open, as no answer here tells me anything I do not already know. Thank you very much for trying, though.

Best Answer

I don't think you can do this in the general case - detecting arbitrary deadlocks/livelocks in a complex system is equivalent to the halting problem so you haven't got a hope of solving it. Recovery from such situations can also be arbitrarily complex, and it's almost impossible to return the system to a "safe" state. My overall advice would be to fix the underlying architectural issues rather than trying to paper over the problem with some flawed form of automatic deadlock/livelock detection and recovery.

Basically you are trying to solve the wrong problem - the deadlocks aren't the issue, your architecture and development approach are.

By the way, if you are concerned about clients with availability SLAs, then implementing an automated deadlock detection and resolution system is one of the worst things you can do, since this could potentially corrupt your client's data (the reason you have locks in the first place is to stop data getting corrupted by concurrent transactions!).

Think about how the conversation will go: "So let me get this straight - you implemented a deadlock resolution strategy which silently corrupts our data and pretends everything is OK so that you were able to hit your SLA target?" You could be in for a pretty big lawsuit if this happens; a missed SLA is peanuts in comparison.

FWIW, I think that lock-based programming is the wrong approach anyway for complex systems. Ideally you want to make everything stateless, but if you really need mutable state then a software transactional memory based approach is IMHO the right way to handle this. STMs used correctly can't deadlock since they don't require locks. This excellent video presentation describes Clojure's STM system which is an example of what is possible in this space.
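Java has no STM in its standard library, so the following is not STM. It is a minimal sketch of the same lock-free idea using `AtomicReference`: state is an immutable snapshot, and an update is retried with compare-and-set until it applies cleanly, so no thread ever blocks while holding a lock. The class `LockFreeAccounts` and everything in it are hypothetical names chosen for illustration, and the approach is closer in spirit to a single Clojure atom than to coordinated STM refs.

```java
import java.util.concurrent.atomic.AtomicReference;

final class LockFreeAccounts {
    /** Immutable snapshot: an update builds a new instance instead of mutating. */
    static final class Balances {
        final long checking;
        final long savings;

        Balances(long checking, long savings) {
            this.checking = checking;
            this.savings = savings;
        }
    }

    private final AtomicReference<Balances> state;

    LockFreeAccounts(long checking, long savings) {
        state = new AtomicReference<>(new Balances(checking, savings));
    }

    /** Moves money between the two fields; both change in a single atomic swap. */
    void transferToSavings(long amount) {
        // updateAndGet re-runs the function if another thread won the race.
        state.updateAndGet(b -> new Balances(b.checking - amount, b.savings + amount));
    }

    Balances snapshot() {
        return state.get();
    }

    /** Demo only: several threads transfer 1 unit at a time, with no locks involved. */
    static Balances runDemo(int threads, int transfersPerThread) {
        LockFreeAccounts accounts = new LockFreeAccounts(100_000, 0);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < transfersPerThread; n++) {
                    accounts.transferToSavings(1);
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return accounts.snapshot();
    }

    public static void main(String[] args) {
        Balances b = runDemo(4, 10_000);
        System.out.println(b.checking + " / " + b.savings);
    }
}
```

The trade-off is that the update function may run more than once under contention, so it has to be free of side effects.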

Related Solutions

Java and JVM Licensing Explained

You can write a compiler that implements the Java Language Specification or write a JVM that implements the Java Virtual Machine specification, but when you officially want to call it "Java", you have to prove it is compatible by passing the tests of the TCK (Technology Compatibility Kit) and pay for a license from Oracle.

Oracle doesn't make it easy for other parties to do this, though. Apache has their own implementation of the JVM (Apache Harmony), but Sun, and now Oracle, has not cooperated in making the TCK available or in letting Apache get a license, which has led to a lot of resentment between Apache and Oracle.

Long ago Microsoft had their own version of Java (that was indeed called "Java"). They tried to change it to make it Windows-specific, which Sun of course didn't like. There was a lawsuit, Microsoft lost, quit their own Java version and created .NET, which is a completely different thing that just happens to work a lot like how Java works...

The lawsuit about Android isn't based on this at all; Google isn't saying that Android is Java. That lawsuit is about patents; Oracle has patents on a number of ideas and concepts in their own JVM implementation and is claiming that Google is using the same patented ideas in Android without getting a patent license from Oracle.

Error Handling Strategies in Multithreaded Environments – Architecture and Concurrency

My two cents.

First, most async models I've seen in libraries tend to make me frustrated. Everybody seems to have their own slightly different brand of async, and many of those interfaces are not good. As such, I tend to like libraries that keep everything synchronous. Note that callbacks can still be good. But keep all the logic on one thread; the application programmer often wants to think about threading apart from the task the library is trying to perform.

Second, a small library devoted to asynchronous code can be a very good thing - as long as that is its sole focus. A pattern that I've seen and liked in C# is to chain together actions on different threads, but write it in almost a single threaded manner with a fluent interface. (The new keyword await is somewhat along the same lines.) One common place where this comes up is in dispatches onto the UI thread. Then provide a way to handle exceptions at the end, almost like a catch block. So for example maybe something like this:

...
int expensiveResult=-1;
YourThreadLibrary
  .Background(()=>expensiveResult=DoLongRunningTaskToCreate())
  .UI(()=>UpdateUI(expensiveResult))
  .Exception(ex=>LogIt(ex));

Exceptions are the way to go in C# and you have GC, so your situation may be different. But the pattern may still make some sense.

I know this might seem too simplistic but these are the tools that I've seen be general enough to work across many problems. The nice thing is that a fluent interface like this is definitely open to extension if written properly so you can add your .ParallelFailOnAny(params Action[]) etc.
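For readers coming from the Java side, roughly the same shape can be sketched with `CompletableFuture`. This is an illustration under assumptions rather than a port of any particular library: `ChainedWork` is a hypothetical name, and a single-thread executor stands in for a UI thread, which plain Java does not have.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

final class ChainedWork {
    /** Background work, then a step on the "UI" thread, with one handler at the end. */
    static String run(Supplier<Integer> longRunningTask) {
        // Assumption: a single-thread executor plays the role of the UI thread.
        ExecutorService uiThread = Executors.newSingleThreadExecutor();
        try {
            return CompletableFuture
                    .supplyAsync(longRunningTask)                   // runs on a background pool
                    .thenApplyAsync(result -> "updated UI with " + result, uiThread)
                    .exceptionally(ex -> "logged: " + rootMessage(ex)) // acts like a catch block
                    .join();
        } finally {
            uiThread.shutdown();
        }
    }

    /** Failures from earlier stages arrive wrapped in a CompletionException. */
    private static String rootMessage(Throwable ex) {
        Throwable cause = ex.getCause() != null ? ex.getCause() : ex;
        return cause.getMessage();
    }

    public static void main(String[] args) {
        System.out.println(run(() -> 42));
        System.out.println(run(() -> {
            throw new IllegalStateException("boom");
        }));
    }
}
```

A failure in any stage skips the remaining steps and lands in the single handler at the end, which is the property the fluent C# example is after.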
