Java – How to avoid halting of the JVM due to a deadlock in java

concurrency, java

I have read many websites that discuss how to avoid deadlocks and how to design systems so they do not occur. I completely understand those strategies.

My question is based on the following preconditions:

  1. You have a company with thousands of developers.
  2. Different teams work on the same product, each on its own module.
  3. New developers write new code without knowing the overall system (think of an enterprise application).
  4. The software must be highly available: 15 minutes of downtime counts as an SLA violation.

I could list a few more preconditions, but I think these are strong enough to explain why I might need a recovery strategy for a deadlock in software.

Please note that re-designing the modules whenever we find a deadlock is not realistic.

That said, can someone take some time to suggest or brainstorm how to resolve a deadlock if one does happen, so that we can report it and move on instead of halting completely? Here is the approach I have in mind:

  1. Run a deadlock detector periodically to look for deadlocks in the system.
  2. If a deadlock is detected, raise an event so that it can be resolved.
  3. The deadlock event listener then kicks in and acts on the deadlocked threads.
  4. For each thread, identify the contention.
  5. Write an intelligent algorithm that either releases the locks and kills the thread, or releases the locks and re-evaluates the thread.
  6. The notification from step 2 can be handled by multiple listeners, of which logging is one.
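Steps 1 to 4 can be sketched with the standard `java.lang.management.ThreadMXBean` API, whose `findDeadlockedThreads()` returns the IDs of threads caught in a lock cycle (or `null` when there is none). This is only a minimal sketch: the class name `DeadlockWatch`, the report format, and the listener type are assumptions made for illustration, and it only reports the contention, it does not try to resolve it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper, not part of any library.
final class DeadlockWatch {
    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    /** Steps 1 and 4: one line per deadlocked thread, naming the lock it waits for and its owner. */
    static List<String> findDeadlocks() {
        List<String> report = new ArrayList<>();
        long[] ids = THREADS.findDeadlockedThreads(); // null when no deadlock exists
        if (ids == null) {
            return report;
        }
        for (ThreadInfo info : THREADS.getThreadInfo(ids)) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            report.add(info.getThreadName() + " waits for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
        return report;
    }

    /** Steps 1 to 3: poll on a daemon thread and notify the listener when a deadlock is found. */
    static ScheduledExecutorService start(long periodSeconds, Consumer<List<String>> listener) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "deadlock-watch");
            t.setDaemon(true); // never keep the JVM alive just for the watchdog
            return t;
        });
        scheduler.scheduleAtFixedRate(() -> {
            List<String> report = findDeadlocks();
            if (!report.isEmpty()) {
                listener.accept(report);
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

A logging listener (step 6) could then be registered with `DeadlockWatch.start(10, report -> report.forEach(System.err::println));`.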

I know how to go about steps 1, 2 and 6. I need help with 3, 4 and 5.
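For step 5, one limited option that plain Java does offer: a thread blocked in `Lock.lockInterruptibly()` can be interrupted and will back out with an `InterruptedException`, whereas a thread blocked on a `synchronized` monitor cannot. The sketch below assumes all contended locks are `java.util.concurrent.locks.Lock` instances acquired this way, and that the guarded action only runs once both locks are held, so a victim has not modified anything when it backs off. `InterruptibleTransfer`, `withBoth` and `interruptOneVictim` are hypothetical names; picking the first reported thread as the victim is an arbitrary choice.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.locks.Lock;

// Hypothetical helper, not part of any library.
final class InterruptibleTransfer {

    /** Runs the action under both locks; returns false if interrupted while waiting for either. */
    static boolean withBoth(Lock first, Lock second, Runnable action) {
        try {
            first.lockInterruptibly();
            try {
                second.lockInterruptibly();
                try {
                    action.run(); // only reached with both locks held
                    return true;
                } finally {
                    second.unlock();
                }
            } finally {
                first.unlock();
            }
        } catch (InterruptedException e) {
            // Any lock that was acquired has already been released by the finally blocks.
            return false; // the caller decides whether to retry or report a failure
        }
    }

    /** Interrupts one thread of a detected deadlock; returns its name, or null if none was found. */
    static String interruptOneVictim() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] ids = bean.findDeadlockedThreads(); // null when no deadlock exists
        if (ids == null) {
            return null;
        }
        // ThreadMXBean only gives IDs, so look the Thread object up among the live threads.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getId() == ids[0]) {
                t.interrupt();
                return t.getName();
            }
        }
        return null;
    }
}
```

This does nothing for deadlocks on intrinsic (`synchronized`) locks, and it is only safe if every caller treats a `false` return as "nothing happened".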

I know that Oracle RDBMS already has a deadlock detection and resolution strategy in place. I wonder if they would ever share their strategies in this thread 🙂

I can't add this as a comment, so I am adding it here.

=================================================================

I completely understand the risk of killing threads. I was 100% certain that I would get answers like this, but I was also hoping that someone would suggest something new. I'll keep the thread open, as none of the answers here tell me anything I did not already know. Thank you very much for trying, though.

Best Answer

I don't think you can do this in the general case - detecting arbitrary deadlocks/livelocks in a complex system is equivalent to the halting problem so you haven't got a hope of solving it. Recovery from such situations can also be arbitrarily complex, and it's almost impossible to return the system to a "safe" state. My overall advice would be to fix the underlying architectural issues rather than trying to paper over the problem with some flawed form of automatic deadlock/livelock detection and recovery.

Basically you are trying to solve the wrong problem: the deadlocks aren't the issue, your architecture and development approach are.

By the way, if you are concerned about clients with availability SLAs, then implementing an automated deadlock detection and resolution system is one of the worst things you can do, since this could potentially corrupt your client's data (the reason you have locks in the first place is to stop data getting corrupted by concurrent transactions!).

Think about how the conversation will go: "So let me get this straight: you implemented a deadlock resolution strategy which silently corrupts our data and pretends everything is OK, so that you were able to hit your SLA target?" You could be in for a pretty big lawsuit if this happens; a missed SLA is peanuts in comparison.

FWIW, I think that lock-based programming is the wrong approach anyway for complex systems. Ideally you want to make everything stateless, but if you really need mutable state then a software transactional memory based approach is IMHO the right way to handle this. STMs used correctly can't deadlock since they don't require locks. This excellent video presentation describes Clojure's STM system which is an example of what is possible in this space.
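Standard Java does not ship an STM, so the following is not STM itself. It is a much smaller lock-free sketch of the same optimistic idea: read an immutable snapshot, compute a new one, and retry if another thread changed the state in the meantime. It uses `AtomicReference.compareAndSet` from the JDK; the `Ledger` class and its fields are hypothetical names chosen for illustration. No thread ever waits while holding a lock here, so there is nothing to deadlock on.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical example class, not part of any library.
final class Ledger {

    /** Immutable snapshot of both balances; a new instance is created for every change. */
    static final class Balances {
        final long checking;
        final long savings;

        Balances(long checking, long savings) {
            this.checking = checking;
            this.savings = savings;
        }
    }

    private final AtomicReference<Balances> state;

    Ledger(long checking, long savings) {
        this.state = new AtomicReference<>(new Balances(checking, savings));
    }

    /** Moves amount from checking to savings; returns false if checking has insufficient funds. */
    boolean transfer(long amount) {
        if (amount < 0) {
            throw new IllegalArgumentException("amount must not be negative");
        }
        while (true) {
            Balances current = state.get();
            if (current.checking < amount) {
                return false; // nothing was modified
            }
            Balances next = new Balances(current.checking - amount, current.savings + amount);
            // Publish only if nobody else changed the state since we read it; otherwise retry.
            if (state.compareAndSet(current, next)) {
                return true;
            }
        }
    }

    /** Returns a consistent view of both balances. */
    Balances snapshot() {
        return state.get();
    }
}
```

Because both balances live in one immutable object, readers always see a consistent pair, which is the property the locks were protecting in the first place. A real STM generalizes this to many independent references.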