I've been developing concurrent
systems for several years now, and I
have a pretty good grasp on the
subject despite my lack of formal
training (i.e. no degree).
Many of the best programmers I know didn't finish university.
As for me, I studied philosophy.
C/C++, C#, Java, etc.). In particular,
it can be near impossible to recreate
conditions that happen readily on one
system in your development
environment.
Yes.
How do you figure out what can be made concurrent vs. what has to be
sequential?
We usually start with a 1,000-mile-high metaphor to clarify our architecture, first to ourselves and then to others.
Whenever we faced that problem, we always found a way to limit the visibility of concurrent objects to non-concurrent ones.
Lately I discovered Actors in Scala, and I saw that my old solutions were a kind of "mini-actors", much less powerful than the Scala ones. So my suggestion is to start from there.
Another suggestion is to sidestep as many problems as possible: for example, we use a centralised cache (Terracotta) instead of keeping maps in memory, inner-class callbacks instead of synchronised methods, message passing instead of writing to shared memory, etc.
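To make "sending messages instead of writing shared memory" concrete, here is a minimal sketch in plain Java rather than Scala. The WordCounter class, its method names and the stop sentinel are all hypothetical, invented for illustration; this is not our production code and not the Scala Actors API. One worker thread owns an ordinary HashMap, and everyone else only ever puts messages on a queue, so the map itself needs no locking.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical "mini-actor": a single worker thread owns the state,
// other threads communicate with it only through the inbox queue.
class WordCounter {
    // Sentinel compared by identity, so no real word can collide with it (assumption).
    private static final String STOP = new String("<stop>");

    private final BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
    private final Map<String, Integer> counts = new HashMap<>(); // touched by the worker only
    private final Thread worker = new Thread(this::drain, "word-counter");

    WordCounter() {
        worker.start();
    }

    // Callers never see the map; they only send messages. Safe from any thread.
    void send(String word) {
        inbox.add(word);
    }

    // Worker loop: processes one message at a time until the sentinel arrives.
    private void drain() {
        try {
            for (String msg = inbox.take(); msg != STOP; msg = inbox.take()) {
                counts.merge(msg, 1, Integer::sum);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Sends the sentinel, waits for the worker to finish, then returns a copy of the counts.
    Map<String, Integer> stopAndGet() {
        inbox.add(STOP);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new HashMap<>(counts);
    }
}
```

Any number of threads can call send() concurrently; the queue is the only shared object, and it is a thread-safe one from java.util.concurrent. The non-concurrent code that uses WordCounter never has to know there is a thread behind it, which is the "limited visibility" idea mentioned above.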
With Scala it's all much easier anyway.
How do you reproduce error conditions and view what is happening
as the application executes?
No real answer here. We have some unit tests for concurrency, and we have a load test suite to stress the application as much as we can.
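For what it's worth, a concurrency unit test of this kind can be sketched roughly as below. ConcurrencyHarness and hammer are hypothetical names, and the 30-second timeout is an arbitrary assumption; the only real APIs used are ExecutorService and CountDownLatch from java.util.concurrent. The idea is to line up all threads behind a start gate and release them together, so that they hit the code under test at the same moment.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for concurrency unit tests.
class ConcurrencyHarness {
    // Runs `task` `iterations` times on each of `threads` threads, releasing
    // all threads at once to maximise contention.
    // Returns false if the run is interrupted or does not finish within 30 seconds.
    static boolean hammer(int threads, int iterations, Runnable task) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch ready = new CountDownLatch(threads); // every worker has started
        CountDownLatch go = new CountDownLatch(1);          // the start gate
        CountDownLatch done = new CountDownLatch(threads);  // every worker has finished
        try {
            for (int t = 0; t < threads; t++) {
                pool.execute(() -> {
                    ready.countDown();
                    try {
                        go.await();
                        for (int i = 0; i < iterations; i++) {
                            task.run();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        done.countDown();
                    }
                });
            }
            ready.await();  // wait until all workers are parked at the gate
            go.countDown(); // open the gate
            return done.await(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A test would then call something like hammer(8, 10_000, counter::incrementAndGet) on an AtomicInteger and assert that the final value is 80,000. Passing such a test shows the absence of a failure in that run, not the absence of a bug.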
How do you visualize the interactions between the different
concurrent parts of the application?
Again, no real answer: we sketch our metaphor on the whiteboard and try to make sure there are no conflicts at the architectural level.
By architecture I mean Neal Ford's definition: software architecture is everything that will be very hard to change later.
programming leads me to believe you
need a different mindset than you do
with sequential programming.
Maybe, but for me it's simply impossible to think in a parallel way, so it's better to design our software in a way that doesn't require parallel thinking, with clear guardrails to avoid collisions between concurrency lanes.
I don't think you can do this in the general case - detecting arbitrary deadlocks/livelocks in a complex system is equivalent to the halting problem so you haven't got a hope of solving it. Recovery from such situations can also be arbitrarily complex, and it's almost impossible to return the system to a "safe" state. My overall advice would be to fix the underlying architectural issues rather than trying to paper over the problem with some flawed form of automatic deadlock/livelock detection and recovery.
Basically you are trying to solve the wrong problem - the deadlocks aren't the issue, your architecture and development approach are.
By the way, if you are concerned about clients with availability SLAs, then implementing an automated deadlock detection and resolution system is one of the worst things you can do, since this could potentially corrupt your client's data (the reason you have locks in the first place is to stop data getting corrupted by concurrent transactions!).
Think about how the conversation will go: "So let me get this straight - you implemented a deadlock resolution strategy which silently corrupts our data and pretends everything is OK, just so that you could hit your SLA target?" You could be in for a pretty big lawsuit if this happens; a missed SLA is peanuts in comparison.
FWIW, I think that lock-based programming is the wrong approach anyway for complex systems. Ideally you want to make everything stateless, but if you really need mutable state then a software transactional memory based approach is IMHO the right way to handle this. STMs used correctly can't deadlock since they don't require locks. This excellent video presentation describes Clojure's STM system which is an example of what is possible in this space.
Best Answer
For all practical purposes, you can't. You can run a set of tests successfully numerous times, and then, without any change being made, a test will fail. This is what makes testing multi-threaded code hard - it is not deterministic.
You may be able to test to an acceptable degree of certainty using statistical methods, i.e. "we ran 10^y randomized tests with no faults found, so it is statistically probable that there are no defects". Note that this only bounds how often a defect shows up under the conditions you tested; it does not prove there is none.
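As a rough illustration of that statistical approach, the sketch below repeats a randomized trial against ConcurrentHashMap and counts how many trials break an invariant. RandomizedTrials and its parameters are hypothetical names made up for this example. It relies on the documented behaviour that ConcurrentHashMap.merge is performed atomically, so the counts must always add up to the number of increments made.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical randomized stress test; names and parameters are illustrative.
class RandomizedTrials {
    // One trial: `writers` threads each increment `ops` randomly chosen keys.
    // Invariant checked afterwards: all counts sum to writers * ops.
    static boolean trial(int writers, int ops, int keySpace) {
        Map<Integer, Integer> map = new ConcurrentHashMap<>();
        Thread[] threads = new Thread[writers];
        for (int w = 0; w < writers; w++) {
            threads[w] = new Thread(() -> {
                ThreadLocalRandom rnd = ThreadLocalRandom.current();
                for (int i = 0; i < ops; i++) {
                    map.merge(rnd.nextInt(keySpace), 1, Integer::sum);
                }
            });
            threads[w].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false; // treat an interrupted trial as a failure
        }
        long total = 0;
        for (int v : map.values()) {
            total += v;
        }
        return total == (long) writers * ops;
    }

    // Repeats the trial and reports how many runs violated the invariant.
    static int countFailures(int trials, int writers, int ops, int keySpace) {
        int failures = 0;
        for (int t = 0; t < trials; t++) {
            if (!trial(writers, ops, keySpace)) {
                failures++;
            }
        }
        return failures;
    }
}
```

If n independent trials all pass, the usual "rule of three" gives an approximate 95% upper bound of 3/n on the per-trial failure probability, and only for the load pattern and hardware you actually ran on. That is the sense in which the result is statistical rather than a guarantee.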
You cannot run a test that guarantees thread safety; it has to be achieved by design, together with white-box testing that the design has been implemented correctly.
As far as ConcurrentHashMap goes: if your vendor says something is thread safe, all you can decide is whether you trust your vendor. Or perhaps, paraphrasing a great movie line: "You've got to ask yourself one question: Do I feel lucky? Well, do ya, punk?" And if you don't trust your vendor, I think maybe you have bigger problems.
EDIT: My background involves life-critical and hard real-time (sub nanosecond) embedded systems. My answer is given in a context where "what's the worst that can happen?" is somewhat more serious than "an unexplained software crash".