Error Handling Strategies in Multithreaded Environments – Architecture and Concurrency

Architecture, concurrency, error handling, libraries, multithreading

TL;DR What error generating and handling strategies do you use in multithreaded code intended for use by others, and why do you use them? If applicable, state which programming paradigm each is useful for. I'm more interested in imperative, concurrent environments, but any in general will be useful.

I'm writing a little concurrency library that's currently a pet/C++11 learning project but may be used internally at my work later. In terms of domain it's more in the realm of DSP and media streaming, but since it will be used in a game engine I need fairly strong error handling.

My big block at the moment is not getting my head around the parallel code and data structures, but working out how to do error handling and reporting. My main experience in large systems is games, but I'm usually using libraries, not designing them. I'm just looking for different strategies and how they might be used in different situations, as this is a rather big gap in my knowledge.

My biggest concern is employing a strategy so that if the program can recover, it does. If it does recover, there should be some mechanism to notify the user of what happened. I already have some data structures that can recover but may leak memory, for example if a destructor fails.

Some Approaches:

  • Handle exceptions from which you can safely recover inside the library, but let fatal exceptions propagate to users of the library to indicate that the object is now in an undefined state. It's my preferred approach in single-threaded environments, but it won't communicate bad state to other threads.
  • When an exception occurs that would put a data structure in an unrecoverable state, tear down the data structure, set a flag to block any more operations on it, then raise a general exception to the end user. This is hard for lock-free algorithms.
  • Forward an error state through parallel computations. Works well for Kahn process networks and other high-level concurrency models. Not so helpful if a primitive supporting the high-level model has failed.
  • Terminate the thread/task that caused the exception. Works well for thread local data/computation but not much of a solution for shared data.
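
To make the second approach concrete, here is a minimal C++11 sketch of what I mean by tearing down and flagging a structure. It assumes a lock-based container (so it sidesteps the lock-free difficulty entirely), and `GuardedBuffer`, `apply`, `failure` and `demo_poisoning` are hypothetical names invented for illustration, not part of any existing library. The first failure is stored as a `std::exception_ptr` so that other threads can find out what went wrong.

```cpp
#include <cassert>
#include <exception>
#include <mutex>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

// Hypothetical wrapper, for illustration only.
class GuardedBuffer {
public:
    // Runs `op` on the data under a lock. If `op` throws, the buffer is
    // "poisoned": the data is torn down, the original exception is stored,
    // and every later call from any thread fails fast.
    template <typename Op>
    void apply(Op op) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (failure_) {
            throw std::runtime_error("GuardedBuffer is poisoned");
        }
        try {
            op(data_);
        } catch (...) {
            failure_ = std::current_exception();  // remember the cause
            data_.clear();                        // tear down the state
            throw;                                // still tell the caller
        }
    }

    // Lets any thread inspect the original cause; null while healthy.
    std::exception_ptr failure() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return failure_;
    }

private:
    mutable std::mutex mutex_;
    std::vector<int> data_;
    std::exception_ptr failure_;
};

// Returns the message of the original failure as seen from another thread.
std::string demo_poisoning() {
    GuardedBuffer buffer;
    buffer.apply([](std::vector<int>& v) { v.push_back(1); });

    // A worker thread hits an unrecoverable error while mutating the buffer.
    std::thread worker([&buffer] {
        try {
            buffer.apply([](std::vector<int>&) {
                throw std::logic_error("invariant broken");
            });
        } catch (const std::exception&) {
            // The worker was notified; the buffer keeps the cause.
        }
    });
    worker.join();

    // The calling thread is now blocked from using the buffer...
    bool blocked = false;
    try {
        buffer.apply([](std::vector<int>& v) { v.push_back(2); });
    } catch (const std::runtime_error&) {
        blocked = true;
    }
    assert(blocked);

    // ...and can still recover the original cause.
    if (std::exception_ptr cause = buffer.failure()) {
        try {
            std::rethrow_exception(cause);
        } catch (const std::exception& e) {
            return e.what();
        }
    }
    return "";
}
```

Calling `demo_poisoning()` returns the worker's original message, so the bad state does reach a thread other than the one that caused it. What this sketch does not answer is how to get the same behaviour without the mutex.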

Just as a note, I know that a good library will probably use a mix of more than just what's listed above. I just don't have the experience to know what strategy is good where for any sufficiently large system.

Best Answer

My two cents.

First, most async models I've seen in libraries tend to make me frustrated. Everybody seems to have their own slightly different brand of async, and many of those interfaces are not good. As such, I tend to like libraries that keep everything synchronous. Note that callbacks can still be good. But keep all the logic on one thread; the application programmer often wants to think about threading apart from the task the library is trying to perform.

Second, a small library devoted to asynchronous code can be a very good thing - as long as that is its sole focus. A pattern that I've seen and liked in C# is to chain together actions on different threads, but write it in almost a single threaded manner with a fluent interface. (The new keyword await is somewhat along the same lines.) One common place where this comes up is in dispatches onto the UI thread. Then provide a way to handle exceptions at the end, almost like a catch block. So for example maybe something like this:

...
int expensiveResult=-1;
YourThreadLibrary
  .Background(()=>expensiveResult=DoLongRunningTaskToCreate())
  .UI(()=>UpdateUI(expensiveResult))
  .Exception(ex=>LogIt(ex));

Exceptions are the way to go in C# and you have GC, so your situation may be different. But the pattern may still make some sense.

I know this might seem too simplistic but these are the tools that I've seen be general enough to work across many problems. The nice thing is that a fluent interface like this is definitely open to extension if written properly so you can add your .ParallelFailOnAny(params Action[]) etc.
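
Since your library is C++11, here is a rough sketch of how the same shape could look there. Treat it as an illustration under assumptions, not a design: `Chain`, `background`, `foreground`, `on_error` and `demo_chain` are made-up names, `foreground` simply runs on the calling thread rather than dispatching to a real UI thread, and the whole chain runs synchronously when `on_error` is called. The part that carries over is that `std::future::get()` rethrows a background step's exception on the calling thread, so one handler at the end can see failures from any step.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <future>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical fluent chain; the names are illustrative only.
class Chain {
public:
    // Queue a step that will run on a separate thread.
    Chain& background(std::function<void()> step) {
        steps_.push_back([step] {
            // get() waits for the step and rethrows anything it threw.
            std::async(std::launch::async, step).get();
        });
        return *this;
    }

    // Queue a step that runs on the thread that calls on_error().
    // A real library would dispatch to a UI or main-loop thread instead.
    Chain& foreground(std::function<void()> step) {
        steps_.push_back(std::move(step));
        return *this;
    }

    // Terminal call: runs the steps in order. The first exception skips the
    // remaining steps and is handed to `handler`. Returns true if every
    // step completed.
    bool on_error(const std::function<void(std::exception_ptr)>& handler) {
        try {
            for (auto& step : steps_) {
                step();
            }
        } catch (...) {
            handler(std::current_exception());
            return false;
        }
        return true;
    }

private:
    std::vector<std::function<void()>> steps_;
};

// Usage sketch: the second step is skipped because the first one throws.
std::string demo_chain() {
    int expensive_result = -1;
    std::string log;
    bool ok = Chain()
        .background([] { throw std::runtime_error("decode failed"); })
        .foreground([&] { expensive_result = 0; })
        .on_error([&](std::exception_ptr ex) {
            try {
                std::rethrow_exception(ex);
            } catch (const std::exception& e) {
                log = e.what();  // stand-in for LogIt(ex)
            }
        });
    assert(!ok);
    return log + "/" + std::to_string(expensive_result);
}
```

Here `demo_chain()` yields the logged message followed by the untouched `-1`, showing that the foreground step never ran. Without GC you would also have to decide who owns the captured state if the chain outlives the caller, which this sketch avoids by blocking.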
