Best Practices for SOA Exception Handling

soa, soap, wcf, web-services

My colleague and I are having an interesting debate about how to handle exceptions in an SOA:

On one side, I support what Juval Lowy said in Programming WCF Services 3rd Edition:

As stated at the beginning of this chapter, it is a common illusion
that clients care about errors or have anything meaningful to do when
they occur. Any attempt to bake such capabilities into the client
creates an inordinate degree of coupling between the client and the
object, raising serious design questions. How could the client
possibly know more about the error than the service, unless it is
tightly coupled to it? What if the error originated several layers
below the service—should the client be coupled to those low-level
layers? Should the client try the call again? How often and how
frequently? Should the client inform the user of the error? Is there a
user? By having all service exceptions be indistinguishable from one
another, WCF decouples the client from the service. The less the
client knows about what happened on the service side, the more
decoupled the interaction will be.

On the other side, here's what my colleague suggests:

I believe it’s simply incorrect, as it does not align with best
practices in building a service oriented architecture and it ignores
the general idea that there are problems that users are able to
recover from, such as not keying a value correctly. If we considered
only systems exceptions, perhaps this idea holds, but systems
exceptions are only part of the exception domain. User recoverable
exceptions are the other part of the domain and are likely to happen
on a regular basis. I believe the correct way to build a service
oriented architecture is to map user recoverable situations to checked
exceptions, then to marshall each checked exception back to the client
as a unique exception that client application programmers are able to
handle appropriately. Marshall all runtime exceptions back to the
client as a system exception, along with the stack trace so that it is
easy to troubleshoot the root cause.

I'd like to know what you think about this.

Best Answer

User recoverable exceptions should not be "exceptions". Exceptions are for exceptional circumstances. Transposing a few letters in a form field is something that you should expect and plan for.

Part of the impetus behind a "Service-Oriented Architecture" is that services are reusable. Sure, it might be a client sending messages to it... or it might be another service, or an orchestration engine, or an event subscriber, or an automated task or batch job. These actors cannot reliably recover from a fault, no matter how much detail you put into it. In many cases they may even be using one-way messaging (e.g. MSMQ), in which case you're not even allowed to send a fault back; there's simply no channel for it.

Once a service has made the decision to send back a fault message, assuming that the originator can actually receive it, then all the originator can sensibly do is roll back the transaction it's in - if it was smart enough to enlist in one.

Juval is exactly right. Marshaling fault messages into client exceptions is fine when you've exhausted all other options (e.g. an unhandled exception), but there is no point in the service trying to provide all kinds of detail. None. Users will not read or understand the error message, and if you think having a stack trace is a benefit from the user perspective then you don't understand the first thing about usability.

Microsoft actually tells you to put exception detail in faults. But don't. Please don't. It just encourages you to be lazy and fault when you really should be handling the errors. I've been down that road and it is one of never-ending pain and misery. It's especially pernicious in WCF because faulting permanently invalidates the service proxy, and it's actually very difficult to design client apps to recover from this, particularly if you're following other "best practices" and doing dependency injection.

What you should - nay, must be doing is logging all errors on the service side, generally into persistent storage, and sending notifications as bug reports. More sophisticated, service-bus architectures will even have an error queue which holds all of the original messages that caused the errors - but at the very least, you want the errors themselves. You want them - not your users. Don't rely on them to give you the stack traces, because if you do, then you have already failed them.

"User recoverable exceptions" simply do not exist in an SOA. There is no such thing because you can't know in advance who the "user" is going to be. If an exception is recoverable then it should be part of the message - for example, in XML form:

<customerUpdateResponse customerId="123" status="notUpdated">
    <validationErrors>
        <requiredFieldMissing field="fullName"/>
        <maxLengthExceeded field="phone" maxLength="30" actualLength="45"/>
    </validationErrors>
</customerUpdateResponse>

This is just off the top of my head, but hopefully you get the idea; if an operation can fail for known, documented reasons then that "failure" becomes part of the specification. In this case, the message is sending back an event saying what happened, and the client application can interpret this data appropriately. The important thing is that it is part of the contract, not some unexpected "stop the presses" error.
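To make that concrete, here is a minimal Python sketch of an operation that reports validation problems as ordinary response data instead of raising. The names (`update_customer`, `ValidationError`, `CustomerUpdateResponse`) and the 30-character phone limit are illustrative assumptions that mirror the XML above, not part of any real API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

MAX_PHONE_LENGTH = 30  # assumed limit, taken from the XML example


@dataclass
class ValidationError:
    kind: str                      # e.g. "requiredFieldMissing"
    field_name: str
    details: Dict[str, int] = field(default_factory=dict)


@dataclass
class CustomerUpdateResponse:
    customer_id: str
    status: str                    # "updated" or "notUpdated"
    validation_errors: List[ValidationError] = field(default_factory=list)


def update_customer(customer_id: str, full_name: Optional[str],
                    phone: str) -> CustomerUpdateResponse:
    """Hypothetical operation: expected failures are returned, not raised."""
    errors: List[ValidationError] = []
    if not full_name:
        errors.append(ValidationError("requiredFieldMissing", "fullName"))
    if len(phone) > MAX_PHONE_LENGTH:
        errors.append(ValidationError(
            "maxLengthExceeded", "phone",
            {"maxLength": MAX_PHONE_LENGTH, "actualLength": len(phone)},
        ))
    if errors:
        return CustomerUpdateResponse(customer_id, "notUpdated", errors)
    # A real service would persist the change here.
    return CustomerUpdateResponse(customer_id, "updated")
```

A caller simply branches on `status` and inspects `validation_errors`; there is no try/except around the call, because a rejected update is a documented outcome rather than a fault.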

Now I know that WCF lets you use fault contracts and so on, but I don't see the point; it's just adding complexity where it's not really needed. SOAP faults are, honestly, a pain in the butt to deal with from any angle.

As mentioned earlier, you also have to carefully plan for the case where you can't send any response. Fledgling "SOAs" with a smattering of web services tend to be predominantly RPC style, but that's actually a poor strategy for designing a robust high-performance architecture. The killer feature of an SOA, in my opinion at least, is publish-subscribe, which allows you to totally decouple the services themselves and only ever share messages. But this comes at a cost: you have to dispense with two-way communication. If a service wants to fault after consuming an event, well, great, but nobody's going to be listening. Which means that proper logging and exception notification is really, really important.
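A minimal sketch of that situation in Python, using an in-memory `queue.Queue` as a stand-in for a real broker's error queue. The `consume` function, the `handle_order_placed` handler, and the message shape are all hypothetical; the point is only that a one-way consumer has nobody to report to, so it logs the failure and parks the original message.

```python
import logging
import queue
from typing import Callable, Dict

logger = logging.getLogger("subscriber")

# Stand-in for a durable error queue; a real system would use its broker.
error_queue: queue.Queue = queue.Queue()


def consume(message: Dict, handler: Callable[[Dict], None]) -> bool:
    """Process a one-way message. Returns True on success, False if parked."""
    try:
        handler(message)
        return True
    except Exception:
        # No reply channel exists, so record the failure on the service side
        # and keep the original message for later inspection or replay.
        logger.exception("handler failed for message %r", message.get("id"))
        error_queue.put(message)
        return False


def handle_order_placed(message: Dict) -> None:
    """Hypothetical handler that fails when the payload is incomplete."""
    if "orderId" not in message:
        raise KeyError("orderId")
```

Calling `consume({"id": 2}, handle_order_placed)` returns `False`, writes the stack trace to the service's own log, and leaves the untouched message on `error_queue`; the publisher never hears about it.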

A good overall strategy for the second case is to define a generalized message type for unrecoverable errors (technically you could just use the FaultException) and install a component in the pipeline which forwards all faults to a fault queue, thus (a) ensuring that you don't lose any, and (b) collecting them all into a central location, which will make your life a whole lot easier when you have 30 different web services on 10 different servers. It's really very easy to set up a global exception handler in WCF - just install your own IErrorHandler, which sees every unhandled exception thrown by an operation and lets you log it and shape the fault that goes back. You can also attach to the Faulted event of the ServiceHost, but that event only tells you that the host itself has entered the Faulted state, not that an individual operation failed.
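The shape of such a pipeline component can be sketched outside WCF as a plain Python decorator. This is not WCF's `IErrorHandler`; `forward_faults`, `GenericFault`, `charge`, and the in-memory `fault_queue` are hypothetical names chosen for illustration. The idea shown is that the caller gets one generalized error type carrying only a reference id, while the full detail goes to a central queue.

```python
import functools
import queue
import traceback
import uuid
from dataclasses import dataclass

# Stand-in for a central, durable fault queue shared by all services.
fault_queue: queue.Queue = queue.Queue()


@dataclass
class GenericFault:
    """The only error shape callers ever see: no detail, just a reference."""
    fault_id: str
    message: str = "The request could not be processed."


def forward_faults(service_name: str):
    """Wrap an operation so unhandled errors are queued, not leaked."""
    def decorator(operation):
        @functools.wraps(operation)
        def wrapper(*args, **kwargs):
            try:
                return operation(*args, **kwargs)
            except Exception as exc:
                fault_id = str(uuid.uuid4())
                # Full detail stays on the service side for developers.
                fault_queue.put({
                    "fault_id": fault_id,
                    "service": service_name,
                    "operation": operation.__name__,
                    "error": repr(exc),
                    "stack_trace": traceback.format_exc(),
                })
                return GenericFault(fault_id)
        return wrapper
    return decorator


@forward_faults("billing")
def charge(amount: int) -> str:
    """Hypothetical operation with a bug for negative amounts."""
    if amount < 0:
        raise RuntimeError("negative amount reached the ledger")
    return "charged"
```

Support staff can look up a `fault_id` reported by a caller in the central queue and find the stack trace there, which keeps the details discoverable without ever sending them over the wire.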

But in summary: Instrument your systems so that you can resolve serious issues proactively and don't fault for recoverable errors. To the end user, downtime is downtime; make the exception details discoverable for developers and support staff but don't leak them to users.