Architecture – Difference Between Robustness and Fault-Tolerance

Systems / programs / distributed algorithms / … are often described as "robust" or "fault-tolerant".

What is the difference?


Details:

When I google for +robust +"fault-tolerant", I only get two hits, both unhelpful.

When I search Google Scholar for the terms, I find a lot of papers that have both terms in their titles. Unfortunately, they do not precisely define the terms. But since they use both terms, it seems that neither implies the other.

Best Answer

Both describe the consistency of an application's behavior, but "robustness" describes an application's response to its input, while "fault-tolerance" describes an application's response to its environment.

An app is robust when it can work consistently with inconsistent data. For example: a maps application is robust when it can parse addresses in various formats with various misspellings and return a useful location. A music player is robust when it can continue decoding an MP3 after encountering a malformed frame. An image editor is robust when it can modify an image with embedded EXIF metadata it might not recognize -- especially if it can make changes to the image without wrecking the EXIF data.
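To make the music-player example concrete, here is a minimal sketch of a decoder that keeps going after a malformed frame. The frame layout (a sync byte, a length byte, the payload, and a one-byte checksum) and the names `encode_frame` and `decode_stream` are invented for illustration; real MP3 framing is different and more involved.

```python
SYNC = 0xFF  # hypothetical sync marker, not the real MP3 frame header


def encode_frame(payload: bytes) -> bytes:
    """Build one toy frame: sync byte, length, payload, checksum (payload <= 255 bytes)."""
    return bytes([SYNC, len(payload)]) + payload + bytes([sum(payload) % 256])


def decode_stream(data: bytes) -> list[bytes]:
    """Return the payloads of all valid frames, skipping over malformed bytes."""
    payloads = []
    i = 0
    while i < len(data):
        if data[i] != SYNC or i + 1 >= len(data):
            i += 1  # not a frame start: resynchronize on the next byte
            continue
        length = data[i + 1]
        end = i + 2 + length  # index of the checksum byte
        if end >= len(data):
            i += 1  # frame would run past the end of the data: keep scanning
            continue
        payload = data[i + 2:end]
        if sum(payload) % 256 != data[end]:
            i += 1  # checksum mismatch: treat as garbage and resynchronize
            continue
        payloads.append(payload)
        i = end + 1  # jump to the byte after this frame
    return payloads


# A stream with garbage between two good frames still yields both payloads.
stream = encode_frame(b"abc") + b"\xff\x05xx" + encode_frame(b"def")
frames = decode_stream(stream)
```

The point is the shape of the loop: bad input costs one skipped frame instead of ending playback.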

An app is fault-tolerant when it can work consistently in an inconsistent environment. A database application is fault-tolerant when it can access an alternate shard when the primary is unavailable. A web application is fault-tolerant when it can continue handling requests from cache even when an API host is unreachable. A storage subsystem is fault-tolerant when it can return results calculated from parity when a disk member is offline.
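The web-application case could look roughly like the sketch below: a wrapper that remembers the last good response and serves it when the upstream call fails. `CachedFetcher` and the `fetch` callable are hypothetical names, and treating `OSError` (which includes `ConnectionError`) as "host unreachable" is an assumption of this example, not a rule.

```python
from typing import Callable


class CachedFetcher:
    """Serve the last good response when the upstream call fails."""

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch  # hypothetical upstream call, e.g. an HTTP request
        self._cache: dict[str, str] = {}

    def get(self, key: str) -> str:
        try:
            value = self._fetch(key)
        except OSError:
            # Upstream unreachable: fall back to the cached copy if there is one.
            if key in self._cache:
                return self._cache[key]
            raise  # nothing cached, so the failure has to surface
        self._cache[key] = value
        return value
```

Note that nothing here inspects the data itself. The wrapper only reacts to the environment failing, which is what separates it from the robustness examples.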

In both cases, the application is expected to remain stable, behave uniformly, preserve data integrity, and deliver useful results even when an error is encountered. But when evaluating robustness, you may find criteria involving data, while when evaluating fault-tolerance, you'll find criteria involving uptime.

One doesn't necessarily lead to the other. A mobile voice-recognition app can be very robust, providing an uncanny ability to recognize speech consistently in a variety of regional accents with huge amounts of background noise. But if it's useless without a fast cellular data connection, it's not very fault-tolerant. Similarly, a web publishing application can be immensely fault-tolerant, with multiple redundancies at every level, capable of losing whole data centers without failing, but if it drops a user table and crashes the first time someone registers with an apostrophe in their last name, it's not robust at all.
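The apostrophe failure is usually a symptom of building SQL by pasting user input into the query string. As a sketch of the robust alternative, the snippet below uses Python's standard `sqlite3` module with a bound parameter; the `users` table and the `register` function are made up for this example.

```python
import sqlite3


def register(conn: sqlite3.Connection, last_name: str) -> None:
    """Insert a user, passing the name as a bound parameter rather than as SQL text."""
    conn.execute("INSERT INTO users (last_name) VALUES (?)", (last_name,))
    conn.commit()


conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE users (last_name TEXT NOT NULL)")

# Both names are stored verbatim; neither is interpreted as SQL.
register(conn, "O'Brien")
register(conn, "x'); DROP TABLE users; --")
rows = conn.execute("SELECT last_name FROM users ORDER BY rowid").fetchall()
```

Because the driver never parses the name as part of the statement, odd input is just data, and the table survives.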

If you're looking for scholarly literature to help describe the distinction, you might look in specific domains that make use of software, rather than broadly software in general. Distributed applications research might be fertile ground for fault-tolerance criteria, and Google has published some of their research that might be relevant. Data modeling research likely addresses questions of robustness, as scientists are particularly interested in the properties of robustness that yield reproducible results. You can probably find papers describing statistical applications that might be helpful, as in climate modeling, RF propagation modeling, or genome sequencing. You'll also find engineers discussing "robust design" in things like control systems.

The Google File System whitepaper describes their approach to fault-tolerance problems, which generally rests on the assumption that component failures are routine, so the application must adapt to them:

This project for a class at Rutgers supports a "component-failure" oriented definition of "fault tolerance":

There are loads of papers on "robust modeling XYZ", depending on the field you investigate. Most will describe their criteria for "robust" in the abstract, and you'll find it all has to do with how the model deals with input.

This brief from a NASA climate scientist describes robustness as a criterion for evaluating climate models:

This paper from an MIT researcher examines wireless protocol applications, a domain in which fault-tolerance and robustness overlap, but the authors use "robust" to describe applications, protocols, and algorithms, while they use "fault-tolerance" in reference to topology and components:
