What does the "crash early" concept mean?

While reading The Pragmatic Programmer (2nd edition), I came across Tip 38: Crash Early. Basically, the author, at least to my understanding, advises to avoid catching exceptions and let the program crash. He goes on to say:

One of the benefits of detecting problems as soon as you can is that
you can crash earlier, and crashing is often the best thing you can do.
The alternative may be to continue, writing corrupted data to some
vital database or commanding the washing machine into its twentieth
consecutive spin cycle.

Later he says:

In these environments, programs are designed to fail, but that failure
is managed with supervisors. A supervisor is responsible for running
code and knows what to do in case the code fails, which could include
cleaning up after it, restarting it, and so on.

I am struggling to translate that into real code. What could be the supervisor the author is referring to? In Java, I am used to writing a lot of try/catch blocks. Do I need to stop doing that? If so, what do I replace them with? Do I simply let the program restart every time there is an exception?

Here is the example the author used (Elixir):

try do
  add_score_to_board(score)
rescue
  e in InvalidScore ->
    Logger.error("Can't add invalid score. Exiting")
    reraise e, __STACKTRACE__
  e in BoardServerDown ->
    Logger.error("Can't add score: board is down. Exiting")
    reraise e, __STACKTRACE__
  e in StaleTransaction ->
    Logger.error("Can't add score: stale transaction. Exiting")
    reraise e, __STACKTRACE__
end

This is how Pragmatic Programmers would write this:

add_score_to_board(score)

Best Answer

Basically, the author, [...] advises to avoid catching exceptions and let the program crash

No, that is a misunderstanding.

The recommendation is to let a program terminate as soon as possible once there is an indication that it cannot safely continue (if you prefer, replace "crash" with "end gracefully"). The important word here is not "crash" but "early": as soon as such an indication becomes apparent in some part of the code, the program should not "hope" that code executed later might still work, but simply end execution, ideally with a full error report. A common way to do this is to throw a specific exception, which carries the information about where the problem occurred up to the outermost scope, where the program is terminated.
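In Java terms, this could look roughly like the following minimal sketch. The names `ScoreBoard`, `addScore`, `InvalidScoreException`, and `CrashEarlyDemo` are hypothetical and only mirror the question's example. The check throws at the first sign of trouble, and only the outermost scope decides to report and terminate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical exception type: signals that the program cannot safely continue.
class InvalidScoreException extends RuntimeException {
    InvalidScoreException(String message) {
        super(message);
    }
}

// Hypothetical score board, used only for illustration.
class ScoreBoard {
    private final List<Integer> scores = new ArrayList<>();

    // Fail at the first indication of a problem, before any state is modified.
    void addScore(int score) {
        if (score < 0) {
            throw new InvalidScoreException("Negative score: " + score);
        }
        scores.add(score);
    }

    int size() {
        return scores.size();
    }
}

class CrashEarlyDemo {
    // Stands in for the outermost scope. Returns the exit status that a real
    // program would pass to System.exit.
    static int run(ScoreBoard board, int... scores) {
        try {
            for (int score : scores) {
                board.addScore(score); // nothing after a failing call runs
            }
            return 0;
        } catch (RuntimeException e) {
            e.printStackTrace(); // full error report, including the origin
            return 1;
        }
    }

    public static void main(String[] args) {
        int status = run(new ScoreBoard(), 42, -1, 7);
        // A real program would call System.exit(status) here.
        System.out.println("exit status: " + status);
    }
}
```

Note that the score `7` is never added: once `-1` is rejected, no later code gets a chance to run on a board that might be in a questionable state.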

Moreover, the recommendation is not against catching exceptions in general. It is against the abuse of catching unexpected exceptions to keep a program from ending. Continuing to run when it is unclear whether that is safe can mask severe errors, makes it hard to find the root cause of a problem, and risks causing more damage than if the program had simply stopped.
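To make the distinction concrete, here is a small Java sketch. `PortConfig`, `parsePort`, and the default value are hypothetical. The method handles the one exception it expects and knows how to recover from, and lets everything else propagate, in contrast to a blanket `catch (Exception e)` that would hide unexpected failures.

```java
class PortConfig {
    static final int DEFAULT_PORT = 8080; // hypothetical default

    // Expected failure: the input is not a number. Recovery is well defined,
    // so catching this specific exception here is fine.
    static int parsePort(String text) {
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_PORT;
        }
        // Deliberately no catch (Exception e): an unexpected problem, such as
        // text == null (a bug in the caller), surfaces as a NullPointerException
        // instead of being silently turned into the default.
    }

    public static void main(String[] args) {
        System.out.println(parsePort("9090"));
        System.out.println(parsePort("not a number"));
    }
}
```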

Your example shows how to catch some severe exceptions for logging. It does not simply continue execution, though: it rethrows those exceptions, which will probably end the program. That is exactly in line with the "crash early" idea.
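Since you mention Java, the same log-and-rethrow pattern might look like this there. This is only a sketch: `ScoreService`, `addScoreToBoard`, and `BoardServerDownException` are hypothetical stand-ins for the names in the Elixir example, and `java.util.logging` is used just because it ships with the JDK.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical exception, mirroring one of the cases in the Elixir example.
class BoardServerDownException extends RuntimeException {
    BoardServerDownException(String message) {
        super(message);
    }
}

class ScoreService {
    private static final Logger LOG = Logger.getLogger(ScoreService.class.getName());

    // Hypothetical operation that fails because the board server is unreachable.
    static void addScoreToBoard(int score) {
        throw new BoardServerDownException("board server unreachable");
    }

    static void submit(int score) {
        try {
            addScoreToBoard(score);
        } catch (BoardServerDownException e) {
            // Record what happened, then rethrow so the failure is not hidden.
            LOG.log(Level.SEVERE, "Can't add score: board is down. Exiting", e);
            throw e;
        }
    }

    public static void main(String[] args) {
        try {
            submit(10);
        } catch (BoardServerDownException e) {
            System.out.println("propagated to caller: " + e.getMessage());
        }
    }
}
```

The try/catch here adds nothing but the log line. If the outermost scope already reports uncaught exceptions, a bare call to `addScoreToBoard(score)` behaves the same way, which is the point of the shorter version in the book.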

As for your question:

What could be the supervisor the author is referring to?

Such a supervisor is either a person who deals with the failure of a program, or another program running in a separate process that monitors the activity of other, more complex programs and can take appropriate action when one of them "fails".

Which of these it is depends heavily on the kind of program and the potential cost of a failure. Imagine the failure scenarios for:

  • a desktop application with some GUI for managing address data in a database

  • a malware scanner on your PC

  • the software which makes the regular backups for the Stack Exchange sites

  • software which does automatic high speed stock trading

  • software which runs your favorite search engine or social network

  • the software in your newest smart TV or your smartphone

  • controller software for an insulin pump

  • controller software for steering of an airplane

  • monitoring software for a nuclear power plant

I think you can work out for yourself which of these examples a human supervisor is enough for, and which require an "automatic" supervisor to keep the system stable even when one of its components fails.
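As a rough illustration of an automatic supervisor, here is a minimal Java sketch. The `Supervisor` class and its restart limit are hypothetical, and for brevity it runs the worker in the same process. A real supervisor would usually be a separate process, as described above, or a facility of the runtime, as in the Erlang and Elixir environments the book refers to.

```java
import java.util.concurrent.atomic.AtomicInteger;

class Supervisor {
    private final int maxRestarts;

    Supervisor(int maxRestarts) {
        this.maxRestarts = maxRestarts;
    }

    // Runs the worker; if it crashes, restarts it up to maxRestarts times.
    // Returns true if some run finished normally, false if the supervisor gave up.
    boolean supervise(Runnable worker) {
        for (int attempt = 0; attempt <= maxRestarts; attempt++) {
            try {
                worker.run();
                return true;
            } catch (RuntimeException e) {
                // The worker made no attempt to recover; it simply crashed.
                System.err.println("worker crashed (run " + (attempt + 1) + "): " + e.getMessage());
            }
        }
        return false; // escalate, for example by alerting a human
    }

    public static void main(String[] args) {
        AtomicInteger runs = new AtomicInteger();
        // Hypothetical worker that crashes on its first two runs.
        Runnable flaky = () -> {
            if (runs.incrementAndGet() < 3) {
                throw new IllegalStateException("transient failure");
            }
        };
        boolean ok = new Supervisor(5).supervise(flaky);
        System.out.println("recovered: " + ok + " after " + runs.get() + " runs");
    }
}
```

The worker contains no error handling at all. Deciding what a failure means (restart, clean up, or give up and escalate) is left entirely to the supervisor, which is what allows the worker code to stay as short as the single-line version in the book.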
