You have to expect your program to terminate for more reasons than just an unhandled exception anyway: a power failure, say, or another process that crashes the whole system. I would therefore recommend terminating and restarting the application, but with some measures in place to mitigate the consequences of such a restart and to minimize the possible data loss.
Start by analysing the following points:
How much data can actually get lost in case of a program termination?
How severe is such a loss really for the user? Can the lost data be reconstructed in less than 5 minutes, or are we talking about losing a day's work?
How much effort is it to implement some "intermediate backup" strategy? Don't rule this out because "the user would have to enter a change reason" on a regular save operation, as you wrote in a comment. Think instead of something like a temporary file or state which can be reloaded automatically after a program crash; a sketch of this idea follows below. Many kinds of productivity software do this (for example, MS Office and LibreOffice both have an "autosave" feature and crash recovery).
In case the data was wrong or corrupted, can the user see this easily (maybe after a restart of the program)? If so, you may offer an option to let the user save the data (accepting a small chance it is corrupted), then force a restart, reload it, and let the user check whether the data looks fine. Make sure not to overwrite the last regularly saved version (write to a temporary location/file instead) to avoid corrupting the old version.
Whether such an "intermediate backup" strategy is a sensible option ultimately depends on the application and its architecture, and on the nature and structure of the data involved. But if the user will lose less than 10 minutes of work, and such a crash happens once a week or less often, I would probably not invest too much thought into this.
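As an illustration, here is a minimal sketch of such a temporary-file autosave in Java. All class and method names are hypothetical, and it assumes the application state can be serialized to a byte array and that the document path is absolute:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.concurrent.*;
import java.util.function.Supplier;

// Hypothetical autosave helper: periodically snapshots application state
// to a recovery file next to the real document, never touching the last
// regular save. A clean save deletes the recovery file; after a crash it
// survives and can be offered to the user on the next start.
public class Autosave {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Path documentFile;
    private final Path recoveryFile;
    private final Supplier<byte[]> snapshot; // serializes the current state

    public Autosave(Path documentFile, Supplier<byte[]> snapshot) {
        this.documentFile = documentFile;
        this.recoveryFile = documentFile.resolveSibling(
                documentFile.getFileName() + ".recovery");
        this.snapshot = snapshot;
    }

    public void start() {
        scheduler.scheduleAtFixedRate(this::writeRecoveryFile,
                1, 1, TimeUnit.MINUTES);
    }

    private void writeRecoveryFile() {
        try {
            Path tmp = Files.createTempFile(
                    documentFile.getParent(), "autosave", ".tmp");
            Files.write(tmp, snapshot.get());
            // Atomic move, so a crash mid-write never corrupts the recovery file.
            Files.move(tmp, recoveryFile,
                    StandardCopyOption.REPLACE_EXISTING,
                    StandardCopyOption.ATOMIC_MOVE);
        } catch (Exception e) {
            // An autosave failure must not take the application down.
            e.printStackTrace();
        }
    }

    // Called after a successful regular save.
    public void discardRecoveryFile() throws IOException {
        Files.deleteIfExists(recoveryFile);
    }

    // Called at startup: a leftover recovery file means we crashed.
    public boolean hasRecoveryData() {
        return Files.exists(recoveryFile);
    }
}
```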
Silent But Deadly
When writing enterprise software, you will eventually learn an essential truth: the worst bug in the world is not one that causes your program to crash. The worst bug in the world is one that causes your program to silently produce a wrong answer that goes unnoticed but eventually produces a massive negative effect (with severe financial implications for your employer). Thus, error messages and crashes are A Good Thing™, because they indicate that your program detected a problem.
Amazing Grace
Now, this seems to conflict with another enterprise virtue: "degrade gracefully". Blowing up and not returning any response at all hardly looks like graceful degradation. That is why many folks will try very hard to return some response if they can. Indeed, it is why many frameworks, like Spring, will catch all top-level exceptions and wrap them in a 500 response, as you describe. In general, I think this is OK. After all, most exceptions don't really require a restart of the entire app server if you can just kill and restart a server thread. A sane framework will be careful not to catch Java Errors, like OutOfMemoryError, for obvious reasons.
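For concreteness, here is a sketch of that kind of top-level handler in Spring (the class name and response body are illustrative). Catching Exception rather than Throwable is exactly what lets Errors propagate:

```java
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

// A minimal top-level handler in the spirit of what Spring does for you:
// any uncaught Exception from a controller becomes a 500 response, while
// Errors such as OutOfMemoryError are deliberately not caught and will
// take the thread (or the JVM) down.
@RestControllerAdvice
public class TopLevelExceptionHandler {

    @ExceptionHandler(Exception.class) // Exception, not Throwable: Errors still propagate
    public ResponseEntity<String> handleUncaught(Exception e) {
        // Log with the full stack trace here, so the failure is not silent.
        return ResponseEntity
                .status(HttpStatus.INTERNAL_SERVER_ERROR)
                .body("Internal server error");
    }
}
```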
But there is one more point to consider: once you get beyond a single server, you will likely have a load balancer in front of your service. And when the LB times out or gets a closed connection, it will generally return a 502 or 504 to its client. Thus, the LB will often transform your "server crash" into a client 5xx automatically! Best of both worlds.
Worst Case
In your scenario, what is the worst that can happen if you don't catch the exceptions? Your answer: "Well, my game server dies, and nobody can play!!!" But that's not the worst case. The worst case is, everyone is playing your game, but griefers are ruining it. Players file a bug report and tell you that bans aren't working, but you look at the logs and everything looks fine. Or, legitimate players are getting banned by griefers, and instead of being able to rejoin in a timely manner, the bans are lasting indefinitely, because your server happily ignores failures. The worst thing isn't your game crashing. It's your player trust crashing. Good luck trying to reset that.
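In code, that failure mode is usually a swallowed exception. A hypothetical sketch (BanService and BanStore are invented names):

```java
// Hypothetical persistence component; recordBan may fail.
interface BanStore {
    void recordBan(String playerId) throws Exception;
}

class BanService {
    private final BanStore banStore;

    BanService(BanStore banStore) { this.banStore = banStore; }

    // The anti-pattern behind "the logs look fine but bans don't work":
    // the caller cannot tell success from failure.
    public void banPlayer(String playerId) {
        try {
            banStore.recordBan(playerId);
        } catch (Exception e) {
            // Swallowed: the server keeps running, nothing is logged,
            // and the ban silently never happens.
        }
    }
}
```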
Best Answer
Good question.
The first question that comes to my mind is: if the data is already there then in what sense did the save fail? It sure sounds like it succeeded to me. But let's assume for the sake of argument that you really do have many different reasons why an operation can fail.
The second question that comes to my mind is: is the information you wish to return to the user actionable? That is, are they going to make some decision based on that information?
When the "check engine" light comes on, I open up the hood, verify that there is an engine in my car that is not on fire, and take it to the garage. Of course at the garage they have all kinds of special purpose diagnostic equipment that tells them why the check engine light is on, but from my perspective, the warning system is well designed. I do not care whether the problem is because the oxygen sensor is recording an abnormal level of oxygen in the combustion chamber, or because the idle speed detector is unplugged, or whatever. I'm going to take the same action, namely, let someone else figure this out.
Does the caller care why the save failed? Are they going to do anything about it, other than either give up or try again?
Let's assume for the sake of argument that the caller really is going to take different actions depending on the reason why the operation failed.
The third question that comes to mind is: is the failure mode exceptional? I think you might be confusing "possible" with "unexceptional". I would think of two users attempting to modify the same record at the same time as an exceptional-but-possible situation, not a common situation.
Let's assume for the sake of argument that it is unexceptional.
The fourth question that comes to mind is: is there a way to reliably detect the bad situation ahead of time?
If the bad situation is in my "exogenous" bucket, then no. There's no way to reliably answer "did another user modify this record?", because they might modify it after you ask the question. The answer is stale as soon as it is produced.
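A sketch of why such a pre-check cannot work (Repository and its methods are hypothetical):

```java
// Hypothetical repository; wasModifiedSince answers the question at one
// instant only.
interface Repository {
    boolean wasModifiedSince(String recordId, long version);
    void save(String recordId, byte[] data, long version);
}

class CheckThenSave {
    // Check-then-act over shared mutable state: the answer to the check
    // is stale by the time save() runs, so another user can still sneak
    // a modification in between the two calls.
    static void trySave(Repository repo, String id, byte[] data, long version) {
        if (!repo.wasModifiedSince(id, version)) {
            // ...another user can modify the record right here...
            repo.save(id, data, version);
        }
    }
}
```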
The fifth question that comes to mind is: is there a way to design the API so that the bad situation can be prevented?
For example, you could make the "save" operation require two steps. Step one: acquire a lock on the record being modified. That operation either succeeds or fails and so can return a Boolean. The caller can then have a policy about how to deal with failure: wait a while and try again, give up, whatever. Step two: once the lock is acquired, do the save and release the lock. Now the save always succeeds and so there is no need to worry about any kind of error handling. If the save fails, that is truly exceptional.
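A minimal sketch of that two-step API (all names are illustrative):

```java
// Hypothetical two-step API: step one can ordinarily fail and reports
// that as a boolean; step two runs only while the lock is held, so a
// failure there is truly exceptional.
interface RecordStore {
    boolean tryLock(String recordId);
    void save(String recordId, byte[] data);
    void unlock(String recordId);
}

class Saver {
    // The caller owns the failure policy: retry, back off, or give up.
    static void saveWithRetry(RecordStore store, String id, byte[] data)
            throws InterruptedException {
        while (!store.tryLock(id)) {
            Thread.sleep(100); // or give up after N attempts and report "busy"
        }
        try {
            store.save(id, data); // an exception here is genuinely exceptional
        } finally {
            store.unlock(id);
        }
    }
}
```

The point of this shape is that the ordinary failure (lock contention) travels through the Boolean, while exceptions are reserved for the genuinely exceptional.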