Systemd execute command after start limit reached

systemd

I've been working on a systemd service to wrap an administration script and I'm trying to gracefully handle it completely breaking.

Right now I have Restart=always so it tries again when something fails, but some failure states require attention (missing config file, bad SQL, etc.), so I don't want it spinning continuously in the background in an uncorrectable state.

I found StartLimitInterval, StartLimitBurst, and StartLimitAction, which stop the restart attempts after X failures in Y seconds, but it turns out that the only actions available for StartLimitAction are rebooting or shutting down the machine, which is overkill.

I've been looking at OnFailure and wrote a mini service to send an alert email when it's triggered, but OnFailure triggers every time the service dies, not when it hits the start limit, so we get a bunch of emails instead of just one.

Any ideas of what to try next?

Best Answer

From the systemd.unit man page:

OnFailure=

A space-separated list of one or more units that are activated when this unit enters the "failed" state. A service unit using Restart= enters the failed state only after the start limits are reached.

However, the second sentence appears to be a newer constraint: it is in the manual for systemd version 241 on my Arch installations, but not in the manual for version 219 on my CentOS 7 installation.

You can check your systemd version with systemctl --version
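
On a systemd version with this behavior, the setup from the question might look roughly like the sketch below. It is untested and only illustrative: the script path and the alert-email.service unit name are placeholders I made up, and the limit values are arbitrary. On newer versions the rate-limit settings go in the [Unit] section as StartLimitIntervalSec= (the older StartLimitInterval= name went in [Service]).

```ini
[Unit]
Description=Admin script wrapper (hypothetical example)
# Activated when this unit enters the "failed" state; on versions with the
# behavior quoted above, that happens only once the start limit is reached.
OnFailure=alert-email.service
# Give up after 5 start attempts within 60 seconds (arbitrary values).
StartLimitIntervalSec=60
StartLimitBurst=5

[Service]
# Placeholder path, not from the original question.
ExecStart=/usr/local/bin/admin-script.sh
Restart=always
# Keep this short enough that StartLimitBurst starts fit inside the interval,
# otherwise the limit is never reached and the unit restarts forever.
RestartSec=5
```

With no StartLimitAction= set, the unit should just stay in the failed state once the limit is hit, so the alert unit would be triggered once instead of on every crash.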

I know it's an old question, but I wanted to share this for anyone else who has the same problem.