Cron – How to monitor and react when a scheduled job fails? – general question

alerts cron monitoring scheduled-task

In many projects my team has faced problems with 'silent failures' of important components. Many tasks run behind the scenes, and if something fails (whether from errors in logic or hardware problems), in most cases the responsible person is not notified (or not notified immediately).

I know about heavyweight monitoring tools that could solve some of these problems, but they are overcomplicated and too expensive for our team.

I am interested in what solutions you use for such problems.

Thanks for your responses so far. To be more precise, I am looking for something that meets the following criteria:

  1. reliability – I think that relying on solutions like cron's MAILTO, or running a notification script when the job script returns some value, is not fully reliable (e.g. when there are general problems with the server). A fully reliable solution is deployed in a separate environment.

  2. ability to alert the interested person immediately (email cannot be treated as immediate in some cases; SMS would be far better). It would also be great to prevent an 'email avalanche', where you receive an email every minute with the same information.

  3. requires as little setup and configuration knowledge as possible.

  4. ability to monitor and alert when script execution exceeds a given time.

  5. alerting rules are maintained from one place.

I did some research and couldn't find anything that covers these criteria. Nagios (or similar tools) comes close to being good enough, but in my opinion such tools are too complicated, not user-friendly, and require complicated integration. They also require hiring someone who is familiar with the tool, or spending a lot of time to master it.

The main reason I'm asking is that our software company is developing a solution, based on an interesting approach, that can fulfill these requirements (or most of them), and it already works pretty well in our projects. We are now aiming to release it to the community, and we are looking for solutions that do nearly the same thing so that we can analyze the advantages and disadvantages of our approach and choose a direction for development. Comments about your problems with existing solutions, and about the things you really appreciate in them, are welcome as well.

Best Answer

Use Nagios with passive checks, and wrap your scheduled jobs so that they send a message (via send_nsca) to your Nagios server indicating what happened when they complete. If the job errors, Nagios will alert.
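
As a rough illustration of such a wrapper, here is a minimal Python sketch. It assumes the send_nsca client is installed on the machine running the job and that the config path shown is where yours lives; the host name, service description, and the function names (build_passive_result, run_job, report) are made up for the example. Anything other than exit code 0 is reported as CRITICAL, which is a simplification you may want to refine.

```python
import subprocess
import sys

# Nagios plugin return codes: 0 = OK, 2 = CRITICAL.
OK, CRITICAL = 0, 2


def build_passive_result(host, service, exit_code, output):
    """Format one passive service check result the way send_nsca reads it
    from stdin: host<TAB>service<TAB>return_code<TAB>plugin_output<NEWLINE>."""
    state = OK if exit_code == 0 else CRITICAL
    # Collapse whitespace so tabs/newlines in the job output cannot break fields.
    summary = " ".join(output.split())[:200] or "no output"
    return f"{host}\t{service}\t{state}\t{summary}\n"


def run_job(command):
    """Run the scheduled job and return (exit_code, combined stdout+stderr)."""
    proc = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return proc.returncode, proc.stdout


def report(host, service, command, nagios_server,
           nsca_config="/etc/send_nsca.cfg"):  # assumed path, adjust as needed
    """Run the job, then submit its result to the Nagios server via send_nsca."""
    exit_code, output = run_job(command)
    line = build_passive_result(host, service, exit_code, output)
    subprocess.run(
        ["send_nsca", "-H", nagios_server, "-c", nsca_config],
        input=line, text=True, check=True,
    )
    return exit_code
```

A crontab entry would then call a small script that invokes report("web01", "nightly-backup", ["/usr/local/bin/backup.sh"], "nagios.example.com") instead of calling the job directly; all of those values are placeholders.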

More relevant to the problem you're seeing: you can also set Nagios to alert if it hasn't heard from your cron job for too long, so you can spot jobs that are failing silently.
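
In Nagios this is done with the check_freshness and freshness_threshold options on the service definition. The idea behind it is simple enough to sketch in a few lines of Python; this is only an illustration of the "haven't heard from it" logic, not how Nagios implements it, and the function name and data layout are invented for the example.

```python
import time


def stale_jobs(last_heard, thresholds, now=None):
    """Return the names of jobs that have been silent for too long.

    last_heard: job name -> Unix timestamp of the last result received
    thresholds: job name -> maximum allowed silence in seconds
    A job that has never reported at all is also treated as stale.
    """
    now = time.time() if now is None else now
    stale = []
    for job, max_silence in thresholds.items():
        seen = last_heard.get(job)
        if seen is None or now - seen > max_silence:
            stale.append(job)
    return sorted(stale)
```

The important part is that this check runs on the monitoring server, not on the machine running the jobs, so it still fires when the whole job server is down.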

All free and fairly trivial to set up.
