Jenkins jobs get killed randomly once a day at a fixed time

I've got a Jenkins server with 140 clients that execute compilation and test jobs. For the past few days, up to roughly a quarter of the jobs have been killed with SIGTERM at 21:38 +0200. It happens at the same time each day, but a different set of hosts is affected each day.

I am not aware of any changes to either the Jenkins master or the clients.

With auditctl, we found the executable and the user of the process that kills our jobs: it's a java process that belongs to the user Jenkins uses to log in on the clients. However, it's very short-lived. It does not appear in the process lists that I dump just before and just after 21:38. Neither does its parent process.

This is what the audit record looks like:

time->Mon Jun 25 21:38:05 2018
type=PROCTITLE msg=audit(1529955485.892:7806): proctitle=6A617661002D6A6172002F6F7074[...]
type=OBJ_PID msg=audit(1529955485.892:7806): opid=11583 oauid=1608 ouid=1608 oses=411 ocomm="java"
type=SYSCALL msg=audit(1529955485.892:7806): arch=c000003e syscall=62 success=yes exit=0 a0=2d3f a1=0 a2=a a3=2d3f items=0 ppid=29509 pid=29521 auid=1608 uid=1608 gid=20 euid=1608 suid=1608 fsuid=1608 egid=20 sgid=20 fsgid=20 tty=(none) ses=411 comm="java" exe="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.131-11.b12.el7.x86_64/jre/bin/java" key=(null)
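
For reference, some fields of this record can be decoded by hand. On x86_64 (arch=c000003e), syscall 62 is kill, and a0 is its first argument, the target PID in hex. The proctitle value is the hex-encoded command line with NUL bytes between arguments. The following is a minimal Python sketch that decodes only the visible part of proctitle (the rest is elided above); the helper name decode_proctitle is made up for illustration. Running ausearch with -i should give a similar interpreted view.

```python
def decode_proctitle(hex_prefix: str) -> str:
    """Decode an audit proctitle hex string; NUL bytes separate argv entries."""
    raw = bytes.fromhex(hex_prefix)
    parts = raw.split(b"\0")
    return " ".join(p.decode("utf-8", errors="replace") for p in parts)


# Only the part of proctitle that is visible in the record; the rest is elided.
VISIBLE_PROCTITLE = "6A617661002D6A6172002F6F7074"

print(decode_proctitle(VISIBLE_PROCTITLE))  # java -jar /opt
print(int("2d3f", 16))                      # 11583, the same PID as opid in OBJ_PID
```

So the killer was started as "java -jar /opt[...]", and the PID in a0 matches the opid of the killed process.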

Neither /var/log/messages, /var/log/cron nor /var/log/secure contain any relevant entries at this point in time.

Any ideas why my jobs get killed, or how to investigate this further?

Best Answer

This sounds like the Jenkins Process Tree Killer. When a build exits, the process tree killer attempts to kill all processes related to that build, even if the processes have been disowned from the build process and are no longer child processes of the build process.

For example, I have a set of jobs that run VirtualBox VMs. Sometimes my VMs were dying seemingly randomly. I poked around more and discovered that the VMs all died when another build finished. As it turns out, when running any VirtualBox command, VirtualBox will look for a running VirtualBox daemon and connect to it if it exists or start one if it does not exist. The Jenkins process tree killer was sometimes killing the VirtualBox daemon when a build exited, since the daemon was started by that build.

Your situation sounds similar. I suspect you have a job that finishes every day around the same time, and that when it finishes, the Jenkins process tree killer is reaping background processes that affect your other jobs.
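
One way to check this theory: as far as I know, the process tree killer decides which processes belong to a build by looking at their environment variables, not at the parent/child relationship. The sketch below is not Jenkins code. It is a rough Python helper, assuming Linux and assuming the marker variables are BUILD_ID and JENKINS_NODE_COOKIE (which ones apply may depend on your Jenkins version and job type). It lists the processes on a client that carry such a marker, so you can see which build a victim process would be attributed to.

```python
from pathlib import Path


def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE content of /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep:
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def processes_with_marker(names, proc_root="/proc"):
    """Return {pid: {name: value}} for processes whose environment sets any of names."""
    found = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/self
        try:
            env = parse_environ((entry / "environ").read_bytes())
        except OSError:
            continue  # process already exited, or it belongs to another user
        hits = {name: env[name] for name in names if name in env}
        if hits:
            found[int(entry.name)] = hits
    return found


# Usage on a client, run as the Jenkins user (variable names are assumptions):
# for pid, hits in sorted(processes_with_marker(["BUILD_ID", "JENKINS_NODE_COOKIE"]).items()):
#     print(pid, hits)
```

If a job that gets killed shows the marker value of a different build, that build finishing is the likely trigger.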

The Jenkins documentation on the process tree killer explains how to disable it for particular jobs.
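
As I recall, the usual per-job workaround is to start the long-lived process with the marker variable overridden, so the killer no longer matches it: BUILD_ID=dontKillMe for freestyle jobs and JENKINS_NODE_COOKIE=dontKillMe for Pipeline jobs. Please verify this against the documentation for your Jenkins version. A minimal Python sketch of that idea, where detached_env and start_daemon are hypothetical helper names and the command in the usage comment is a placeholder:

```python
import os
import subprocess


def detached_env(base=None):
    """Copy the environment, overriding the variables the killer is assumed to match on."""
    env = dict(os.environ if base is None else base)
    env["BUILD_ID"] = "dontKillMe"             # assumed marker for freestyle jobs
    env["JENKINS_NODE_COOKIE"] = "dontKillMe"  # assumed marker for Pipeline jobs
    return env


def start_daemon(argv):
    """Start a helper process in its own session, without the build's markers."""
    return subprocess.Popen(argv, env=detached_env(), start_new_session=True)


# Usage inside a build step (placeholder command):
# start_daemon(["my-daemon", "--foreground"])
```

There is also, as far as I know, a global switch: starting Jenkins with -Dhudson.util.ProcessTree.disable=true turns the killer off entirely, at the cost of leaving stray processes behind after builds.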