I have a cluster submission system based on Docker and I'm trying to get it to also support local execution. When executing locally, the command that starts the job is basically

```shell
docker run /results/src/launcher/local.sh
```

For cluster execution another script is run instead. The difficulty I'm facing is how to run the code as the local user while still supporting Ctrl+C correctly. Since `docker run` starts the entrypoint as UID 0 (root), I need to run the user's entrypoint with `su -c`. Basically, the script needs to run two things:
- A prerun script (called as root)
- A Python program (called as calling user)
The meat of the script is currently the following:
```shell
# Run prerun script.
$PRERUN &
PRERUN_PID=$!
wait $PRERUN_PID
status=$?            # capture the exit code immediately after wait,
PRERUN_FINISHED=true # before any other command overwrites $?
if [ "$status" -eq 0 ]; then
    echo "Prerun finished successfully."
else
    echo "Prerun failed with code: $status"
    exit $status
fi

# Run main program, dropping root privileges.
su -c '/opt/conda/bin/python /results/src/launcher/entrypoint.py \
    > >(tee -a /results/stdout.txt) 2> >(tee -a /results/stderr.txt >&2)' \
    $USER &
PYTHON_PID=$!
wait $PYTHON_PID
status=$?            # again, capture before the flag assignment clobbers $?
PYTHON_FINISHED=true
if [ "$status" -eq 0 ]; then
    echo "Entrypoint finished successfully."
else
    echo "Entrypoint failed with code: $status"
    exit $status
fi
```
Signal propagation is handled in the same script by:
```shell
_int() {
    echo "Caught SIGINT signal!"
    if [ "$PRERUN_PID" -ne 0 ] && [ "$PRERUN_FINISHED" = "false" ]; then
        echo "Sending SIGINT to prerun script!"
        kill -INT $PRERUN_PID
        PRERUN_PID=0
    fi
    if [ "$PYTHON_PID" -ne 0 ] && [ "$PYTHON_FINISHED" = "false" ]; then
        echo "Sending SIGINT to Python entrypoint!"
        kill -INT $PYTHON_PID
        PYTHON_PID=0
    fi
}

PRERUN_PID=0
PYTHON_PID=0
PRERUN_FINISHED=false
PYTHON_FINISHED=false
trap _int SIGINT
```
I have a signal handler in `/results/src/launcher/entrypoint.py`, which is the code run by `su -c`. However, it never seems to get the SIGINT. I assume that the problem lies in the `su -c`. As expected, `PYTHON_PID` in the bash script isn't assigned the PID of the Python interpreter, but of the `su` program. If I do an `os.system("ps xa")` in my Python entrypoint, I see the following:
```
  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     0:00 /bin/bash /results/src/launcher/local.sh user 1000 1000 /results/src/example/compile.sh
   61 ?        S      0:00 su -c /opt/conda/bin/python /results/src/launcher/entrypoint.py \ > >(tee -a /results/stdout.txt) 2> >(tee -a /results/stderr.txt >&2) user
   62 ?        Ss     0:00 bash -c /opt/conda/bin/python /results/src/launcher/entrypoint.py \ > >(tee -a /results/stdout.txt) 2> >(tee -a /results/stderr.txt >&2)
   66 ?        S      0:01 /opt/conda/bin/python /results/src/launcher/entrypoint.py
   67 ?        S      0:00 bash -c /opt/conda/bin/python /results/src/launcher/entrypoint.py \ > >(tee -a /results/stdout.txt) 2> >(tee -a /results/stderr.txt >&2)
   68 ?        S      0:00 bash -c /opt/conda/bin/python /results/src/launcher/entrypoint.py \ > >(tee -a /results/stdout.txt) 2> >(tee -a /results/stderr.txt >&2)
   69 ?        S      0:00 tee -a /results/stdout.txt
   70 ?        S      0:00 tee -a /results/stderr.txt
   82 ?        R      0:00 /opt/conda/bin/python /results/src/launcher/entrypoint.py
   83 ?        S      0:00 /bin/dash -c ps xa
   84 ?        R      0:00 ps xa
```
`PYTHON_PID` is assigned the PID 61. However, I would like to be able to gracefully shut down the Python interpreter, so I should be able to catch some signal there. Does anyone know how to forward a SIGINT to the Python interpreter in a situation like this? Would there be a smarter way to do what I'm trying to accomplish? I have full control over the code that puts together the `docker run` command when code is scheduled for local execution.
Best Answer
There are a few things going on here. First, you are running a shell script as PID 1 inside the container. That process, in various scenarios, is what sees the Ctrl+C or the `docker stop` sending the signal, and it is up to bash to trap and handle it. By default, when running as PID 1, bash will ignore the signal (I believe to handle single-user mode on a Linux server). You would need to explicitly trap and handle that signal at the top of the script, so that the script catches the SIGTERM and SIGINT (Ctrl+C generates a SIGINT), kills its child processes, and exits immediately.
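A trap along the lines described above can be demonstrated with `sleep` standing in for the child process (a sketch; `/tmp/trap_demo.sh` is just a scratch path for the demo):

```shell
#!/bin/bash
# Write a tiny script that traps SIGTERM/SIGINT, kills its children,
# and exits immediately with the conventional 128+SIGINT status.
cat > /tmp/trap_demo.sh <<'EOF'
#!/bin/bash
trap 'kill $(jobs -p) 2>/dev/null; exit 130' TERM INT
sleep 60 &          # stand-in for a long-running child process
wait
EOF
chmod +x /tmp/trap_demo.sh

/tmp/trap_demo.sh &    # run the demo in the background
PID=$!
sleep 1
kill -TERM "$PID"      # simulate docker stop / Ctrl+C reaching the script
wait "$PID"
STATUS=$?
echo "script exited with $STATUS"   # prints: script exited with 130
```

Without the trap, the same script would sit in `wait` until `sleep` finished, ignoring the signal entirely.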
Next, there is the `su` command, which itself forks a process that can break signal handling. I prefer `gosu`, which runs an `exec` instead of a `fork` syscall, removing itself from the process list. You can install `gosu` with a few lines in a Dockerfile.

Lastly, there's a lot of logic in the entrypoint to fork and then wait for a background process to finish. This could be simplified by running the processes in the foreground. The last command you run can be started with an `exec` to avoid leaving the shell running. You can catch errors with `set -e`, or expand that to show debugging of what commands are being run with a `-x` flag. The end result is a much shorter entrypoint script.

If you can get rid of the `/results` logs, you should be able to switch from `/bin/bash` to `/bin/sh` at the top of the script, and just rely on `docker logs` to see the results from the container.
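For the `gosu` install, a minimal Dockerfile sketch for a Debian/Ubuntu-based image (an assumption about your base image; `gosu` is packaged in recent Debian releases, and the `gosu nobody true` smoke test is the pattern recommended by the gosu project):

```dockerfile
RUN apt-get update \
 && apt-get install -y --no-install-recommends gosu \
 && rm -rf /var/lib/apt/lists/* \
 && gosu nobody true   # smoke test: verify gosu works in this image
```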
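Putting the suggestions together (trap at the top, foreground processes, `set -ex`, `gosu` instead of `su`, and `exec` for the last command), the entrypoint script might end up looking like this sketch, assuming `$PRERUN` and `$USER` are set as in the question:

```shell
#!/bin/bash
set -ex

# While bash is PID 1, forward SIGTERM/SIGINT to children and exit.
trap 'kill $(jobs -p) 2>/dev/null; exit 130' TERM INT

# Prerun runs in the foreground as root; set -e aborts on failure,
# propagating its exit code.
$PRERUN

# Drop privileges and replace the shell with the Python entrypoint, so the
# Python process receives signals directly and can run its own handler.
exec gosu "$USER" /opt/conda/bin/python /results/src/launcher/entrypoint.py \
    > >(tee -a /results/stdout.txt) 2> >(tee -a /results/stderr.txt >&2)
```

After the `exec`, there is no intermediate `su`/shell process left to swallow the SIGINT, which is exactly the failure mode shown in the `ps xa` output above.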