PostgreSQL Backup – Troubleshooting pg_dumpall Hangs

Tags: backup, docker, postgresql

I have a bash script that dumps all of my postgres databases from docker:

function dump_postgres {
    mkdir -p "${BACKUP_DIR}/postgres/"
    # find every container that exposes the default postgres port and dump it
    docker ps -a --format '{{.Names}}\t{{.Ports}}' | grep 5432/tcp | awk '{ print $1 }' | while read -r line; do
        echo "extracting database from container '${line}'"
        docker exec -t "${line}" pg_dumpall -v --lock-wait-timeout=600 -c -U postgres > "${BACKUP_DIR}/postgres/${line}.sql"
    done
}

dump_postgres >> "${LOG}" 2>> "${ERROR}"
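
For context, BACKUP_DIR, LOG and ERROR are defined earlier in the script; roughly like this (the paths here are placeholders, not the real values):

# placeholder definitions -- the real script sets these elsewhere
BACKUP_DIR="/var/backups"                # destination directory for all dumps
LOG="${BACKUP_DIR}/backup.log"           # stdout of the backup run
ERROR="${BACKUP_DIR}/backup.err"         # stderr of the backup run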

The script figures out which docker containers are listening on the default postgres port (5432) and dumps each cluster in plain SQL format.

My problem is that, every other day, this command hangs when started by cron: the dump simply stalls, the docker exec doing the dump never exits, and nothing is written to stderr.

Do you have any idea how to solve this?

EDIT:

The crontab looks like this:

30 1 * * * /home/mrbackup/backup.sh

The job is scheduled every night at 1:30. By 3:00, a watchdog script already reports that the backup has failed and that pg_dumpall is still hanging.
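
As a stopgap I could at least bound each dump so a hang cannot block the cron run forever; a sketch, assuming GNU coreutils timeout is available on the host (the 30-minute limit is an arbitrary choice):

# abort a single dump after 30 minutes instead of hanging indefinitely;
# timeout exits with status 124 when the limit is hit
if ! timeout 30m docker exec -t "${line}" pg_dumpall -v --lock-wait-timeout=600 -c -U postgres \
        > "${BACKUP_DIR}/postgres/${line}.sql"; then
    echo "dump of '${line}' failed or timed out" >&2
fi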

EDIT2:
Tonight quassel is the problem, but the affected container changes from night to night; nextcloud also causes problems.
Some more info about the processes:

# ps -efH
root       493     1  0 Mai24 ?        00:00:04   /usr/sbin/cron -f
root     12401   493  0 01:30 ?        00:00:00     /usr/sbin/CRON -f
mrbackup 12402 12401  0 01:30 ?        00:00:00       /bin/sh -c /home/mrbackup/backup.sh
mrbackup 12403 12402  0 01:30 ?        00:00:00         /bin/bash /home/mrbackup/backup.sh
mrbackup 12412 12403  0 01:30 ?        00:00:00           /bin/bash /home/mrbackup/backup.sh
mrbackup 12428 12412  0 01:30 ?        00:00:11             docker exec -t quassel_postgres_1 pg_dumpall -v --lock-wait-timeout=600 -c -U postgres

root     12449 11150  0 01:30 ?        00:00:00         pg_dumpall -v --lock-wait-timeout=600 -c -U postgres
root     12467 12449  0 01:30 ?        00:00:00           sh -c "/usr/local/bin/pg_dump"  -v --lock-wait-timeout '600' -Fp 'user=postgres dbname=quassel'
root     12468 12467  0 01:30 ?        00:00:16             /usr/local/bin/pg_dump -v --lock-wait-timeout 600 -Fp user=postgres dbname=quassel

strace suggests that most sub-processes of pg_dump are waiting on a lock (a futex):

futex(0xc42054cd38, FUTEX_WAIT, 0, NULL

Two of the processes are restarting a syscall:

restart_syscall(<... resuming interrupted futex ...>)
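
For reference, this is how I attached strace to the hung pg_dump (PID 12468 from the ps output above; this needs root on the host):

# follow the process and all of its threads/children (-f)
strace -f -p 12468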

Looking at the database with this query:

SELECT a.datname,
       c.relname,
       l.mode,
       l.GRANTED,
       a.wait_event_type,
       a.usename,
       a.query, 
       a.query_start,
       age(now(), a.query_start) AS "age", 
       a.pid 
FROM  pg_stat_activity a
  JOIN pg_locks l ON l.pid = a.pid
  JOIN pg_class c ON c.oid = l.relation
WHERE
  NOW() - query_start > '10 second'
ORDER BY a.query_start;

gives me the following:

 datname |                 relname                 |      mode       | granted | wait_event_type | usename  |                      query                       |          query_start          |       age       |  pid  
---------+-----------------------------------------+-----------------+---------+-----------------+----------+--------------------------------------------------+-------------------------------+-----------------+-------
 quassel | pg_attribute_relid_attnum_index         | AccessShareLock | t       |                 | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095
 quassel | pg_attribute_relid_attnam_index         | AccessShareLock | t       |                 | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095
 quassel | pg_attribute                            | AccessShareLock | t       |                 | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095
 quassel | pg_class                                | AccessShareLock | t       |                 | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095
 quassel | pg_init_privs_o_c_o_index               | AccessShareLock | t       |                 | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095
 quassel | pg_roles                                | AccessShareLock | t       |                 | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095
 quassel | pg_init_privs                           | AccessShareLock | t       |                 | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095
 quassel | pg_depend_reference_index               | AccessShareLock | t       |                 | postgres | COPY public.sender (senderid, sender) TO stdout; | 2018-07-21 23:30:18.253695+00 | 07:11:41.798605 | 25095

and 100 more rows
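
To get a stuck run unwedged without restarting the container, the hanging backend can be cancelled, or if that has no effect, terminated, from another session; a sketch using PID 25095 from the output above:

# first try a polite cancel of the running COPY ...
docker exec quassel_postgres_1 psql -U postgres -c 'SELECT pg_cancel_backend(25095);'
# ... and only if that does nothing, terminate the backend outright
docker exec quassel_postgres_1 psql -U postgres -c 'SELECT pg_terminate_backend(25095);'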

EDIT:
I have fixed my initial problem, but I am still interested in why the pipe breaks, and why the process does not terminate in that situation.

Best Answer

It looks like piping the dump out of the docker container sometimes broke the pipe. Using the -f parameter of pg_dumpall, which writes the dump to a file inside the container so it can be copied out afterwards, solved the problem:

docker exec -t "${line}" pg_dumpall -v --lock-wait-timeout=600 -c -U postgres -f /tmp/export.sql
docker cp "${line}:/tmp/export.sql" "${BACKUP_DIR}/postgres/${line}.sql"
docker exec -t "${line}" rm /tmp/export.sql
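
With that change the whole function looks roughly like this (a sketch: the loop is unchanged from the question, only the dump step is replaced by dump-to-file, copy, clean up):

function dump_postgres {
    mkdir -p "${BACKUP_DIR}/postgres/"
    docker ps -a --format '{{.Names}}\t{{.Ports}}' | grep 5432/tcp | awk '{ print $1 }' | while read -r line; do
        echo "extracting database from container '${line}'"
        # dump to a file inside the container instead of piping through docker exec
        docker exec -t "${line}" pg_dumpall -v --lock-wait-timeout=600 -c -U postgres -f /tmp/export.sql
        # copy the finished dump to the host, then remove it from the container
        docker cp "${line}:/tmp/export.sql" "${BACKUP_DIR}/postgres/${line}.sql"
        docker exec -t "${line}" rm /tmp/export.sql
    done
}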