Tar does not retain directory permissions

Tags: permissions, tar

I am copying a directory structure of hundreds of millions of small images between two servers. The file structure, ownership and permissions needs to be retained during the copy. Our testing has shown that the fastest way to perform this copy is to tar the files and pipe them over netcat with something like the following commands:

# TARGET (extract):
$ nc -l 2222 | pigz -d | sudo tar xpf - --same-owner -C /

# SOURCE: 
$ tar -cf - -T selected-images-to-copy.txt | pigz | pv | nc 1.1.1.1 2222

Other methods of performing the copy (e.g. rsync, scp) are simply too slow, taking weeks to complete because they do not saturate the network, whereas this approach completes within a matter of days. However, whilst the images themselves are created with the right ownership and permissions, the directories that the extraction creates are not.

If I don't extract the tar, but instead view the contents I have:

$ tar tvzf test.tar.gz
-rw-r--r-- root/www-data 319434 2017-09-23 05:47 mnt/a/b/c/0012Z.jpg
-rw-r--r-- root/www-data 323647 2017-09-23 05:47 mnt/a/b/c/0005Z.jpg
-rw-r--r-- root/www-data 315962 2017-09-23 05:47 mnt/a/b/c/0013Z.jpg
-rw-r--r-- root/www-data 313594 2017-09-23 05:47 mnt/a/b/c/0007Z.jpg

However, when extracted, all the folders created by the extraction between mnt and the image are owned by root:root and have the permissions 0750, which makes them inaccessible to anyone but root.

$ sudo ls -al mnt/a/b
total 12
drwxr-x--- 3 root root 4096 Oct  6 15:01 .
drwxr-x--- 3 root root 4096 Oct  6 15:01 ..
drwxr-x--- 3 root root 4096 Oct  6 15:01 c

Because of the number of files, recursive operations like chown and chmod would take forever to run. We have a custom Python script that alters the permissions, but again this adds days to the process, so I'd like to get the permissions right out of the box if possible.

Note: while researching this, I did find a Server Fault question that raises a similar issue, but the conclusion there was that it was a bug fixed in tar v1.24.

$ tar --version
tar (GNU tar) 1.27.1

Best Answer

If selected-images-to-copy.txt is a list of files only (the last element of each path is always a file, never a directory), here's a solution to create the archive with the proper directory rights:

EDIT: I added a better solution at the end while keeping the intermediate solutions around, building on dave_thompson_085's comments and thinking about what could be improved with the information available.
As he wrote (and as I didn't explain completely), the important part of the solution is to use --no-recursion. This saves the metadata of every manually added directory on the path, up to the files themselves, without including all the other unwanted directories and files that recursion would otherwise add.

awk -F/ '{ d=$1; for (i=2; i <= NF; i++) { print d; d=d "/" $i }; print d }' selected-images-to-copy.txt > selected-images-to-copy-with-explicit-arborescences.txt
tar cf - --no-recursion -T selected-images-to-copy-with-explicit-arborescences.txt | pigz | pv | nc 1.1.1.1 2222
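The effect of --no-recursion can be checked in a throwaway sandbox (a made-up tree under mktemp; this assumes GNU tar):

```shell
#!/bin/sh
# Build a tiny tree: one wanted file plus an unwanted sibling.
tmp=$(mktemp -d)
cd "$tmp"
mkdir -p mnt/a/b
echo x > mnt/a/b/wanted.jpg
echo y > mnt/a/b/unwanted.jpg

# Default behaviour: naming a directory drags in everything under it.
tar cf default.tar mnt
tar tf default.tar          # lists mnt/, mnt/a/, ... and BOTH .jpg files

# With --no-recursion only the entries read from -T are archived,
# so each directory level must be named explicitly.
printf '%s\n' mnt mnt/a mnt/a/b mnt/a/b/wanted.jpg |
    tar cf explicit.tar --no-recursion -T -
tar tf explicit.tar         # lists only the four entries given above

cd /
rm -rf "$tmp"
```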

If you really want to do it on-the-fly, using bash's <() construct:

tar cf - --no-recursion -T <(awk -F/ '{ d=$1; for (i=2; i <= NF; i++) { print d; d=d "/" $i }; print d }' selected-images-to-copy.txt) | pigz | pv | nc 1.1.1.1 2222

The awk command just reconstructs each path, printing one directory level at a time up to the file itself.
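For example, on a single made-up path the expansion looks like this:

```shell
# Expand one file path into every directory level plus the file itself.
echo 'mnt/a/b/c/0012Z.jpg' |
    awk -F/ '{ d=$1; for (i=2; i <= NF; i++) { print d; d=d "/" $i }; print d }'
# mnt
# mnt/a
# mnt/a/b
# mnt/a/b/c
# mnt/a/b/c/0012Z.jpg
```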

That way every directory on the path of a file to save is also put in the archive, and with --no-recursion nothing else is added. So the ownership of every directory leading up to the file is saved and restored correctly.

There's still a performance trade-off: the expanded list contains many repeated arborescences, so the second tar will often redo a chown on the same base directory. You could sort -u the output of awk to remove those duplicates, but sort might take a very long time before producing results and letting the transfer start. A short perl script that stores unique elements in memory (the trade-off is memory usage, but I doubt that's a problem) outputs unique entries without delay and without sorting. So the solution becomes:

tar cf - --no-recursion -T <(awk -F/ '{ d=$1; for (i=2; i <= NF; i++) { print d; d=d "/" $i }; print d }' selected-images-to-copy.txt | perl -w -e 'use strict; my %unique; while (<>) { if (not $unique{$_}++) { print } }'  ) | pigz | pv | nc 1.1.1.1 2222

EDIT: If the content of selected-images-to-copy.txt is a more or less sorted list of files (the unsorted output of a find [...] -type f kind of command is good enough), here's a solution that doesn't need any memory (which might indeed have become a problem with hundreds of millions of entries). It is enough to remember the last longest path and compare it to the next path:
- either the next path is not a prefix of the previous one, meaning it's a new arborescence (or a new file in the same arborescence): it has to be archived and it becomes the new "last longest path". If the initial list isn't at least presented as a tree (as in at least a find command's output, or of course a sorted list), some benign repetitions will appear;
- or it is a prefix (a substring matching from the 1st character), meaning it's a directory that was already seen as part of the previous path, and it can be safely ignored.

I'm adding a trailing / in the comparison to easily detect that mnt/a/b/foo/ isn't a prefix of mnt/a/b/foobar. With mnt/a/b/foobar/file4.png followed by mnt/a/b/foo/file5.png as input, the ownership of the directory mnt/a/b/foo wouldn't have been restored without this trick. So the perl command is replaced with:

awk '{ if (index(old,$0 "/") != 1) { old=$0; print } }'

This sample:

file1.png
mnt/a/b/file2.png
mnt/a/b/file3.png
mnt/a/b/c/foobar/file4.png
mnt/a/b/c/foo/file5.png
mnt/a/b/file6.png
mnt/a/b/d/file7.png

Through this filter:

awk -F/ '{ d=$1; for (i=2; i <= NF; i++) { print d; d=d "/" $i }; print d }' | awk '{ if (index(old,$0 "/") != 1) { old=$0; print } }'

Gives those directories and files ready for tar --no-recursion:

file1.png
mnt
mnt/a
mnt/a/b
mnt/a/b/file2.png
mnt/a/b/file3.png
mnt/a/b/c
mnt/a/b/c/foobar
mnt/a/b/c/foobar/file4.png
mnt/a/b/c/foo
mnt/a/b/c/foo/file5.png
mnt/a/b/file6.png
mnt/a/b/d
mnt/a/b/d/file7.png
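The sample run above can be reproduced directly in a shell:

```shell
# Feed the sample file list through both awk stages:
# first expand every path level, then drop already-seen prefixes.
printf '%s\n' \
    file1.png \
    mnt/a/b/file2.png \
    mnt/a/b/file3.png \
    mnt/a/b/c/foobar/file4.png \
    mnt/a/b/c/foo/file5.png \
    mnt/a/b/file6.png \
    mnt/a/b/d/file7.png |
awk -F/ '{ d=$1; for (i=2; i <= NF; i++) { print d; d=d "/" $i }; print d }' |
awk '{ if (index(old,$0 "/") != 1) { old=$0; print } }'
```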

So the solution with the whole pair of commands becomes (tar run as root already implies -p and --same-owner, and it's better to drop bash's fancy <() when a plain | works and easily allows breaking the long line with \ for readability):

# TARGET (extract):
$ nc -l -p 2222 | pigz -d | sudo tar xf - -C /

# SOURCE: 
$ awk -F/ '{ d=$1; for (i=2; i <= NF; i++) { print d; d=d "/" $i }; print d }' selected-images-to-copy.txt | \
      awk '{ if (index(old,$0 "/") != 1) { old=$0; print } }' | \
      tar cf - --no-recursion -T - | pigz | pv | nc -w 60 1.1.1.1 2222