CentOS – What's faster, cp -R or unpacking tar.gz files?

centos, compression, copy

I have some tar.gz files that total many gigabytes on a CentOS system. Most of the tar.gz files are actually pretty small, but the ones with images are large. One is 7.7G, another is about 4G, and a couple are around 1G.

I have unpacked the files once already and now I want a second copy of all those files.

I assumed that copying the unpacked files would be faster than re-unpacking them, but I started running cp -R about 10 minutes ago and so far less than 500M has been copied. I feel certain that the unpacking process was faster.

Am I right?

And if so, why? It doesn't seem to make sense that unpacking would be faster than simply duplicating existing structures.

Best Answer

Consider the two scenarios:

  • Copying requires that you read the full set of files from disk and write the same amount back to disk.
  • Extracting the tar.gz requires that you read a much smaller file from disk, decompress it, and write the full set of files to disk.

If your CPU is not being taxed by the decompression process, it stands to reason that the I/O operations are the limiting factor. By that argument (and since you have to write the same amount of data in both cases), reading a smaller file (the tar.gz) takes less time than reading the larger unpacked tree. Time is also saved because reading one large file sequentially is faster than opening and reading many small files.
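
If you want to verify this on your own data, you can time both approaches directly. Below is a minimal sketch, assuming hypothetical paths (substitute your own archive, source tree and targets); dropping the page cache between runs requires root and is optional, but it keeps the comparison fair:

    # Hypothetical paths - adjust to your own layout.
    ARCHIVE=/data/images.tar.gz
    SRC=/data/images
    mkdir -p /data/copy-test /data/untar-test

    # Drop the page cache so neither run benefits from cached data (root only).
    sync && echo 3 > /proc/sys/vm/drop_caches

    # Time a plain recursive copy of the already-unpacked tree.
    time cp -R "$SRC" /data/copy-test/

    # Clear the cache again, then time extracting the archive instead.
    sync && echo 3 > /proc/sys/vm/drop_caches
    time tar -xzf "$ARCHIVE" -C /data/untar-test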

The time saved depends on the difference between the read time saved (I/O) and the decompression time added (CPU). Therefore, for files which are minimally compressible (e.g. data that is already compressed, such as mp3, jpg or zip files), the decompression time is likely to outweigh the reduction in read time, and it will in fact be slower to decompress than to copy.
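
One way to gauge which side of that trade-off a given archive falls on is to look at its compression ratio before deciding. A quick check with gzip -l (the filename here is hypothetical):

    # Show compressed size, uncompressed size and compression ratio.
    # A ratio near 0% means the contents are barely compressible (e.g. JPEGs),
    # so extraction saves little read I/O compared with a plain copy.
    # Note: gzip's stored size field wraps at 4 GiB, so the uncompressed figure
    # is unreliable for archives larger than that.
    gzip -l /data/images.tar.gz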

(It is worth noting that the slower the I/O, the more time will be saved by using the compressed file - one such scenario would be if the source and target of the copy operation are on the same physical disk.)
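
If you want to confirm where the bottleneck is on your system, you can watch disk utilisation while the copy or extraction runs; this sketch uses iostat from the sysstat package (which may need to be installed):

    # Print extended per-device statistics every second.
    # A %util column near 100% on the relevant disk means the operation is
    # I/O-bound; a mostly idle disk while gzip uses a full core means it is
    # CPU-bound instead.
    iostat -x 1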