Linux – Efficiently Retrieve Files from Tar or Cpio Archives

archive, cpio, linux, tar

I am using tar to archive a group of very large (multi-GB) bz2 files.

If I use tar -tf file.tar to list the files within the archive, this takes a very long time to complete (~10-15 minutes).

Likewise, cpio -t < file.cpio takes just as long to complete, plus or minus a few seconds.

Accordingly, retrieving a file from an archive (via tar -xf file.tar myFileOfInterest.bz2, for example) is just as slow.

Is there an archival method out there that keeps a readily available "catalog" with the archive, so that an individual file within the archive can be retrieved quickly?

For example, some kind of catalog that stores a pointer to a particular byte in the archive, as well as the size of the file to be retrieved (as well as any other filesystem-specific particulars).

Is there a tool (or argument to tar or cpio) that allows efficient retrieval of a file within the archive?

Best Answer

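tar (and cpio, afio, pax, and similar programs) use stream-oriented formats: they are intended to be streamed directly to a tape or piped into another process. Because such an archive has no central index, the tool has to read through it from the start to list or locate a member, which is why listing takes about as long as reading the whole file. While, in theory, it would be possible to add an index at the end of the file/stream, I don't know of any version that does (it would be a useful enhancement, though).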

It won't help with your existing tar or cpio archives, but there is another tool, dar ("disk archive"), that creates archive files containing such an index and can give you fast direct access to individual files within the archive.

If dar isn't included with your Unix/Linux distribution, you can find it at:

http://dar.linux.free.fr/
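
The catalog idea from the question (a byte offset plus a size for each file) can also be approximated for an existing tar file by keeping a separate index next to it. The following is only a rough sketch in Python, not a feature of tar itself: it assumes the archive is an uncompressed tar (as in the question, where only the members are bz2-compressed), that it contains no sparse members, and that a JSON sidecar file is acceptable. The function names build_catalog, save_catalog and extract_with_catalog are made up for this illustration. Building the catalog still needs one full pass over the archive; only later retrievals are fast.

```python
import json
import tarfile


def build_catalog(tar_path):
    """Scan the archive once and record where each regular file's data lives."""
    catalog = {}
    # Mode "r:" opens an uncompressed tar only (assumption: no outer compression).
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if member.isreg():
                # offset_data is the byte position of the member's data in the file.
                # If a name occurs twice, the later entry wins, as with tar -x.
                catalog[member.name] = {
                    "offset": member.offset_data,
                    "size": member.size,
                }
    return catalog


def save_catalog(catalog, catalog_path):
    """Store the catalog as a JSON sidecar file (hypothetical layout)."""
    with open(catalog_path, "w") as fh:
        json.dump(catalog, fh)


def extract_with_catalog(tar_path, catalog_path, name, dest_path, chunk=1 << 20):
    """Copy one member out of the archive by seeking straight to its data."""
    with open(catalog_path) as fh:
        entry = json.load(fh)[name]
    remaining = entry["size"]
    with open(tar_path, "rb") as src, open(dest_path, "wb") as dst:
        src.seek(entry["offset"])
        while remaining > 0:
            block = src.read(min(chunk, remaining))
            if not block:
                raise EOFError("archive is shorter than the catalog says")
            dst.write(block)
            remaining -= len(block)
```

With this, save_catalog(build_catalog("file.tar"), "file.tar.json") would be run once, and extract_with_catalog("file.tar", "file.tar.json", "myFileOfInterest.bz2", "myFileOfInterest.bz2") afterwards. The sketch copies only the file contents; ownership, permissions and timestamps are not restored, and the catalog must be rebuilt whenever the archive changes.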

Related Solutions

Linux command line – search a file inside a tar archive

Try this (omit the initial slash):

tar -tzf backup_2010-09-27.tar.gz etc/passwd

Linux – Creating tar from the files within the specific date range

I think I would actually do this using find and then pass its output to tar. Using your example, let's assume you want files that are between 60 and 90 days old.

find /home/public_html/images -type f -daystart -mtime -90 -and -mtime +60 -print0 | tar -cjf images_60-90.tar.bz2 --null -T -

This finds all the files that were last modified more than 60 days ago and less than 90 days ago and places them in the tarball named images_60-90.tar.bz2. Using -print0 together with tar's --null -T - (GNU tar options that read a NUL-separated file list from standard input) protects you from file names containing spaces, and it also sidesteps the command line maximum length (which can be found by running the command getconf ARG_MAX), because the names are never passed as arguments. The archive is created in a single tar invocation rather than by appending through xargs, since tar cannot append to a compressed archive. I haven't tested that command, so you may have to do more tweaking.

If, however, you know that there are no spaces in any file names, and the combined length of the names stays below the value of ARG_MAX, you can simplify your command a bit.

find /home/public_html/images -type f -daystart -mtime -90 -and -mtime +60 | xargs tar -cjf images_60-90.tar.bz2
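
For comparison, the same selection can be sketched with Python's standard library instead of find and tar. This is a swapped-in technique, not what the answer above uses, and it makes some simplifying assumptions: ages are measured in exact 24-hour periods from the current time rather than with find's -daystart rounding, so files close to the 60-day or 90-day boundary may be selected differently. The function names files_in_age_range and write_bz2_tar are invented for this illustration.

```python
import os
import stat
import tarfile
import time


def files_in_age_range(root, min_days, max_days, now=None):
    """Return regular files under root modified between min_days and max_days ago."""
    now = time.time() if now is None else now
    newest = now - min_days * 86400  # anything newer than this is too recent
    oldest = now - max_days * 86400  # anything older than this is too old
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            st = os.lstat(path)  # lstat: do not follow symlinks, like find -type f
            if stat.S_ISREG(st.st_mode) and oldest < st.st_mtime < newest:
                selected.append(path)
    return sorted(selected)


def write_bz2_tar(paths, archive_path):
    """Create a new bzip2-compressed tar containing the given paths."""
    with tarfile.open(archive_path, "w:bz2") as tar:
        for path in paths:
            tar.add(path)
```

A call such as write_bz2_tar(files_in_age_range("/home/public_html/images", 60, 90), "images_60-90.tar.bz2") would then correspond roughly to the commands above. File names with spaces and very long file lists need no special handling here, because the names never pass through a shell command line.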
