Linux – Corrupted zip file after using split and cat on Linux

compression, file-transfer, linux

I had to split a 2.6 GB zip file in order to send it through a slow uplink. I did this:

split -b 879m BIGFILE.zip

This created xaa, xab and xac, which I uploaded to the remote server. After the transfer finished, I verified each of these 3 pieces with md5sum (both on my local system and on the server):

md5sum xaa
md5sum xab
md5sum xac

All 3 hashes on the server were identical to those on my system, so the transfer went well. Now, on the remote system, when I do this:

cat xa* > BIGFILE.zip

…then I verify the hash of this BIGFILE.zip (on both systems):

md5sum BIGFILE.zip

…and both of them match.

Now comes the interesting part. When I try to list the contents of the zip file I get an error:

unzip -l BIGFILE.zip

I get:

Archive:  BIGFILE.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of BIGFILE.zip or
        BIGFILE.zip.zip, and cannot find BIGFILE.zip.ZIP, period.

This is totally weird. I'm using the same version of "unzip" on both systems. When I run "unzip -l" on my local system, it works.

Thanks for any help.
JFA

Best Answer

Identical MD5 hashes suggest that the transfer has worked well.

A file size above 2 GB sounds suspiciously like an offset-size issue; maybe the zip utility in question doesn't handle that well? An offset beyond about 2 GB (2^31 bytes) would wrap to a negative number in a signed 32-bit integer. Can you unzip the file on the system where you zipped it? How do the two systems differ? Is one 64-bit and the problematic one 32-bit? What are the filesystems on both systems? Can you find another zip utility?
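As a rough illustration of the arithmetic, here is a minimal sketch that compares a size against the largest value a signed 32-bit offset can hold. The function name exceeds_32bit_offset and the 2600000000-byte figure (an approximation of the 2.6 GB archive) are my own assumptions; getconf LONG_BIT only reports the word size of the system it runs on, so treat it as a hint rather than proof of how unzip was built.

```shell
#!/bin/sh
# Largest offset a signed 32-bit long can hold: 2^31 - 1 bytes.
LIMIT=2147483647

# exceeds_32bit_offset SIZE_IN_BYTES (hypothetical helper)
# Exits 0 when SIZE is beyond what a signed 32-bit offset can address.
exceeds_32bit_offset() {
    [ "$1" -gt "$LIMIT" ]
}

# Report the word size of this system, if getconf is available.
if command -v getconf >/dev/null 2>&1; then
    echo "long is $(getconf LONG_BIT) bits on this system"
fi

# Approximate size of the archive from the question (assumed: 2.6 GB).
size=2600000000
if exceeds_32bit_offset "$size"; then
    echo "$size bytes is beyond the signed 32-bit limit of $LIMIT"
fi
```

For a real file, the size could be taken from `wc -c < BIGFILE.zip` instead of the hard-coded estimate.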

If you have a chance to retransmit the content, you might want to use tar.gz or keep the file size below that value. gzip-compressed content should handle this better, because it is read as a stream from the start, whereas zip stores its table of contents (the central directory) at the end of the file.
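If retransmitting is an option, the tar.gz route could look roughly like the sketch below. The function names, the piece prefix and the piece size are placeholders of mine, not anything from the question, and the destination directory is assumed to exist already.

```shell
#!/bin/sh
# Sketch: pack a directory as a tar.gz stream, split it into pieces,
# and later restore it from the pieces. All names are illustrative.

# pack_and_split SRC_DIR PREFIX PIECE_SIZE
# Writes pieces named PREFIXaa, PREFIXab, ... of at most PIECE_SIZE each.
pack_and_split() {
    tar -czf - -C "$(dirname "$1")" "$(basename "$1")" | split -b "$3" - "$2"
}

# join_and_unpack PREFIX DEST_DIR
# Concatenates the pieces in glob (alphabetical) order and extracts them.
join_and_unpack() {
    cat "$1"* | tar -xzf - -C "$2"
}
```

On the sending side this would be something like `pack_and_split ./mydata mydata.tar.gz. 879m`, and on the server `join_and_unpack mydata.tar.gz. /some/existing/dir`. Because tar reads the stream from front to back, nothing has to seek to an index at the end of the file.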

Edit: Yup, see here:

In practice, the real limit may be 2 GB on many systems, due to UnZip's use of the fseek() function to jump around within an archive. Because fseek's offset argument is usually a signed long integer, on 32-bit systems UnZip will not find any file that is more than 2 GB from the beginning of the archive [...]
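To tell a genuinely truncated archive from one that UnZip merely cannot seek through, one could look for the end-of-central-directory signature (the bytes "PK" 0x05 0x06) directly. This is only a sketch: has_eocd is a made-up helper, it assumes the archive has no trailing zipfile comment (so the record is exactly the last 22 bytes), and it assumes tail itself can handle files over 2 GB on the system in question.

```shell
#!/bin/sh
# has_eocd FILE (hypothetical helper)
# Exits 0 if the last 22 bytes of FILE start with the
# end-of-central-directory signature 50 4b 05 06 ("PK\005\006").
# Assumption: the archive carries no trailing comment.
has_eocd() {
    tail -c 22 "$1" | od -An -tx1 | tr -d ' \n' | grep -q '^504b0506'
}
```

If `has_eocd BIGFILE.zip` succeeds on the server while `unzip -l` still reports the signature as missing, that points at the tool's seeking rather than at the file.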
