Zip File Comparison – Proving Two Zip Files Are Identical

comparison, python, version control

I migrated some files from one version control system to another, and I have to prove that the contents are identical between source and destination (call them 1 and 2). There are a few zip files (let's call them A, B, and C), each of which contains hundreds of files. I am looking for the best way to do a CRC comparison between the content in the old VCS and the content in the new one.

1) Generate a CRC for each zip file as a whole and compare the CRCs of the two corresponding zip files (zip-to-zip CRC comparison).

Obviously, this approach would be easier. But I don't know what is included in the CRC calculation, or whether the CRCs of two zip files with identical file contents might still differ (because of modified dates, for example).
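To make the worry about modified dates concrete, here is a minimal sketch using only Python's standard library. It builds two in-memory archives that hold the same bytes under the same name but with different timestamps. The member name `hello.txt`, the dates, and the helper names `build_zip` and `member_crcs` are made up for illustration.

```python
import io
import zipfile
import zlib

def build_zip(date_time):
    """Build an in-memory zip with one fixed member and the given timestamp."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Hypothetical member name; only the timestamp differs between archives.
        info = zipfile.ZipInfo("hello.txt", date_time=date_time)
        zf.writestr(info, b"same content in both archives")
    return buf.getvalue()

def member_crcs(zip_bytes):
    """Return {member name: CRC-32 stored in the archive's directory}."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {info.filename: info.CRC for info in zf.infolist()}

zip_1 = build_zip((2020, 1, 1, 0, 0, 0))
zip_2 = build_zip((2021, 6, 15, 12, 30, 0))

# The timestamp is stored in the archive, so the raw bytes differ
# even though the member content is the same.
print("archives byte-identical:", zip_1 == zip_2)
print("whole-archive CRC-32:", hex(zlib.crc32(zip_1)), hex(zlib.crc32(zip_2)))
print("member CRCs equal:", member_crcs(zip_1) == member_crcs(zip_2))
```

The per-member CRC is computed over the uncompressed file data only, so it does not depend on the timestamp, while the archive as a whole does.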

2) Compare the CRC of each file in a zip against the corresponding file in the other zip (file-to-file CRC comparison).

With this approach, I will have to write a script that goes through each file in a zip (say A1), extracts its CRC, and builds a list of [file path, CRC] pairs, then does the same for the other zip (A2) and compares the two lists.
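A script like the one described could be sketched as follows. It reads the CRC-32 values that the zip format already stores for each member, so nothing has to be extracted. The function names `crc_table` and `compare_crc_tables` are my own, and the sketch assumes member paths are unique within each archive.

```python
import zipfile

def crc_table(zip_path):
    """Map each file member's path to the CRC-32 stored in the zip directory."""
    with zipfile.ZipFile(zip_path) as zf:
        return {
            info.filename: info.CRC
            for info in zf.infolist()
            if not info.is_dir()  # skip directory entries
        }

def compare_crc_tables(zip_path_1, zip_path_2):
    """Return (only in first, only in second, present in both but CRC differs)."""
    table_1 = crc_table(zip_path_1)
    table_2 = crc_table(zip_path_2)
    only_in_1 = sorted(table_1.keys() - table_2.keys())
    only_in_2 = sorted(table_2.keys() - table_1.keys())
    changed = sorted(
        name
        for name in table_1.keys() & table_2.keys()
        if table_1[name] != table_2[name]
    )
    return only_in_1, only_in_2, changed
```

Calling `compare_crc_tables("A1.zip", "A2.zip")` (hypothetical paths) returns three lists; if all three are empty, the two archives list the same paths with the same stored CRCs. Note that this trusts the CRC values recorded in the archives rather than recomputing them from the data.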

Has anyone ever done something like this?

Best Answer

If you're sure both zip files were created in exactly the same way (same compression algorithm and settings, same file order, same stored metadata such as timestamps), then you can just compare the zip files themselves.

Otherwise you will need to decompress the zips and compare contained files.
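A minimal sketch of that decompress-and-compare step, using Python's standard `zipfile` module. The function name `zips_have_identical_contents` is hypothetical, and the sketch assumes each member fits comfortably in memory and that member paths are unique within an archive.

```python
import zipfile

def zips_have_identical_contents(zip_path_1, zip_path_2):
    """Compare two zips by member names and decompressed bytes only."""
    with zipfile.ZipFile(zip_path_1) as z1, zipfile.ZipFile(zip_path_2) as z2:
        names_1 = sorted(i.filename for i in z1.infolist() if not i.is_dir())
        names_2 = sorted(i.filename for i in z2.infolist() if not i.is_dir())
        if names_1 != names_2:
            return False  # a file is missing or extra on one side
        for name in names_1:
            # read() decompresses the whole member into memory.
            if z1.read(name) != z2.read(name):
                return False
    return True
```

Because only the decompressed bytes are compared, differences in compression method, timestamps, or member order inside the archives do not affect the result.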

The checksums generated when the files were compressed (a zip stores a CRC-32 for each file) could be used to speed up comparisons, if you'll accept the chance of collisions causing false positives. A mismatch quickly shows that two files are different.

But collisions mean the best you can do with a hash is show files to "very likely" be identical. With enough bits and a good hashing algorithm, the odds of a collision are akin to winning the lottery. In a practical application you'll have to decide if the speed is worth the risk.
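If a 32-bit CRC feels like too few bits, one option is to compute a longer digest over each decompressed file. The sketch below uses SHA-256 from Python's `hashlib`; the choice of SHA-256, the chunk size, and the function name `sha256_manifest` are my assumptions, not something the zip format provides.

```python
import hashlib
import zipfile

CHUNK_SIZE = 1 << 16  # read 64 KiB at a time to keep memory use small

def sha256_manifest(zip_path):
    """Map each file member's path to the SHA-256 of its decompressed bytes."""
    manifest = {}
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            digest = hashlib.sha256()
            with zf.open(info) as member:
                for chunk in iter(lambda: member.read(CHUNK_SIZE), b""):
                    digest.update(chunk)
            manifest[info.filename] = digest.hexdigest()
    return manifest
```

Two archives can then be compared with `sha256_manifest("A1.zip") == sha256_manifest("A2.zip")` (hypothetical paths). Unlike reading the stored CRCs, this decompresses every file, so it is slower but does not depend on checksums recorded when the archive was made.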

If you're serious about proof that the files are identical, you can't ignore unlikely cases. Quarters sometimes land on their edges. Sometimes hashes collide. But sometimes bits also flip on you randomly and go undetected. So don't think a bit-by-bit comparison of the uncompressed files is guaranteed to give you a perfect proof either. What you get is lots of bits giving you really good odds.

This last case is where the CRC is useful: not as a digest, but as an error check. It makes a bit-level copy error less likely to go unnoticed. It's still not perfect, because the CRC bits can be miscopied as well.

So there just isn't a perfect proof. Do it right and you can have fantastic levels of confidence, if you have time for that.
