Why is it good practice to compare checksums when downloading a file?

tcp

Websites that supply ISO files for download often publish the MD5 checksums of those files, which we can use to confirm that a file has downloaded correctly and has not been corrupted.

Why is this necessary? Surely the error-correcting properties of TCP are sufficient. If a packet isn’t received correctly, it will be retransmitted. Doesn’t the very nature of a TCP/IP connection guarantee data integrity?

Best Answer

As others have noted, there are many ways data can be corrupted that no transport-layer checksum can catch: corruption that occurs before the checksum is calculated on the sending side, a MITM intercepting and modifying the stream (data as well as checksums), corruption that occurs after the checksum has been validated on the receiving end, and so on.

If we set all those other possibilities aside and look only at the TCP checksum itself and what it actually does to validate data integrity, it turns out that it is far from comprehensive at detecting errors. The choice of algorithm reflects the need for speed in the context of its time (the late 1970s).

This is how the TCP checksum is calculated (quoting RFC 793):

Checksum: 16 bits

The checksum field is the 16 bit one's complement of the one's complement sum of all 16 bit words in the header and text. If a segment contains an odd number of header and text octets to be checksummed, the last octet is padded on the right with zeros to form a 16 bit word for checksum purposes. The pad is not transmitted as part of the segment. While computing the checksum, the checksum field itself is replaced with zeros.

This means that any corruption that balances out when the data is summed this way will go undetected. Several categories of corruption fall through as a result; as a trivial example, changing the order of the 16-bit words will always go undetected.
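To make that concrete, here is a minimal Python sketch of the summing rule quoted above. The function name `internet_checksum` and the sample byte strings are made up for illustration, and the sketch only sums the bytes it is given; a real TCP stack also includes a pseudo-header with the IP addresses in the sum.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement of the ones' complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad an odd trailing octet on the right with zeros
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]      # next big-endian 16-bit word
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF


# Hypothetical payloads, chosen only to illustrate the weakness.
original    = b"\x12\x34\xab\xcd\x00\x01"
reordered   = b"\xab\xcd\x00\x01\x12\x34"  # same 16-bit words, different order
compensated = b"\x12\x35\xab\xcc\x00\x01"  # one word +1, another word -1
bit_flipped = b"\x12\x34\xab\xcd\x00\x00"  # a single flipped bit

print(internet_checksum(original) == internet_checksum(reordered))    # True: undetected
print(internet_checksum(original) == internet_checksum(compensated))  # True: undetected
print(internet_checksum(original) == internet_checksum(bit_flipped))  # False: detected
```

Because addition is commutative, any reordering of whole words gives the same sum, and so do any changes that cancel each other out; only errors that change the total are caught.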


In practice it catches many typical errors, but it does not at all *guarantee* integrity. It is also helped by the integrity checks done at layer 2 (e.g. the CRC32 of Ethernet frames), albeit only for transmission on the local link, and many cases of corrupted data never even get passed to the TCP stack.

Validating the data using a strong hash, or preferably a cryptographic signature, is on a whole different level in terms of ensuring data integrity. The two can barely even be compared.
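As a rough sketch of what that validation looks like on the receiving side, the Python below hashes a file in chunks with the standard `hashlib` module and compares the result with a published value. The helper names `file_sha256` and `matches_published` are hypothetical, SHA-256 is assumed as the strong hash (`hashlib.md5` would be the equivalent call for a site that only publishes MD5), and a throwaway temporary file stands in for the downloaded ISO.

```python
import hashlib
import os
import tempfile


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published_hex: str) -> bool:
    """Compare the file's digest with the value published by the download site."""
    return file_sha256(path) == published_hex.strip().lower()


# Demo: a temporary file plays the role of the downloaded image.
payload = b"pretend this is an ISO image"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

published = hashlib.sha256(payload).hexdigest()  # stand-in for the site's value
ok = matches_published(demo_path, published)
mismatch = matches_published(demo_path, "0" * 64)
os.remove(demo_path)

print(ok, mismatch)
```

A matching hash only shows that the file equals whatever the published value describes. If that value was fetched over the same untrusted channel as the file, it offers little protection against deliberate tampering, which is why a cryptographic signature is preferable.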
