Websites that supply ISO files for download often publish the MD5 checksums of those files, which we can use to confirm that a file has downloaded correctly and has not been corrupted.
Why is this necessary? Surely the error correcting properties of TCP are sufficient. If a packet isn’t received correctly, it will be retransmitted. Doesn’t the very nature of a TCP/IP connection guarantee data integrity?
Best Answer
As others have noted, there are many ways data can be corrupted that no transport-layer checksum can catch: corruption that occurs before the checksum is calculated on the sending side, a MITM intercepting and modifying the stream (data as well as checksums), corruption that occurs after the checksum has been validated on the receiving end, etc.
If we disregard all these other possibilities and focus on the TCP checksum itself and what it actually does to validate data integrity, it turns out that this checksum is far from comprehensive at detecting errors. The choice of algorithm reflects the need for speed given the hardware of the time (the late 1970s) rather than strong error detection.
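This is how the TCP checksum is calculated (per RFC 793): the checksum field is the 16-bit one's complement of the one's complement sum of all 16-bit words in the header and text, with the checksum field itself treated as zero while computing it.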
This means that any corruption that balances out when the data is summed this way will go undetected. Several categories of corruption slip through as a result, but just as a trivial example: changing the order of the 16-bit words will always go undetected.
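To make the reordering case concrete, here is a minimal Python sketch of a 16-bit one's complement checksum. It is a simplification, not the full TCP procedure: the real checksum also covers the TCP header and a pseudo-header, and the function name `internet_checksum` and the sample bytes are made up for illustration.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of 16-bit words.

    Simplified sketch: covers only the bytes passed in, not a TCP
    header or pseudo-header.
    """
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input on the right with a zero octet
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]      # next big-endian 16-bit word
        total = (total & 0xFFFF) + (total >> 16)   # fold any carry back in
    return ~total & 0xFFFF


# Illustrative payloads (hypothetical data, three 16-bit words each).
original = b"\x12\x34\x56\x78\x9a\xbc"
reordered = b"\x9a\xbc\x12\x34\x56\x78"     # same words, different order
compensated = b"\x12\x35\x56\x77\x9a\xbc"   # one word +1, another word -1

# The payloads differ, yet the checksums are identical.
assert original != reordered and original != compensated
assert internet_checksum(original) == internet_checksum(reordered)
assert internet_checksum(original) == internet_checksum(compensated)
```

Because addition is commutative, the order of the words cannot affect the sum, and any pair of changes that cancel each other out is equally invisible.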
In practice, it catches many typical errors but does not at all *guarantee* integrity. It also helps that layer 2 performs its own integrity checks (e.g. the CRC32 of Ethernet frames), albeit only for transmission on the local link, so many cases of corrupted data never even get passed to the TCP stack.
Validating the data using a strong hash, or preferably a cryptographic signature, is on a whole different level in terms of ensuring data integrity. The two can barely even be compared.
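For comparison, checking a downloaded file against a published digest might look roughly like the sketch below. It assumes the site publishes a SHA-256 hex digest; `file_digest` and `matches_published` are hypothetical helper names, and only Python's standard `hashlib` module is used.

```python
import hashlib


def file_digest(path: str, algorithm: str = "sha256",
                chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in chunks so that
    a large ISO does not have to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published(path: str, expected_hex: str,
                      algorithm: str = "sha256") -> bool:
    """Compare the file's digest with the digest published by the site.
    Surrounding whitespace and letter case in the published value are ignored."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

Unlike the 16-bit sum, reordering or offsetting parts of the file changes the digest. The check is still only as trustworthy as the channel the published digest came from, which is why a cryptographic signature is preferable.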