Algorithms – How to Compress a File with Multiple Same Bytes Using Huffman Coding

algorithmshuffman encoding

On my great quest for compressing/decompressing files with a Java implementation of Huffman coding (http://en.wikipedia.org/wiki/Huffman_coding) for a school assignment, I am now at the point of building a list of prefix codes. Such codes are used when decompressing a file. Basically, the code is made of zeroes and ones, that are used to follow a path in a Huffman tree (left or right) for, ultimately, finding a byte.

In this Wikipedia image, to reach the character m the prefix code would be 0111
enter image description here

The idea is that when you compress the file, you will basically convert all the bytes of the file into prefix codes instead (they tend to be smaller than 8 bits, so there's some gain).

So every time the character m appears in a file (which in binary is actually 1101101), it will be replaced by 0111 (if we used the tree above).

Therefore, 1101101110110111011011101101 becomes 0111011101110111 in the compressed file.

I'm okay with that. But what if the following happens:

  • In the file to be compressed there exists only one unique byte, say 1101101.
  • There are 1000 of such byte.

Technically, the prefix code of such byte would be… none, because there is no path to follow, right? I mean, there is only one unique byte anyway, so the tree has just one node.

Therefore, if the prefix code is none, I would not be able to write the prefix code in the compressed file, because, well, there is nothing to write.

Which brings this problem: how would I compress/decompress such file if it is impossible to write a prefix code when compressing? (using Huffman coding, due to the school assignment's rules)

This tutorial seems to explain a bit better about prefix codes: http://www.cprogramming.com/tutorial/computersciencetheory/huffman.html but doesn't seem to address this issue either.

Best Answer

Actually you would use run-length encoding in such a situation. In essence you store the symbol and the number of repeats.

However, if you still insist on using Huffman coding, you could pick either one or zero to represent the symbol in the tree. This causes no problem really. Your tree simply states that symbol x is represented by bit m. Then you have n number of repeats of the encoded m as data. You could even then apply aforementioned RLE on the actual data (which is essentially a run of equal bits of length n) to compress further.