I'm writing a huffman encoding program in C. I'm trying to include the least amount of information in the header as possible, I know the simplest way to decompress the file in the header would be to store the frequencies of each character in the file, but for a large file with 256 characters it would take 2304 bytes ((1 byte for character + 8 bytes for long
frequency) * 256), which I don't think is optimal.
I know I can reconstruct a tree from a preorder scan and an inorder scan of it, but that requires having no duplicate values. That is bad because I now have to store each node in the tree (in a huffman tree: n*2 - 1
with n
being the number of unique characters), twice, having each node be a long
value (which could take ((256*2 - 1) * 2) * 8 = 8176
bytes.
Is there a way I'm missing here, or are those my only options?
Thanks.
Best Answer
There are 2 separate problems, store the topography and assign the leaf nodes
Assigning the leaf nodes can be done by storing the characters in in a predefined order so it can be extracted as needed.
Storing topography can be done by having a bit vector with 2 bits per parent node in the previous layer where 1 represents a compound node and 0 represents a leaf node
so first there is 1 bit for the root which is 1 and the next 2 bits will represent the next level down
to build the tree using the
node{char value; node* left, right;}
setup will be:This is 2*n bits in the topography plus the permutation which is O(log n!) at the minimum.