Electronic – Efficient single cycle huffman decoding technique with wider input data

compressiondecoderfpgaverilogvlsi

I have a wider input data (Byte width) running in my system, and I'm implementing a huffman decoding on this incoming data. Since, the Huffman encoded words don't have the fixed length, hence I need to build a very complex state machine to cover all the possible combination for the decoding process to minimize the decoding time. (The primarily consideration of the implementation is to decode as many as words from the incoming data, and push the only incomplete encoded word to next cycle.)

I'm sure this implementation is less efficient especially it will easily to grow combinatorial logic which at the same time is hurting the system Fmax.

Is there any alternative technique that can ease the implementation? Still, fast decoding time is the primary consideration factor.

Example:

  1. Input data per clock cycle : 8 data bit
  2. Encoded word length: 1 bit / 4 bit / 5 bit / 6 bit / 7 bit

Case 1) In a single clock cycle, the input data can contain multiple encoded words, such as total 4 encoded word (3 x 1 encoded word & and 1 x 5 encoded word ), this is a very straight forward case as we can simply decode all the incoming data in a single clock cycle

Case 2) In a single clock cycle, the incoming 8 bits data only contain 1 x 6 bit length encoded word. In this case, we need to push the left over 2 bit data to be used in the next clock cycle decoding. Hence, the next decoding process need to decode from the 10 bits data instead the original 8 bits data.

Based on the case 2, If we apply it in all the possible decoding combinations, it will translate to a very bulky & complex logic. (The original problem statement)

Best Answer

If there's one thing that FPGAs are good at, it's lookup tables — both the little LUTs used for logic and the larger block RAMs that can be used as ROMs. In order to avoid falling behind, in the worst case, you need to be able to decode up to 14 input bits (two 7-bit symbols) at a time. A 16k-word (but rather wide) ROM can easily do it in a single clock cycle. You just pre-compute all of the possible decodings for 14-bit combinations and load them into the ROM. The outputs of the ROM will be:

  • The number of complete symbols found in the current set of 16 input bits (1-8, 3-bits)1
  • The decoded symbols (8 × 7 bits)
  • The number of bits left over for the next cycle (2-8, 3 bits)

for a total of 62 bits of width. This is just under a megabit of block RAM, which would well within the capability of many modern FPGAs. You just need to add a 16-bit register to hold the input data and a barrel shifter2 to move it around as each new byte comes in.


1 Note that it doesn't matter if there are more than 8 symbols in the current set of input bits. If that happens, the rest of them get decoded on the next clock cycle.

2 And remember, if your FPGA has dedicated hardware multiplier modules, they make great barrel shifters!

Related Topic