How is pipelined DES different from sequential DES?

Architecture

I implemented a DES (Data Encryption Standard) coder in VHDL using Xilinx ISE with a sequential architecture, which was pretty easy and straightforward. Now my task is to do the same with a pipelined architecture so that the whole design runs at the maximum possible clock frequency. I have read numerous papers from the IEEE site about pipelined DES, but I still can't fully get my head around the topic. What is different? So far I understand that I have to make it so:

There are no complex statements in the code of any module. For example, if I had the statement "x <= a * b * c", I'd have to break it down into two statements inside a for loop: when the iterator equals 0, do "temp <= a * b", and when the iterator equals 1, do "x <= temp * c". That's just an example, but it shows the way of thinking: instead of one complex statement, many simple ones (using loops or for ... generate).
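One note on that example: in a pipeline the splitting is done with registers between the stages rather than with a loop iterator, so a new set of inputs can enter on every clock edge. A minimal sketch of what that might look like (all entity and signal names here are illustrative, not from your design):

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mul3_pipelined is
  port (
    clk     : in  std_logic;
    a, b, c : in  unsigned(7 downto 0);
    x       : out unsigned(23 downto 0));
end entity;

architecture rtl of mul3_pipelined is
  signal temp : unsigned(15 downto 0);
  signal c_d  : unsigned(7 downto 0);  -- delay c so it stays aligned with temp
begin
  process (clk)
  begin
    if rising_edge(clk) then
      temp <= a * b;       -- stage 1: registered partial product
      c_d  <= c;
      x    <= temp * c_d;  -- stage 2: uses last cycle's temp
    end if;
  end process;
end architecture;
```

The result for a given (a, b, c) appears two clocks later, but a fresh (a, b, c) can be accepted every clock, and the clock period only has to cover one multiplication instead of two.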

Once input A reaches Feistel round 2, input B (for example, the next 64-bit plaintext word) is loaded immediately into Feistel round 1, and so on, allowing us to process 16 words "quasi-simultaneously". That would require a synchronizing register before and after every Feistel function module. I also read something about "pipeline" and "control" modules, although none of the articles explained how they work. To be honest, I have no idea how to implement this part.
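The register-per-round idea you describe can be sketched with a generate loop. This is only an outline under assumed names (`feistel_round` here is a dummy placeholder; a real round would apply the expansion, S-boxes, P permutation and the round subkey):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity des_pipeline is
  port (
    clk  : in  std_logic;
    din  : in  std_logic_vector(63 downto 0);   -- data after initial permutation
    dout : out std_logic_vector(63 downto 0));  -- data before final permutation
end entity;

architecture rtl of des_pipeline is
  type stage_array is array (0 to 16) of std_logic_vector(63 downto 0);
  signal stage : stage_array;

  -- Placeholder for one combinational Feistel round (swap + XOR only).
  function feistel_round(d : std_logic_vector(63 downto 0))
    return std_logic_vector is
  begin
    return d(31 downto 0) & (d(63 downto 32) xor d(31 downto 0));
  end function;
begin
  stage(0) <= din;

  gen_rounds : for i in 1 to 16 generate
    process (clk)
    begin
      if rising_edge(clk) then
        stage(i) <= feistel_round(stage(i - 1));  -- register after each round
      end if;
    end process;
  end generate;

  dout <= stage(16);
end architecture;
```

With a fixed key, each stage's subkey is a per-stage constant, so no extra key pipeline is needed; if the key can change per word, it has to travel down the pipeline alongside the data. A "control module" in this context typically just tracks valid flags and fill/flush of the 16-deep pipeline.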

Are my presumptions wrong on either of these two points? Could anyone explain in detail how to approach this problem? Does anyone have an example of a working pipelined DES coder/decoder on an FPGA? I will be thankful for any help.

Best Answer

A series implementation puts all the data through each stage, and you have to wait for the output to settle before you can use it.

(schematic: series vs. pipelined datapath – created using CircuitLab)

With a pipelined implementation there's a register between each stage, so that after the first stage has processed a block, its output can be stored in the register and the second block can enter the pipeline. In this way, at the expense of some latency, throughput can be increased manyfold.

The reason the lower, pipelined implementation is faster is that the series version can only be clocked as fast as the inverse of the propagation delay of F1–F4 cascaded. If the F1–F4 chain takes a long time to compute, you can't pump much data through it.

The pipelined datapath can be clocked as fast as the worst of the stages' propagation delays allows. That means, assuming F1–F4 all have the same propagation delay, you can pump 4x more data through.
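In VHDL the difference between the two datapaths is just where the registers sit. A sketch, assuming `f1`..`f4` are combinational functions and the signals are already declared (all names illustrative):

```vhdl
-- Series: one register at the end. The clock period must cover the
-- entire f4(f3(f2(f1(.)))) cascade before q is valid.
series : process (clk)
begin
  if rising_edge(clk) then
    q <= f4(f3(f2(f1(d))));
  end if;
end process;

-- Pipelined: a register after every stage. The clock period only has
-- to cover the slowest single stage, so with four equal stages the
-- clock (and throughput) is roughly 4x higher, at 4 cycles of latency.
pipelined : process (clk)
begin
  if rising_edge(clk) then
    r1 <= f1(d);
    r2 <= f2(r1);
    r3 <= f3(r2);
    q  <= f4(r3);
  end if;
end process;
```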