How is pipelined DES different from sequential DES?

Architecture

I implemented a DES (Data Encryption Standard) coder in VHDL using Xilinx ISE with a sequential architecture, which was pretty easy and straightforward. Now my task is to do the same with a pipelined architecture so that the whole design runs at the maximum possible clock frequency. I have read numerous papers from the IEEE site about pipelined DES, but I still can't fully get my head around the topic. What is different? So far I understand that I have to make it so:

There are no complex statements in the code of any module. For example, if I had the statement "x <= a * b * c", I'd have to break it down into two statements inside a for loop: when the iterator equals 0, do "temp <= a * b", and when the iterator equals 1, do "x <= temp * c". That's just an example, but it shows the way of thinking: instead of one complex statement, many simple ones (using loops or for ... generate).
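One note on that example: in a pipeline the splitting is done with registers between the stages rather than with a loop iterator, so a new set of inputs can enter on every clock edge. A minimal sketch of what that might look like (all entity and signal names here are illustrative, not from your design):

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mul3_pipelined is
  port (
    clk     : in  std_logic;
    a, b, c : in  unsigned(7 downto 0);
    x       : out unsigned(23 downto 0));
end entity;

architecture rtl of mul3_pipelined is
  signal temp : unsigned(15 downto 0);
  signal c_d  : unsigned(7 downto 0);  -- delay c so it stays aligned with temp
begin
  process (clk)
  begin
    if rising_edge(clk) then
      temp <= a * b;       -- stage 1: registered partial product
      c_d  <= c;
      x    <= temp * c_d;  -- stage 2: uses last cycle's temp
    end if;
  end process;
end architecture;
```

The result for a given (a, b, c) appears two clocks later, but a fresh (a, b, c) can be accepted every clock, and the clock period only has to cover one multiplication instead of two.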

Once input A reaches Feistel round 2, input B (for example, the next 64-bit plaintext word) is loaded immediately into Feistel round 1, and so on, allowing us to process 16 words "quasi-simultaneously". That would require a synchronizing register before and after every Feistel function module. I also read something about "pipeline" and "control" modules, although none of the articles explained how they work. To be honest, I have no idea how to implement this part.
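The register-per-round idea you describe can be sketched with a generate loop. This is only an outline under assumed names (`feistel_round` here is a dummy placeholder; a real round would apply the expansion, S-boxes, P permutation and the round subkey):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity des_pipeline is
  port (
    clk  : in  std_logic;
    din  : in  std_logic_vector(63 downto 0);   -- data after initial permutation
    dout : out std_logic_vector(63 downto 0));  -- data before final permutation
end entity;

architecture rtl of des_pipeline is
  type stage_array is array (0 to 16) of std_logic_vector(63 downto 0);
  signal stage : stage_array;

  -- Placeholder for one combinational Feistel round (swap + XOR only).
  function feistel_round(d : std_logic_vector(63 downto 0))
    return std_logic_vector is
  begin
    return d(31 downto 0) & (d(63 downto 32) xor d(31 downto 0));
  end function;
begin
  stage(0) <= din;

  gen_rounds : for i in 1 to 16 generate
    process (clk)
    begin
      if rising_edge(clk) then
        stage(i) <= feistel_round(stage(i - 1));  -- register after each round
      end if;
    end process;
  end generate;

  dout <= stage(16);
end architecture;
```

With a fixed key, each stage's subkey is a per-stage constant, so no extra key pipeline is needed; if the key can change per word, it has to travel down the pipeline alongside the data. A "control module" in this context typically just tracks valid flags and fill/flush of the 16-deep pipeline.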

Are my presumptions wrong on either of these two points? Could anyone explain in detail how to approach this problem? Does anyone have an example of a working pipelined DES coder/decoder on an FPGA? I will be thankful for any help.

Best Answer

A series implementation puts all the data through each stage, and you have to wait for the output to settle before you can use it.

(schematic: series vs. pipelined datapath – created using CircuitLab)

With a pipelined implementation there's a register between each stage, so that after the first stage has processed a block, its output can be stored in the register and the second block can enter the pipeline. In this way, at the expense of some latency, throughput can be increased manyfold.

The reason the lower, pipelined implementation is faster is that the series version can only be clocked as fast as the inverse of the propagation delay of F1–F4 cascaded. If the F1–F4 chain takes a long time to compute, you can't pump much data through it.

The pipelined datapath can be clocked as fast as the worst of the stages' propagation delays allows. That means, assuming F1–F4 all have the same propagation delay, you can pump 4x more data through.
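In VHDL the difference between the two datapaths is just where the registers sit. A sketch, assuming `f1`..`f4` are combinational functions and the signals are already declared (all names illustrative):

```vhdl
-- Series: one register at the end. The clock period must cover the
-- entire f4(f3(f2(f1(.)))) cascade before q is valid.
series : process (clk)
begin
  if rising_edge(clk) then
    q <= f4(f3(f2(f1(d))));
  end if;
end process;

-- Pipelined: a register after every stage. The clock period only has
-- to cover the slowest single stage, so with four equal stages the
-- clock (and throughput) is roughly 4x higher, at 4 cycles of latency.
pipelined : process (clk)
begin
  if rising_edge(clk) then
    r1 <= f1(d);
    r2 <= f2(r1);
    r3 <= f3(r2);
    q  <= f4(r3);
  end if;
end process;
```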