Flowcharts are often an imprecise way of indicating what hardware is doing, since a flowchart implies a single thread of execution, whereas hardware typically performs many overlapping and simultaneous operations.
The portions of the diagram circled in red seem a bit odd: A is latched with the value after subtracting B, and then B is added back. It would be more natural simply not to bother latching the lower part of the subtraction result. I think the flowchart might be clearer if "named values" were separated into "registers" and "values", and each step computed either values or registers. Thus, for example, one could have something like the following (assuming 16-bit registers):
C:T[15..0] = (A[14..0]:Q[15]) + ~B + 1
If (C or A[15])
  A[15..0] = T[15..0]
  Q[15..1] = Q[14..0]
  Q[0] = 1
Else
  A[15..0] = (A[14..0]:Q[15])
  Q[15..1] = Q[14..0]
  Q[0] = 0
Endif
Every step that updates registers would correspond to one system clock cycle. Steps that merely compute values would not require a clock edge; they would be evaluated combinationally, between edges.
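To make the register/value split concrete, here is a minimal Python model of one such step. It is only a sketch under stated assumptions: the function names are hypothetical, C is taken to be the carry out of the 17-bit sum (so C = 1 means the subtraction did not borrow), and B is assumed to be nonzero.

```python
MASK16 = 0xFFFF  # assumption: 16-bit registers, as in the pseudocode


def divide_step(a, q, b):
    """One shift-and-trial-subtract step; returns the next (A, Q)."""
    # Value: the shifted partial remainder A[14..0]:Q[15].
    shifted = ((a << 1) | (q >> 15)) & MASK16
    a15 = (a >> 15) & 1  # bit shifted out of A
    # Value: C:T = shifted + ~B + 1, i.e. shifted - B in two's complement.
    total = shifted + (~b & MASK16) + 1
    c = (total >> 16) & 1  # carry out: 1 means no borrow
    t = total & MASK16
    # Registers: latch T only when the subtraction is wanted.
    fits = bool(c or a15)
    next_a = t if fits else shifted
    next_q = ((q << 1) & MASK16) | int(fits)
    return next_a, next_q


def divide(dividend, divisor):
    """Run 16 steps (16 clock edges); returns (quotient, remainder)."""
    a, q = 0, dividend & MASK16
    for _ in range(16):
        a, q = divide_step(a, q, divisor)
    return q, a
```

For instance, `divide(1000, 7)` returns `(142, 6)`. Everything computed before the final assignments in `divide_step` plays the role of a "value"; only `next_a` and `next_q` would be latched on the clock edge.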
A divider maps much less elegantly to typical hardware than a multiplier does. Take Lattice ICE40 FPGAs as an example.
Let us compare two cases: this 8x8-bit to 16-bit multiplier:
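module multiply (clk, a, b, result);
  input clk;
  input [7:0] a;
  input [7:0] b;
  // Declared as reg because it is assigned inside an always block.
  output reg [15:0] result;
  // Register the full 16-bit product on every rising clock edge.
  always @(posedge clk)
    result <= a * b;
endmodule // multiply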
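and this divider, which reduces two 8-bit operands to an 8-bit result:
module divide (clk, a, b, result);
  input clk;
  input [7:0] a;
  input [7:0] b;
  // Declared as reg because it is assigned inside an always block.
  output reg [7:0] result;
  // Register the 8-bit quotient on every rising clock edge.
  always @(posedge clk)
    result <= a / b;
endmodule // divide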
(Yes, I know, the clock doesn't do anything beyond registering the output.)
An overview of the generated schematic when mapping the multiplier to an ICE40 FPGA can be found here and the divider here.
The synthesis statistics from Yosys are:
multiply
- Number of wires: 155
- Number of wire bits: 214
- Number of public wires: 4
- Number of public wire bits: 33
- Number of memories: 0
- Number of memory bits: 0
- Number of processes: 0
- Number of cells: 191
  - SB_CARRY 10
  - SB_DFF 16
  - SB_LUT4 165
divide
- Number of wires: 145
- Number of wire bits: 320
- Number of public wires: 4
- Number of public wire bits: 25
- Number of memories: 0
- Number of memory bits: 0
- Number of processes: 0
- Number of cells: 219
  - SB_CARRY 85
  - SB_DFF 8
  - SB_LUT4 126
It's worth noting that the sizes of the synthesized netlists for a full-width multiplier and a divider producing the full quotient aren't that far apart (191 versus 219 cells). However, if you look at the pictures below, you'll notice the multiplier has a depth of maybe 15, whereas the divider looks more like 50 or so; the critical path (i.e. the longest combinational path a signal has to traverse within one clock cycle) is what defines the speed!
You won't be able to read these anyway, but to get a visual impression, I think the differences in complexity are possible to spot. These are single-cycle multipliers/dividers!
Multiply
Multiply on an ICE40 (warning: ~100 Mpixel image)
Divide
Divide on an ICE40 (warning: ~100 Mpixel image)
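One way to see where the depth difference comes from is to count dependent arithmetic stages in a software model. The sketch below is an illustration only, not what Yosys actually generates: `multiply_tree` and `divide_chain` are hypothetical names, the adder-tree arrangement of the multiplier is an assumption, and the counts are arithmetic stages rather than LUT levels (each stage has its own carry chain), so they will not match the 15 and 50 estimated from the pictures.

```python
def multiply_tree(a, b, width=8):
    """Sum the partial products pairwise; returns (product, adder levels)."""
    terms = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
    levels = 0
    while len(terms) > 1:
        # All additions within one level are independent of each other,
        # so in hardware they could run side by side.
        paired = [terms[i] + terms[i + 1] for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            paired.append(terms[-1])
        terms = paired
        levels += 1
    return terms[0], levels


def divide_chain(a, b, width=8):
    """Unrolled restoring division; returns (quotient, dependent stages)."""
    remainder, quotient, stages = 0, 0, 0
    for i in reversed(range(width)):
        remainder = (remainder << 1) | ((a >> i) & 1)
        # The trial subtraction needs the remainder left by the previous
        # stage, so the stages cannot be evaluated in parallel.
        fits = remainder >= b
        if fits:
            remainder -= b
        quotient = (quotient << 1) | int(fits)
        stages += 1
    return quotient, stages
```

With 8-bit operands, `multiply_tree(200, 100)` returns `(20000, 3)` while `divide_chain(200, 7)` returns `(28, 8)`: three levels of independent additions versus eight subtractions that each wait on the one before.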
Best Answer
RSA will not be easy to implement and may require a very large FPGA. RSA is far better suited to running on a general-purpose CPU than on an FPGA. I have seen some implementations of RSA on an FPGA that use a softcore to run the algorithm and the FPGA fabric to accelerate some of the math, but the complete algorithm is not implemented in Verilog. Also, when a file is said to be "RSA encrypted," it usually isn't: the file is generally AES encrypted, and the AES key is then RSA encrypted, since AES is much faster than RSA. If you want to implement an encryption algorithm on an FPGA, especially for a streaming signal, AES would be a much better idea than RSA. You can probably implement AES in a week; it's a pretty simple algorithm.
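For a sense of the math involved, the core RSA operation is modular exponentiation, which breaks down into a long series of modular multiplications. The sketch below assumes that this is the part an FPGA would accelerate (the implementations mentioned above are not specified in that detail). `mod_exp` is a hypothetical helper, and the key is a toy textbook-sized one, far too small for real use; real moduli are thousands of bits wide, which is where the hardware cost comes from.

```python
def mod_exp(base, exponent, modulus):
    """Left-to-right square-and-multiply modular exponentiation."""
    result = 1
    base %= modulus
    for bit in bin(exponent)[2:]:
        # One modular squaring per exponent bit...
        result = (result * result) % modulus
        # ...plus one modular multiplication for every 1 bit.
        if bit == "1":
            result = (result * base) % modulus
    return result


# Toy key for illustration only: n = 61 * 53, e * d = 1 mod 3120.
n, e, d = 3233, 17, 2753
message = 65
cipher = mod_exp(message, e, n)
recovered = mod_exp(cipher, d, n)
```

Here `recovered` equals `message` again. Every loop iteration multiplies two numbers as wide as the modulus, which is why a plain hardware implementation gets large quickly.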