Electronic – Does it take long to implement RSA in hardware?

Tags: algorithm, digital-logic, encoder, fpga, verilog

I just finished my first Digital Hardware course. We covered combinational circuits, sequential circuits and FSMs.

We now need to create a final design project. We have 2 weeks to do so and we work in teams of 2.

I want to implement RSA encryption in hardware. Essentially, the FPGA would take an audio signal as input, run RSA on it and output the encrypted audio signal. We will be using Verilog and a DE2 board. I'm worried that this might take too long to do, though.

Could someone tell me if implementing RSA in hardware is a reasonable goal for a 2-week project for someone who has only taken a single digital electronics course? This is my first engineering project, so I'm not very good at scoping yet.

Thank you.

Best Answer

RSA will not be easy to implement and may require a very large FPGA. The core of RSA is modular exponentiation on very wide numbers (2048-bit moduli are typical), which is far better suited to a general-purpose CPU than to an FPGA. I have seen some implementations of RSA on an FPGA that use a softcore processor to run the algorithm and the FPGA fabric to accelerate some of the math, but the complete algorithm is not implemented in Verilog.

Also, when a file is described as "RSA encrypted," it usually isn't: the file itself is generally AES encrypted, and the AES key is then RSA encrypted, since AES is much faster than RSA.

If you want to implement an encryption algorithm on an FPGA, especially for a streaming signal, AES would be a much better choice than RSA. You can probably implement AES in a week; it's a pretty simple algorithm.
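To give a feel for why the math is the hard part, here is a minimal Python sketch (a software model, not hardware) of right-to-left square-and-multiply, the standard way to compute c = m^e mod n. The function name `modexp` and the tiny textbook key (p = 61, q = 53, so n = 3233, with e = 17 and d = 2753) are illustrative assumptions, not anything taken from the answer above.

```python
def modexp(base, exp, mod):
    """Right-to-left square-and-multiply: returns (base**exp % mod, multiplies).

    The second value counts modular multiplications (squarings included),
    which is the operation a hardware RSA datapath would have to repeat.
    """
    result = 1
    base %= mod
    multiplies = 0
    while exp > 0:
        if exp & 1:                      # exponent bit is 1: fold base into result
            result = (result * base) % mod
            multiplies += 1
        exp >>= 1                        # move on to the next exponent bit
        if exp:                          # square only while bits remain
            base = (base * base) % mod
            multiplies += 1
    return result, multiplies


# Hypothetical toy key: n = 61 * 53, and e * d = 1 (mod 3120)
n, e, d = 3233, 17, 2753
c, cost = modexp(65, e, n)       # encrypt the message 65
m, _ = modexp(c, d, n)           # decrypt it again
print(c, m, cost)                # prints: 2790 65 6
```

With a real 2048-bit key, each of those multiplications is a 2048-bit by 2048-bit product reduced modulo n, and decryption needs roughly one squaring per exponent bit plus a multiply per set bit. That is why hardware RSA is usually built around a dedicated modular multiplier (Montgomery multiplication is a common choice) sequenced by an FSM or a softcore.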

Related Solutions

Electronic – hardware implementation of division algorithm

Flowcharts are often an imprecise way of indicating what hardware is doing, since they imply the existence of a single execution process, whereas hardware typically performs many overlapping and simultaneous operations.

The portions of the diagram circled in red seem a bit odd: latching A with the value after subtracting B, only to re-add B afterwards, is roundabout. It would be more natural simply not to latch the lower part of the subtraction result. I think the flowchart might be clearer if "named values" were separated into "registers" and "values", with each step either computing values or updating registers. For example, one could have something like this (assuming 16-bit registers):

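// One iteration: shift {A,Q} left by one bit, then try subtracting B
C:T[15..0] = (A[14..0]:Q[15]) + ~B + 1    // two's complement subtract; C = 1 means no borrow
if (C or A[15])                           // shifted remainder >= B: subtraction succeeds
  A[15..0] = T[15..0]                     // keep the difference
  Q[15..1] = Q[14..0]
  Q[0] = 1
else                                      // shifted remainder < B: nothing to restore
  A[15..0] = (A[14..0]:Q[15])             // keep the shifted value
  Q[15..1] = Q[14..0]
  Q[0] = 0
endif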

Every step that updates registers would take one system clock cycle. Steps that merely compute values would not require a clock edge; they would be handled by combinational logic within the same cycle.
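A bit-accurate Python model of restoring division, following the register/value split described above, can be handy for checking the steps before writing any Verilog. This is a sketch under the same 16-bit assumption: `restoring_divide` is a hypothetical name, `a` and `q` stand in for the A and Q registers, `c` and `t` are the combinational values computed within a cycle, and division by zero is simply rejected because the steps above don't define it.

```python
def restoring_divide(dividend, divisor, width=16):
    """Model of a restoring divider; returns (quotient, remainder).

    Each loop iteration corresponds to one clock edge, so a divide
    takes `width` cycles. Registers: a, q. Combinational values: c, t.
    """
    mask = (1 << width) - 1
    msb = width - 1
    a, q, b = 0, dividend & mask, divisor & mask
    if b == 0:
        raise ZeroDivisionError("divide by zero is not modelled")
    for _ in range(width):
        shifted = ((a << 1) | (q >> msb)) & mask    # A[14..0]:Q[15] for width 16
        total = shifted + (~b & mask) + 1           # two's complement: shifted - B
        c = total >> width                          # carry out, 1 means no borrow
        t = total & mask                            # T[15..0]
        a_top = a >> msb                            # A[15], the bit shifted out
        q = (q << 1) & mask                         # Q[15..1] = Q[14..0]
        if c or a_top:                              # shifted remainder >= B
            a = t                                   # latch the difference
            q |= 1                                  # Q[0] = 1
        else:                                       # shifted remainder < B
            a = shifted                             # keep it; no re-add needed
    return q, a


print(restoring_divide(50000, 7))   # prints: (7142, 6)
```

Checking `A[15]` matters: when the bit shifted out of A is 1, the 17-bit shifted remainder is already at least 2^16, so the subtraction must succeed even though the 16-bit carry out says otherwise.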

Electronic – Why does hardware division take much longer than multiplication?

A divider maps much less elegantly to typical hardware. Take Lattice ICE40 FPGAs as an example.

Let us compare two cases: this 8x8-bit to 16-bit multiplier:

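module multiply (clk, a, b, result);
   input clk;
   input [7:0] a;
   input [7:0] b;
   output reg [15:0] result;   // declared reg: it is assigned in an always block
   // The whole 8x8 product is combinational logic feeding one register.
   always @(posedge clk)
     result <= a * b;          // non-blocking assignment for clocked logic
endmodule // multiply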

and this divider, which takes two 8-bit operands and produces an 8-bit result:

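module divide(clk, a, b, result);
   input clk;
   input [7:0] a;
   input [7:0] b;
   output reg [7:0] result;    // declared reg: it is assigned in an always block
   // The whole 8-bit division is combinational logic feeding one register.
   // Note: a / b with b == 0 evaluates to x in simulation.
   always @(posedge clk)
     result <= a / b;          // non-blocking assignment for clocked logic
endmodule // divide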

(Yes, I know, the clock does nothing useful here beyond registering the result.)

An overview of the generated schematic when mapping the multiplier to an ICE40 FPGA can be found here and the divider here.

The synthesis statistics from Yosys are:

Statistic                   multiply  divide
Number of wires                  155     145
Number of wire bits              214     320
Number of public wires             4       4
Number of public wire bits        33      25
Number of memories                 0       0
Number of memory bits              0       0
Number of processes                0       0
Number of cells                  191     219
  SB_CARRY                        10      85
  SB_DFF                          16       8
  SB_LUT4                        165     126

It's worth noting that the sizes of the synthesized netlists for the full-width multiplier and the divider aren't that different. However, if you look at the pictures below, you'll notice that the multiplier has a logic depth of maybe 15, whereas the divider's looks more like 50 or so, and the critical path (i.e. the longest path a signal must propagate through between registers within one clock cycle) is what determines the speed!
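The depth difference comes from data dependencies. As a rough, hypothetical model (not a reproduction of what Yosys actually generates for the ICE40): an 8x8 multiplier's partial products depend only on the inputs, so they can be summed in a tree of about log2(8) = 3 adder levels, whereas each quotient bit of a combinational (array) divider needs the remainder from the previous stage, so the 8 subtract-and-compare stages sit in series. The function names below are illustrative, and they count stages only; in real hardware each stage also contains its own carry chain.

```python
def multiply_tree(a, b, width=8):
    """Sum the partial products pairwise; returns (product, adder_levels)."""
    # Every partial product depends only on a and b, so all exist at once.
    partials = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
    levels = 0
    while len(partials) > 1:
        # One level of adders, all working in parallel on disjoint pairs.
        partials = [sum(partials[i:i + 2]) for i in range(0, len(partials), 2)]
        levels += 1
    return partials[0], levels


def divide_array(a, b, width=8):
    """Unrolled restoring division; returns (quotient, remainder, stages_in_series)."""
    rem, quot, stages = 0, 0, 0
    for i in reversed(range(width)):
        # This stage cannot start until the previous stage's remainder is known.
        rem = (rem << 1) | ((a >> i) & 1)
        if rem >= b:
            rem -= b
            quot |= 1 << i
        stages += 1
    return quot, rem, stages


print(multiply_tree(200, 123))   # prints: (24600, 3)
print(divide_array(200, 7))      # prints: (28, 4, 8)
```

For n-bit operands the adder tree grows by roughly log2(n) levels, while the divider's chain of dependent stages grows linearly in n; that is consistent with the much deeper critical path in the divider schematic.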

You won't be able to read these images in detail anyway; they are only meant to give a visual impression, and I think the difference in complexity is easy to spot. Note that these are single-cycle multipliers/dividers!

Multiply

Multiply on an ICE40 (warning: ~100 Mpixel image)

Divide

Divide on an ICE40 (warning: ~100 Mpixel image)