I can't comment on your specific application (not being a cryptography expert), but placing a processor alongside an FPGA is an exceedingly common thing to do. The main reason is that it frees up FPGA fabric for what the FPGA is good at, while the less expensive separate processor does what it is good at, perhaps even faster than a soft CPU running in the FPGA could. In addition, larger FPGAs can get quite expensive, compared to fast ARM processors, which can be fairly reasonably priced.
Basically, I think you should use the two chips, but it's hard to make a firm recommendation without knowing the details of your specific application.
The code you show is essentially a priority encoder.
That is, it has an input of many signals, and its output indicates which of those signals is set, giving priority to the left-most set signal if more than one is set.
However, I see conflicting definitions of the standard behavior for this circuit in the two places I checked.
According to Wikipedia, the standard priority encoder numbers its inputs from 1. That is, if the least significant input bit is set, it outputs 1, not 0. The Wikipedia priority encoder outputs 0 when none of the input bits are set.
Xilinx's XST User Guide (p. 80), however, defines a priority encoder closer to what you coded: the inputs are numbered from 0, so when the input's LSB is set it gives a 0 output. However, the Xilinx definition gives no spec for the output when all input bits are clear (your code will output 3'd7).
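To make the difference between the two conventions concrete, here's what the Wikipedia-style (1-indexed) encoder might look like. This is just a sketch for comparison; the module and port names are my own, and I've kept the same lowest-bit-wins priority as the Xilinx template below so the two are easy to compare side by side:

```verilog
// Sketch of the Wikipedia-style convention: inputs numbered from 1,
// output 0 reserved for "no input set". Names are illustrative.
module wiki_style_encoder (sel, code);
    input  [6:0] sel;        // only 7 inputs fit, since code 0 means "none"
    output reg [2:0] code;

    always @(sel)
    begin
        if      (sel[0]) code = 3'd1;   // lowest input -> 1, not 0
        else if (sel[1]) code = 3'd2;
        else if (sel[2]) code = 3'd3;
        else if (sel[3]) code = 3'd4;
        else if (sel[4]) code = 3'd5;
        else if (sel[5]) code = 3'd6;
        else if (sel[6]) code = 3'd7;
        else             code = 3'd0;   // all inputs clear -> 0
    end
endmodule
```

Note the trade-off: reserving code 0 for "nothing set" means a 3-bit output can only distinguish 7 inputs instead of 8.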
The Xilinx user guide, of course, determines what the Xilinx synthesis software expects. The main point is that a special directive, (* priority_extract = "force" *), is required for XST to recognize this structure and generate optimal synthesis results.
Here's Xilinx's recommended form for an 8-to-3 priority encoder:
(* priority_extract = "force" *)
module v_priority_encoder_1 (sel, code);
    input  [7:0] sel;
    output [2:0] code;
    reg    [2:0] code;

    always @(sel)
    begin
        if      (sel[0]) code = 3'b000;
        else if (sel[1]) code = 3'b001;
        else if (sel[2]) code = 3'b010;
        else if (sel[3]) code = 3'b011;
        else if (sel[4]) code = 3'b100;
        else if (sel[5]) code = 3'b101;
        else if (sel[6]) code = 3'b110;
        else if (sel[7]) code = 3'b111;
        else             code = 3'bxxx;
    end
endmodule
If you can rearrange your surrounding logic to let you use Xilinx's recommended coding style, that's probably the best way to get a better result.
I think you can get this by instantiating the Xilinx encoder module with
v_priority_encoder_1 pe_inst (.sel({~|{RL[6:0]}, RL[6:0]}), .code(rlever));
Here I've reduction-NORed all bits of RL[6:0] (the ~|{RL[6:0]} term) to get an 8th input bit that triggers the 3'b111 output when all RL bits are low.
For the llever logic, you can probably reduce the resource usage by making a modified encoder module, following the Xilinx template, but requiring only 7 input bits (your 6 bits of LL plus an additional bit that goes high when the other 6 are all low).
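Following that suggestion, the trimmed-down 7-input encoder might look like the sketch below. This is untested, and the module name, port names, and the assumption that your LL signal is 6 bits wide are all mine:

```verilog
// Sketch of a 7-input variant of the Xilinx template. sel[5:0] would
// carry LL[5:0]; sel[6] is the "all LL bits low" flag.
(* priority_extract = "force" *)
module v_priority_encoder_7 (sel, code);
    input  [6:0] sel;
    output reg [2:0] code;

    always @(sel)
    begin
        if      (sel[0]) code = 3'b000;
        else if (sel[1]) code = 3'b001;
        else if (sel[2]) code = 3'b010;
        else if (sel[3]) code = 3'b011;
        else if (sel[4]) code = 3'b100;
        else if (sel[5]) code = 3'b101;
        else if (sel[6]) code = 3'b110;
        else             code = 3'bxxx;
    end
endmodule
```

It could then be instantiated the same way as the RL version, e.g. v_priority_encoder_7 pe_ll (.sel({~|{LL[5:0]}, LL[5:0]}), .code(llever));, with the reduction NOR supplying the all-low flag.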
Using this template assumes the version of ISE you have is using the XST synthesis engine. It seems like they change synthesis tools on every major rev of ISE, so check that the document I linked actually corresponds to your version of ISE. If not, check the recommended style in your documentation to see what your tool expects.
Your question is rather broad.
To start with: the good news is that you don't need to buy an FPGA board to find out how big your design is. The development tool will tell you. It will also tell you if you exceed the available resources (memories, LUTs, registers, DSPs, or I/O pins). If the design does not fit, you select a bigger FPGA in the tool settings, until you get to the really BIG ones you probably can't afford because they cost, e.g., $15,000 each.
The second piece of good news is that most FPGA development tools are free, at least for the smaller FPGAs. And 'small' is still rather big.
The not-so-good news is that HLS is still maturing. We ran some tests, and the HLS-generated designs still markedly under-performed hand-written Verilog or VHDL. But for just comparing algorithms they are probably good enough.
Now, as to "flow, parallelism", you get into difficult areas. The more logic runs in parallel, or the more pipeline stages you add, the faster the algorithm will run. But resource utilization (area) will also go up. It is one of the many tasks of an HDL designer to find a balance between speed and area.
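As a small illustration of that trade-off, here's the same multiply-accumulate written two ways (a sketch with illustrative names, not code from any particular design): one version computes everything in a single cycle through one long combinational path, the other splits the work across two pipeline stages, which raises the achievable clock rate at the cost of extra registers and a cycle of latency.

```verilog
// One-cycle version: a * b + c in a single long combinational path.
module mac_comb (input clk, input [15:0] a, b, input [31:0] c,
                 output reg [31:0] y);
    always @(posedge clk)
        y <= a * b + c;       // critical path: multiplier plus adder
endmodule

// Two-stage pipelined version: shorter paths, higher fmax,
// but more registers and one extra cycle of latency.
module mac_pipe (input clk, input [15:0] a, b, input [31:0] c,
                 output reg [31:0] y);
    reg [31:0] p, c_d;
    always @(posedge clk) begin
        p   <= a * b;         // stage 1: multiply
        c_d <= c;             // delay c so it stays aligned with p
        y   <= p + c_d;       // stage 2: add
    end
endmodule
```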
Getting to "array width/length": that is the fastest way I have found to fill an FPGA. I recently designed code for convolution matrices. It was a module that took the matrix width/height as parameters. With little trouble I managed to fill 60% of the FPGA with that module alone (it was supposed to use 15%).
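To show why parameterized array dimensions eat area so quickly, here is a sketch of that kind of module (not the actual design; all names and widths are illustrative). The for loop is fully unrolled by synthesis, so the inferred multiplier count grows as WIDTH*HEIGHT, and doubling either parameter roughly doubles the area:

```verilog
// Sketch: a parameterized window multiply-accumulate. Synthesis
// unrolls the loop, inferring WIDTH*HEIGHT parallel multipliers.
module window_mac #(parameter WIDTH = 3, HEIGHT = 3, DW = 8)
   (input clk,
    input  [WIDTH*HEIGHT*DW-1:0] pixels,   // flattened input window
    input  [WIDTH*HEIGHT*DW-1:0] coeffs,   // flattened coefficients
    output reg [2*DW+7:0] acc);            // headroom for the sum

    integer i;
    reg [2*DW+7:0] sum;

    always @(posedge clk) begin
        sum = 0;
        for (i = 0; i < WIDTH*HEIGHT; i = i + 1)   // unrolled in hardware
            sum = sum + pixels[i*DW +: DW] * coeffs[i*DW +: DW];
        acc <= sum;
    end
endmodule
```

Bumping WIDTH and HEIGHT from 3 to 6 takes the module from 9 multipliers to 36, which is exactly how a module budgeted at 15% of the chip ends up using 60%.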