I know of no significant FPGA manufacturer that supports analog-anything, so I don't see why they would support Verilog-AMS or Verilog-A.
For the record (because someone will try to point this out as a "flaw" in my logic): PLLs, ADCs, DACs, and high-speed transceivers are NOT analog circuits in this context. Although modern FPGAs have these blocks in them, the analog parts are not reconfigurable and are not exposed to the Verilog/VHDL programmer.
Every couple of years some new FPGA maker pops up claiming to have "Analog FPGAs" or some similar buzzword. Most of the time it's the same FPGA as before, just bought by a new company. I've looked into it in the past and the analog performance wasn't great (signal-to-noise ratio, dynamic range, frequency response, etc.). Usually these "new" FPGA companies never go into production and soon close their doors. I can't remember the names of any of these companies, and a quick Google search shows no manufacturers doing anything in the past year.
The closest thing to an "Analog FPGA" that I know of today is the Cypress Semiconductor PSoC family. However, you have to use their graphical user interface to configure the analog blocks; you can't use anything like a "language".
The practical FPGA register (a set of D flip-flops) only has one clock pin so it serves exactly one clock domain. Keep this low-level hardware in mind when writing HDL code.
When a signal crosses from one clock domain into another, it's common to use a pipeline of three registers, with the first stage in the source clock domain and the second and third registers in the destination clock domain. This causes some predictable latency, but avoids race conditions and undefined behavior.
// From Xilinx ISE 14.1 Language Templates
// Verilog | Synthesis Constructs | Coding Examples | Misc | Asynchronous Input Synchronization
module async_input_sync(
    input clk,
    (* TIG="TRUE", IOB="FALSE" *) input async_in,
    output reg sync_out
);
    (* ASYNC_REG="TRUE", SHREG_EXTRACT="NO", HBLKNM="sync_reg" *) reg [1:0] sreg;
    always @(posedge clk) begin
        sync_out <= sreg[1];
        sreg <= {sreg[0], async_in};
    end
endmodule
Synchronize mc_data, mc_wr, and mc_addr into the pciclk clock domain. For example:
// async_input_sync is one bit wide, so a bus needs one instance per bit.
// The instance names below are placeholders.
wire [xxx] pci_mc_data; // pci_ signal in pciclk clock domain, source mc_data
wire pci_mc_wr; // pci_ signal in pciclk clock domain, source mc_wr
wire [xxx] pci_mc_addr; // pci_ signal in pciclk clock domain, source mc_addr
async_input_sync sync_mc_data(pciclk, mc_data[xxx], pci_mc_data[xx]);
async_input_sync sync_mc_wr(pciclk, mc_wr, pci_mc_wr);
async_input_sync sync_mc_addr(pciclk, mc_addr[xxx], pci_mc_addr[xx]);
Synchronize pci_data, pci_wr, and pci_addr into the mcclk clock domain. For example:
wire [xxx] mc_pci_data; // mc_ signal in mcclk clock domain, source pci_data
async_input_sync sync_pci_data(mcclk, pci_data[xxx], mc_pci_data[xx]);
You will have one register in the pciclk clock domain, and another register in the mcclk clock domain. Both registers have the same data after the signal propagates across the clock domain. For example, the register on the pci side might look like this:
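// All of these signals are in the pciclk clock domain
always @(posedge pciclk) begin
    // Write from the PCI side
    if (pci_wr && (pci_addr == 0)) begin
        pci_cntl_reg <= pci_data;
    end
    // Synchronized write from the mc side; if both fire in the same cycle, this one wins
    if (pci_mc_wr && (pci_mc_addr == 0)) begin
        pci_cntl_reg <= pci_mc_data;
    end
end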
Also note the use of the non-blocking assignment <= to help the synthesis tool recognize that you're requesting a set of D flip-flops. The Verilog reg keyword is just a data type; it doesn't necessarily result in synthesizing a D flip-flop.
The problem of processing a 2-D kernel of data over a large dataset (not just convolution) comes up so regularly in HD video processing that I came up with a generic way of handling it that I use all the time.
I developed a generic "kernel generator" that uses line buffers and registers to present all of the input data for a given output pixel in parallel. An N×N kernel requires N-1 line buffers and N-1 registers on each of its N rows. It assumes that the data is arriving in "raster-scan" order, like a TV signal.
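A minimal Python sketch of the line-buffer idea, written as a behavioral model rather than HDL. The function name kernel_windows and its arguments are made up for illustration, and the model assumes zero-initialized buffers and only emits windows that lie completely inside the image (edge handling is a separate stage).

```python
from collections import deque

def kernel_windows(pixels, width, n):
    """Behavioral model of a line-buffer kernel generator (hypothetical helper).

    pixels: samples in raster-scan order
    width:  samples per line
    n:      kernel size (n x n)
    Yields (row, col, window), where (row, col) is the position of the
    newest (bottom-right) sample and window is an n x n list of lists.
    """
    # n-1 line buffers, each delaying the stream by exactly one line
    line_buffers = [deque([0] * width, maxlen=width) for _ in range(n - 1)]
    # each row holds its n most recent samples: the current tap plus n-1 registers
    rows = [deque([0] * n, maxlen=n) for _ in range(n)]

    for index, sample in enumerate(pixels):
        # Build the column of n vertically aligned samples, newest line first
        column = [sample]
        for buf in line_buffers:
            delayed = buf[0]        # sample that entered this buffer one line ago
            buf.append(column[-1])  # feed the previous tap in; the oldest entry drops out
            column.append(delayed)

        # Shift the column into the row registers (top row = oldest line)
        for r in range(n):
            rows[r].append(column[n - 1 - r])

        row, col = divmod(index, width)
        if row >= n - 1 and col >= n - 1:
            yield row, col, [list(r) for r in rows]
```

For a 4×4 image holding the values 0..15 and n = 3, the first window produced is [[0, 1, 2], [4, 5, 6], [8, 9, 10]], emitted when sample 10 arrives.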
The next stage could be your multipliers, but more often than not, I need to handle the edges in some special way, such as zeroing out the values that fall outside the input data, or reflecting the data across the edge, or whatever. Therefore, I have some standard modules (N² pixels in, N² pixels out) that consist of counters and multiplexers that do this edge processing before passing the data to the actual data processing module.
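The edge stage can be sketched the same way. This is only an illustrative Python model under stated assumptions: fix_edges, its arguments, and the two mode names are hypothetical, the kernel size is odd, the window is centered on (row, col), and the image is at least as large as the kernel. The row and column arguments play the role of the counters, and the per-tap selection plays the role of the multiplexers.

```python
def fix_edges(window, row, col, height, width, mode="zero"):
    """N*N taps in, N*N taps out; taps outside the image are replaced.

    mode "zero":    out-of-image taps become 0
    mode "reflect": out-of-image taps are taken from the mirrored tap
                    inside the same window (edge sample not repeated)
    """
    n = len(window)
    half = n // 2

    def reflect(pos, size):
        # Mirror a coordinate back into the range 0..size-1
        if pos < 0:
            return -pos
        if pos >= size:
            return 2 * (size - 1) - pos
        return pos

    out = []
    for dr in range(-half, half + 1):
        out_row = []
        for dc in range(-half, half + 1):
            r, c = row + dr, col + dc
            if 0 <= r < height and 0 <= c < width:
                out_row.append(window[dr + half][dc + half])  # pass through
            elif mode == "zero":
                out_row.append(0)
            else:
                rr, cc = reflect(r, height), reflect(c, width)
                out_row.append(window[rr - row + half][cc - col + half])
        out.append(out_row)
    return out
```

At the top-left corner of an image, for example, a 3×3 window whose valid taps are [[0, 1], [4, 5]] becomes [[0, 0, 0], [0, 0, 1], [0, 4, 5]] in "zero" mode and [[5, 4, 5], [1, 0, 1], [5, 4, 5]] in "reflect" mode.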
For a convolution, you can do all of the multiplies in parallel in the same clock period, but then the adds will have to be pipelined. For example, if you can only add two numbers in one clock period, you'll need a "tree" of adders that's 5 levels deep for a 5×5 kernel. If you can add three or four numbers at a time, you will only need 3 levels of pipeline.
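The pipeline depths quoted above can be checked with a few lines of Python. This is just a sketch of the arithmetic: adder_tree is a hypothetical helper that sums a non-empty list level by level, with fan_in standing for how many numbers one adder stage can combine in a clock period.

```python
def adder_tree(values, fan_in=2):
    """Sum values with a tree of fan_in-input adders.

    Returns (total, depth), where depth is the number of adder levels,
    i.e. the number of pipeline stages if each level takes one clock.
    Assumes values is non-empty.
    """
    level = list(values)
    depth = 0
    while len(level) > 1:
        # One pipeline stage: combine groups of up to fan_in partial sums
        level = [sum(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
        depth += 1
    return level[0], depth

products = list(range(25))          # stand-in for the 25 products of a 5x5 kernel
print(adder_tree(products, 2)[1])   # 5 levels with two-input adders
print(adder_tree(products, 3)[1])   # 3 levels with three-input adders
print(adder_tree(products, 4)[1])   # 3 levels with four-input adders
```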
Obviously, the same kernel generator can feed multiple convolutions in parallel if that's what you're doing, but I second Harry Svensson's notion of using FFT techniques if you're doing more than a few of them.