Electronic – Is it possible to create an IIR filter in an FPGA that’s clocked at the sample frequency

dspfilterfpgaiir

This question is about implementing a IIR filter in a FPGA with DSP slices, with very specific criteria.

Lets say you're making a filter with no forward taps and only 1 reverse tap, with this equation:

$$y[n] = y[n-1] \cdot b1 + x[n]$$

(see image)

Take the DSP48A1 slice from Xilinx as an example – most hard IP DSP slices are similar.

Lets say you have analog data incoming at 1 sample per clock. I would like to design an IIR filter that runs synchronously at the sample clock.

The problem is that in order to run the DSP slice at the maximum rate, you can't multiply AND add on the same cycle. You have to have a pipeline register between these components.

So, if you have 1 new sample every clock, you will need to produce 1 output per clock. However, you need the previous output 2 clocks before you can produce a new one in this design.

The obvious solution is to either process the data at double clock rate, or to disable the pipeline register so that you can multiply and add in the same cycle.

Unfortunately, if say you're sampling at the max clock rate of the fully pipelined DSP slice, neither of those solutions are possible. Is there any other way to build this?

(Bonus points if you can design a IIR filter that operates at half of the sample rate, using any number of DSP slices)

The goal would be to run a compensation filter for a 1 GSPS ADC in a Xilinx Artix FPGA. Their DSP slices can run just over 500 MHz when fully pipelined. If there is a solution for 1 sample per clock, I would like to try and scale the solution for 2 samples per clock. This is all very easy with an FIR filter.

single feedback IIR filter example

Best Answer

I haven't worked with IIR filters yet, but if you only need to calculate the given equation

y[n] = y[n-1]*b1 + x[n]

once per CPU cycle, you can use pipelining.

In one cycle you do the multiplication and in one cycle you need to do the summation for each input sample. That means your FPGA must be able to do the multiplication in one cycle when clocked at the given sample rate! Then you'll only need to do the multiplication of the current sample AND the summation of the last sample's multiplication result in parallel. This will causes a constant processing lag of 2 cycles.

Ok, let's have a look at the formula and design a pipeline:

y[n] = y[n-1]*b1 + x[n]

Your pipeline code could look like this:

output <= last_output_times_b1 + last_input
last_output_times_b1 <= output * b1;
last_input <= input

Note that all three commands need to be executed in parallel and that "output" in the second line therefore uses the output from the last clock cycle!

I didn't work much with Verilog, so this code's syntax is most possibly wrong (e.g. missing bit-width of input/output signals; execution syntax for multiplication). However you should get the idea:

module IIRFilter( clk, reset, x, b, y );
  input clk, reset, x, b;
  output y;

  reg y, t, t2;
  wire clk, reset, x, b;

  always @ (posedge clk or posedge reset)
  if (reset) begin
    y <= 0;
    t <= 0;
    t2 <= 0;
  end else begin
    y <= t + t2;
    t <= mult(y, b);
    t2 <= x
  end

endmodule

PS: Maybe some experienced Verilog programmer could edit this code and remove this comment and the comment above the code afterwards. Thanks!

PPS: In case your factor "b1" is a fixed constant, you might be able to optimize the design by implementing a special multiplier that only takes one scalar input and calculates "times b1" only.

Response to: "Unfortunately, this is actually equivalent to y[n] = y[n-2] * b1 + x[n]. This is because of the extra pipeline stage." as comment to old version of answer

Yes, that was actually right for the following old (INCORRECT!!!) version:

  always @ (posedge clk or posedge reset)
  if (reset) begin
    t <= 0;
  end else begin
    y <= t + x;
    t <= mult(y, b);
  end

I hopefully corrected this bug now by delaying the input values, too in a second register:

  always @ (posedge clk or posedge reset)
  if (reset) begin
    y <= 0;
    t <= 0;
    t2 <= 0;
  end else begin
    y <= t + t2;
    t <= mult(y, b);
    t2 <= x
  end

To make sure it works correctly this time let's look what happens at the first few cycles. Note that the first 2 cycles produce more or less (defined) garbage, as no previous output values (e.g. y[-1] == ??) are available. The register y is initialized with 0, which is equivalent to assuming y[-1] == 0.

First Cycle (n=0):

BEFORE: INPUT (x=x[0], b); REGISTERS (t=0, t2=0, y=0)

y <= t + t2;      == 0
t <= mult(y, b);  == y[-1] * b  = 0
t2 <= x           == x[0]

AFTERWARDS: REGISTERS (t=0, t2=x[0], y=0), OUTPUT: y[0]=0

Second Cycle (n=1):

BEFORE: INPUT (x=x[1], b); REGISTERS (t=0, t2=x[0], y=y[0])

y <= t + t2;      ==     0  +  x[0]
t <= mult(y, b);  ==  y[0]  *  b
t2 <= x           ==  x[1]

AFTERWARDS: REGISTERS (t=y[0]*b, t2=x[1], y=x[0]), OUTPUT: y[1]=x[0]

Third Cycle (n=2):

BEFORE: INPUT (x=x[2], b); REGISTERS (t=y[0]*b, t2=x[1], y=y[1])

y <= t + t2;      ==  y[0]*b +  x[1]
t <= mult(y, b);  ==  y[1]   *  b
t2 <= x           ==  x[2]

AFTERWARDS: REGISTERS (t=y[1]*b, t2=x[2], y=y[0]*b+x[1]), OUTPUT: y[2]=y[0]*b+x[1]

Fourth Cycle (n=3):

BEFORE: INPUT (x=x[3], b); REGISTERS (t=y[1]*b, t2=x[2], y=y[2])

y <= t + t2;      ==  y[1]*b +  x[2]
t <= mult(y, b);  ==  y[2]   *  b
t2 <= x           ==  x[3]

AFTERWARDS: REGISTERS (t=y[2]*b, t2=x[3], y=y[1]*b+x[2]), OUTPUT: y[3]=y[1]*b+x[2]

We can see, that beginning with cylce n=2 we get the following output:

y[2]=y[0]*b+x[1]
y[3]=y[1]*b+x[2]

which is equivalent to

y[n]=y[n-2]*b + x[n-1]
y[n]=y[n-1-l]*b1 + x[n-l],  where l = 1
y[n+l]=y[n-1]*b1 + x[n],  where l = 1

As mentioned above we introduce an additional lag of l=1 cycles. That means that your output y[n] is delayed by lag l=1. That means the output data is equivalent but is delayed by one "index". To be more clear: The output data delayed be 2 cycles, as one (normal) clock cycle is needed and 1 additional (lag l=1) clock cycle is added for the intermediate stage.

Here is a sketch to graphically depict how the data flows:

sketch of data flow

PS: Thank you for having a close look at my code. So I learned something, too! ;-) Let me know if this version is correct or if you see any more issues.

Related Solutions

Electronic – Pipelining and clock frequency issue

Your design will not function correctly if it runs at 100 MHz but is only spec'd (by the tools) to run at 50 MHz. If it does, then it's a one-off miracle that wouldn't work when you make a change and rerun the tools. Don't do it. Don't even do it if your clock is 100 MHz and the tools tell you the design can run at 99.5 MHz.

To solve your problem you can either write a simple 'divide by power of 2' clock divider to reduce the clock frequency (something like this in Verilog):

reg [n:0] count; 

always @(posedge CLK_100) begin
  count <= count + 1;
end

BUFG bg_0 (.I(count[m]), .O(CLK_DIV));

(where 'm' <= 'n' and 'bufg' is a global clock buffer, and must be used for synchronous designs) or use a Digital Clock Manager (DCM).

Hopefully that solves your pipelining issues as well unless you absolutely have to run the entire design at 100 MHz. Other than pipelining you can consider using FIFOs if you have part of the design running at 50 MHz and the other at 100 MHz, but you'll have to say a bit more about what you're doing to get more meaningful help here.

Electronic – FPGA firmware design: How big is too big

I wonder if there is another way of looking at the problem?

Playing off your estimation of 512 FFT operations (64 point each) and 42k MAC operations... I presume this is what you need for one pass through the algorithm?

Now you have found an FFT core using 4 DSP units ... but how many clock cycles does it take per FFT? (throughput, not latency)? Let's say 64, or 1 cycle per point. Then you have to complete those 42k Mac operations in 64 cycles - perhaps 1k MACs per cycle, with each MAC handling 42 operations.

Now it is time to look at the rest of the algorithm in more detail : identify not MACs but higher level operations (filtering, correlation, whatever) that can be re-used. Build cores for each of these operations, with reusability (e.g. filters with different selectable coefficient sets) and soon you may find relatively few multiplexers are required between relatively large cores...

Also, is any strength reduction possible? I had some cases where multiplications in loops were required to generate quadratics (and higher). Unrolling them, I could iteratively generate them without multiplication : I was quite pleased with myself the day I built a Difference Engine on FPGA!

Without knowing the application I can't give more details but some such analysis is likely to make some major simplifications possible.

Also - since it sounds as if you don't have a definite platform in mind - consider if you can partition across multiple FPGAs ... take a look at this board or this one which offer multiple FPGAs in a convenient platform. They also have a board with 100 Spartan-3 devices...

(p.s. I was disappointed when the software guys closed this other question - I think it's at least as appropriate there)

Edit : re your edit - I think you are starting to get there. If all the multiplier inputs are either FFT outputs, or "not-filter" coefficients, you are starting to see the sort of regularity you need to exploit. One input to each multiplier connects to an FFT output, the other input to a coefficient ROM (BlockRam implemented as a constant array).

Sequencing different FFT operations through the same FFT unit will automatically sequence the FFT outputs past this multiplier. Sequencing the correct coefficients into the other MPY input is now "merely" a matter of organising the correct ROM addresses at the correct time : an organisational problem, rather than a huge headache of MUXes.

On performance : I think Dave Tweed was being needlessly pessimistic - the FFT taking n*log(n) operations, but you get to choose O(n) butterfly units and O(logN) cycles, or O(logN) units and O(n) cycles, or some other combination to suit your resource and speed goals. One such combination may make the post-FFT multiply structure much simpler than others...

Best Answer

Response to: "Unfortunately, this is actually equivalent to y[n] = y[n-2] * b1 + x[n]. This is because of the extra pipeline stage." as comment to old version of answer

Related Solutions

Electronic – Pipelining and clock frequency issue

Electronic – FPGA firmware design: How big is too big

Related Topic