Electronic – digital bandpass filter with parallel inputs

dac, digital-filter, digital-logic, filter, fpga

I have a high-speed ADC connected to an FPGA. The ADC is designed such that you get 16 samples every FPGA clock cycle (that's samples, not bits). The 16 samples come from a single ADC input channel; they are collected over time and sent as one big chunk (samp[0] is the earliest sample in time, samp[15] is the latest).

I need to design a bandpass filter for this parallel stream, but the digital implementations I have found online all assume a single sample per clock cycle (one sample is pushed into the filter structure, one sample comes out of the filter structure). I cannot simply create an internal clock that is 16x faster, as that frequency would be too high for the FPGA. I need to filter the data while sending in 16 samples and getting out 16 filtered samples every clock cycle.

Can someone help me get started? What is this kind of filter called? Is there some mathematical trick where I can just build 16 parallel digital filters and then combine their outputs somehow (a decimation-in-time filter, perhaps)?

Best Answer

It should be possible to unroll this, but for a 64-tap filter it will require 64*16 = 1024 MAC operations per clock cycle. Think about it like this:

y[n] = a0 * x[n] + a1 * x[n-1] + ... + a63 * x[n-63]

That's the filter operation that you need to do. Let's simplify that a bit and only consider the first 3 terms:

y[n] = a0 * x[n] + a1 * x[n-1] + a2 * x[n-2]

Each -1 is one clock cycle of delay. If you get one sample per clock cycle, then you can implement that directly with three multipliers and three registers to store the x values. However, if you get two x values per clock cycle, you also need to produce two y values per clock cycle. In that case, you need to do something like this, presuming the inputs in a given cycle are x[2n] and x[2n+1]:

y[2n]   = a0 * x[2n]   + a1 * x[2n+1-2] + a2 * x[2n-2]
y[2n+1] = a0 * x[2n+1] + a1 * x[2n]     + a2 * x[2n+1-2]
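
In code, the two-samples-per-cycle case looks something like the following minimal Python sketch (names such as fir_parallel2 and hist are just placeholders; in hardware, hist corresponds to the registers holding the previous cycle's inputs):

# Behavioural sketch of the 3-tap, 2-samples-per-cycle case above.
# 'hist' models the registers that hold the previous clock cycle's inputs.
def fir_parallel2(a0, a1, a2, blocks):
    hist = [0, 0]              # hist[0] = x[2n-2], hist[1] = x[2n+1-2]
    out = []
    for x0, x1 in blocks:      # x0 = x[2n], x1 = x[2n+1] for this cycle
        y0 = a0 * x0 + a1 * hist[1] + a2 * hist[0]
        y1 = a0 * x1 + a1 * x0 + a2 * hist[1]
        out.extend([y0, y1])
        hist = [x0, x1]        # this cycle's inputs become next cycle's delayed terms
    return out

Running the same sample stream through an ordinary one-sample-per-cycle FIR gives identical outputs, which is a handy sanity check on the index bookkeeping.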

And you can continue this for more inputs:

y[3n]   = a0 * x[3n]   + a1 * x[3n+2-3] + a2 * x[3n+1-3]
y[3n+1] = a0 * x[3n+1] + a1 * x[3n]     + a2 * x[3n+2-3]
y[3n+2] = a0 * x[3n+2] + a1 * x[3n+1]   + a2 * x[3n]

Note that in this case, each clock cycle of delay is NOT a delay of 1, so I have written each delayed term as its original index minus the delay in samples. For example, x[2n] becomes x[2n-2] on the next clock cycle, and x[2n+1] becomes x[2n+1-2]. You can scale this pattern to whatever parallelism you need; however, I would recommend using a Python script or similar to generate your HDL, as this would be a nightmare to implement manually.
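
As a starting point for such a script, here is a rough Python sketch (the name unrolled_terms is made up here, and it only prints the term lists rather than emitting real HDL; x1_z1 means "x[Pn+1] delayed by one clock cycle"):

# Rough sketch of a generator for the unrolled filter equations. P is the number
# of samples per clock cycle, num_taps is the filter length. Indices that fall
# before the current cycle are folded back into 0..P-1 and marked with a delay.
def unrolled_terms(P, num_taps):
    for j in range(P):                # one equation per parallel output y[Pn + j]
        terms = []
        for k in range(num_taps):     # coefficient a_k multiplies x[Pn + j - k]
            m, delay = j - k, 0
            while m < 0:              # term belongs to an earlier clock cycle
                m += P
                delay += 1
            src = f"x{m}" + (f"_z{delay}" if delay else "")
            terms.append(f"a{k}*{src}")
        print(f"y{j} = " + " + ".join(terms))

Calling unrolled_terms(3, 3) reproduces the three equations above; unrolled_terms(16, 64) gives the 16 equations (64 terms each) for the case in the question.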

All in all, you will need (parallel sample count) * (filter length) MAC operations per clock cycle. Note that in some cases it may be possible to do two MAC operations in one DSP slice, if the slice has a pre-adder and your filter coefficient list has a symmetry that you can exploit. So if you are using a modern Xilinx chip, it may be possible to implement this in 512 DSP slices.
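
For illustration, the symmetric-coefficient trick looks like this for one output sample (a sketch assuming a linear-phase filter, i.e. taps[k] == taps[N-1-k], and enough sample history; the pre-adder computes the bracketed sum so each multiplier serves two taps):

# Sketch of one output sample with symmetric coefficients. Assumes n >= N-1
# so all of the required history exists in x.
def fir_symmetric_sample(taps, x, n):
    N = len(taps)
    acc = 0
    for k in range(N // 2):
        acc += taps[k] * (x[n - k] + x[n - (N - 1 - k)])   # pre-add, then one multiply
    if N % 2:                                              # odd length: middle tap stands alone
        acc += taps[N // 2] * x[n - N // 2]
    return acc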

Edit: Here's another option that's a little crazy, but it might be worth looking at. It is possible to build an FIR filter without using any DSP slices that is still reasonably fast: a distributed arithmetic (DA) filter. The tradeoff is that for an input sample width of M bits, it requires M clock cycles to compute the next output sample. Since you're already doing 16 samples in parallel, it might be worth trying a distributed arithmetic implementation that is 16*M-way parallel at the bit level; with 16-bit samples and 16 samples per clock, that would only be 256 parallel DA filter units. I have not done much with distributed arithmetic, so I'm not sure exactly how well it scales, but it is another possible way to implement your filter. I'm not sure what FPGA you're using, but it's possible that you won't have enough multipliers to build a more standard design with DSP slices, and DA may be the only option.
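
To give a flavour of the idea, here is a very rough Python model of a bit-serial distributed-arithmetic FIR for a single output sample (the function names are made up; it assumes unsigned M-bit inputs, whereas real designs also handle the two's-complement sign bit and split long filters across several small LUTs, since the table below has 2**N entries):

# Precompute the DA lookup table: entry 'addr' is the sum of the coefficients
# whose corresponding sample contributes a 1 bit at a given bit position.
def da_lut(taps):
    return [sum(a for k, a in enumerate(taps) if (addr >> k) & 1)
            for addr in range(2 ** len(taps))]

# One output sample: one LUT access and one shifted add per input bit,
# i.e. M "clock cycles" of work. history[k] holds x[n-k].
def da_output(taps, history, M, lut):
    acc = 0
    for b in range(M):
        addr = sum(((history[k] >> b) & 1) << k for k in range(len(taps)))
        acc += lut[addr] << b
    return acc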