You have two requirements here:
1. You need to make sure you sample the data at the correct time according to the bus needs.
2. You need to synchronize this data into your system clock domain.
There are a couple of ways to meet these needs.
First, if the bus clock is sufficiently slower than your system clock, you can synchronize the bus clock into your clock domain with a double flop, then use a simple edge detector to find its rising edge. That detected edge is used to safely sample the data line of the bus. Note that this achieves #2 automatically. Also, as noted here, this is the preferred way to do this in an FPGA.
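A minimal sketch of the double flop plus edge detector described above. The module name and the signals `rx_bit`, `rx_bit_valid`, `bclk_meta`, `bclk_sync` and `bclk_prev` are illustrative names, not taken from the question. It assumes the data line stays stable for a few system clocks after the bus clock edge, which is the "slow enough" condition.

```verilog
// Sketch only: clk is the system clock, bclk and data are the bus pins.
module bus_sampler (
    input  wire clk,
    input  wire bclk,
    input  wire data,
    output reg  rx_bit,        // sampled data bit, now in the clk domain
    output reg  rx_bit_valid   // one-clk pulse when rx_bit is updated
);
    reg bclk_meta = 1'b0;  // first flop, may go metastable
    reg bclk_sync = 1'b0;  // second flop, safe to use
    reg bclk_prev = 1'b0;  // previous synchronized value

    // Double flop synchronizer plus one history register
    always @(posedge clk) begin
        bclk_meta <= bclk;
        bclk_sync <= bclk_meta;
        bclk_prev <= bclk_sync;
    end

    // Rising edge: synchronized bclk is 1 now and was 0 one clk ago
    wire bclk_rise = bclk_sync & ~bclk_prev;

    // Sample the data line only on the detected edge
    always @(posedge clk) begin
        rx_bit_valid <= bclk_rise;
        if (bclk_rise)
            rx_bit <= data;
    end
endmodule
```

Some designs also pass `data` through its own pair of flops so that it is delayed by the same amount as the clock; whether that is needed depends on the bus timing.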
This method has a drawback: the data is sampled somewhere between two and three system clocks after the actual 'on the pin' bus clock edge. If that is too late (because your system clock is not fast enough in comparison), you have to take an alternative approach.
In the alternative method, you sample before synchronizing to the system clock domain. The reason is to make sure you are sampling at the right time according to the bus.
always @(posedge bclk) begin // positive edge of bus clock
    sampled_data <= data;
end
At this point, you have sampled_data which is a signal in the bclk domain. You need to synchronize it to your system clock domain. To do this, you have to use handshaking or a FIFO.
One way that works is to do the shift register in the bus clock domain to get to parallel data. Then pass it through a dual clock FIFO to the other domain. FPGAs have primitives just for this use.
// Shift the serial data into a parallel word in the bclk domain.
// Assumes fifo_d is 32 bits wide and data_count is a 5-bit counter that wraps from 31 to 0.
always @(posedge bclk) begin
    fifo_d <= {fifo_d[30:0], data};
    data_count <= data_count + 1;
    if (data_count == 31)
        fifo_wr <= 1'b1; // High for one bclk cycle while fifo_d holds the complete word at the dual clock FIFO input
    else
        fifo_wr <= 1'b0;
end
// Read the data out of the FIFO in the system clock domain.
always @(posedge clk) begin
    if (fifo_ready)
        synchronized_data <= fifo_q; // Now the data is in your domain.
end
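For the handshaking option mentioned earlier, here is a rough sketch of one simple variant: a hold register plus a toggle flag that is synchronized into the system clock domain. It is a request-only scheme with no acknowledge path, so it assumes words arrive many system clocks apart (for example one word every 32 bus clocks). The module and all signal names are hypothetical.

```verilog
// Sketch only: passes a parallel word from the bclk domain to the clk domain.
module word_handshake (
    input  wire        bclk,
    input  wire        clk,
    input  wire [31:0] word_in,        // complete word, bclk domain
    input  wire        word_in_valid,  // one-bclk pulse with word_in
    output reg  [31:0] word_out,       // same word, clk domain
    output reg         word_out_valid  // one-clk pulse with word_out
);
    reg [31:0] hold       = 32'd0;
    reg        req_toggle = 1'b0;

    // bclk domain: freeze the word and flip the request flag
    always @(posedge bclk) begin
        if (word_in_valid) begin
            hold       <= word_in;
            req_toggle <= ~req_toggle;
        end
    end

    // clk domain: double flop the flag, keep one extra stage for edge detect
    reg [2:0] req_sync = 3'b000;
    wire      req_edge = req_sync[2] ^ req_sync[1];

    always @(posedge clk) begin
        req_sync       <= {req_sync[1:0], req_toggle};
        word_out_valid <= req_edge;
        if (req_edge)
            word_out <= hold;  // hold has been stable for at least two clk cycles
    end
endmodule
```

The path from `hold` to `word_out` still crosses clock domains, so it needs a suitable timing constraint; a dual clock FIFO avoids the rate assumption entirely.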
Other notes:
As with any I/O at the edge of the FPGA as well as between clock domains, you will need to correctly define the timing constraints. Describing the details is a bit too broad for this forum, but as recommended in the comments by Greg, this paper is a good source for understanding the needs for the clock domain crossing. The FPGA vendors tend to have decent write-ups for input delay and output delay definitions as well.
It should be possible to unroll this, but it will require 64*16 = 1024 MAC operations per clock cycle. Think about it like this:
y[n] = a0 * x[n] + a1 * x[n-1] + ... + a63 * x[n-63]
That's the filter operation that you need to do. Let's simplify that a bit and only consider the first 3 terms:
y[n] = a0 * x[n] + a1 * x[n-1] + a2 * x[n-2]
Each -1 is one clock cycle of delay. If you get one x value per clock cycle, then you can implement that directly with three multipliers and three registers to store the x values. However, if you get two x values per clock cycle, you also need to produce two y values per clock cycle. In that case, you need to do something like this, presuming your input values are x[2n] and x[2n+1]:
y[2n] = a0 * x[2n] + a1 * x[2n+1-2] + a2 * x[2n-2]
y[2n+1] = a0 * x[2n+1] + a1 * x[2n] + a2 * x[2n+1-2]
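The two equations above map fairly directly to hardware. The following is a sketch under assumed widths (`DW`, `CW`) and made-up port names; it registers the outputs, so each y pair appears one clock after its x pair.

```verilog
// Sketch only: 3-tap FIR processing two samples per clock.
module fir3_par2 #(
    parameter DW = 16,  // sample width (assumed)
    parameter CW = 16   // coefficient width (assumed)
) (
    input  wire                    clk,
    input  wire signed [DW-1:0]    x_even,  // x[2n]
    input  wire signed [DW-1:0]    x_odd,   // x[2n+1]
    input  wire signed [CW-1:0]    a0,
    input  wire signed [CW-1:0]    a1,
    input  wire signed [CW-1:0]    a2,
    output reg  signed [DW+CW+1:0] y_even,  // y[2n]
    output reg  signed [DW+CW+1:0] y_odd    // y[2n+1]
);
    // One clock of delay moves the index back by 2 samples
    reg signed [DW-1:0] x_even_d = 0;  // x[2n-2]
    reg signed [DW-1:0] x_odd_d  = 0;  // x[2n+1-2]

    always @(posedge clk) begin
        x_even_d <= x_even;
        x_odd_d  <= x_odd;
        // Six multipliers: 2 parallel samples * 3 taps
        y_even <= a0 * x_even + a1 * x_odd_d + a2 * x_even_d;
        y_odd  <= a0 * x_odd  + a1 * x_even  + a2 * x_odd_d;
    end
endmodule
```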
And you can continue this for more inputs:
y[3n] = a0 * x[3n] + a1 * x[3n+2-3] + a2 * x[3n+1-3]
y[3n+1] = a0 * x[3n+1] + a1 * x[3n] + a2 * x[3n+2-3]
y[3n+2] = a0 * x[3n+2] + a1 * x[3n+1] + a2 * x[3n]
Note that in this case, each clock cycle of delay is NOT a delay of 1 sample, so I have written each index as the original index plus the delay. So for example, 2n becomes 2n-2 after one cycle of delay, and 2n+1 becomes 2n+1-2. You can scale this pattern to what you need; however, I would recommend using a Python script or similar to generate your HDL, as this would be a nightmare to implement manually.
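To show how the pattern scales, here is a sketch that expresses the same indexing with loops in the HDL itself rather than with a generator script. It is an illustration, not a tuned design: the flattened port layout, the parameter names and the accumulator width `AW` are assumptions, it requires `N >= 2`, and a real implementation would likely need pipelining between the multipliers and the adders.

```verilog
// Sketch only: N-tap FIR processing P samples per clock (P*N multipliers).
module fir_par #(
    parameter P  = 16,  // samples per clock
    parameter N  = 64,  // filter length, must be >= 2
    parameter DW = 16,  // sample width (assumed)
    parameter CW = 16,  // coefficient width (assumed)
    parameter AW = 40   // accumulator width, assumed >= DW+CW+log2(N)
) (
    input  wire            clk,
    input  wire [P*DW-1:0] x_flat,  // sample k at [k*DW +: DW], k = 0 is oldest
    input  wire [N*CW-1:0] a_flat,  // coefficient j at [j*CW +: CW]
    output reg  [P*AW-1:0] y_flat   // output k at [k*AW +: AW]
);
    // The last N-1 samples from previous clocks, oldest in the low bits
    reg  [(N-1)*DW-1:0]   hist = 0;
    // History followed by the P new samples: window index N-1+k is new sample k
    wire [(N-1+P)*DW-1:0] window = {x_flat, hist};

    integer k, j;
    reg signed [AW-1:0] acc;  // temporary used inside the loops

    always @(posedge clk) begin
        // Keep the newest N-1 samples for the next clock
        hist <= window[P*DW +: (N-1)*DW];
        for (k = 0; k < P; k = k + 1) begin
            acc = 0;
            // Output k uses the sample j positions before new sample k for tap j
            for (j = 0; j < N; j = j + 1)
                acc = acc + $signed(window[(N-1+k-j)*DW +: DW])
                          * $signed(a_flat[j*CW +: CW]);
            y_flat[k*AW +: AW] <= acc;
        end
    end
endmodule
```

With `P = 2` and `N = 3` this reduces to the two-sample equations given earlier.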
All in all, you will need parallel sample count * filter length MAC operations. Note that it may be possible in some cases to do two MAC operations in one DSP slice if it has a pre-adder and your filter coefficient list has a symmetry that you can exploit. So if you are using a modern Xilinx chip, it may be possible to implement this in 512 DSP slices.
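As an illustration of the symmetry trick, consider a 4-tap filter with coefficients a0, a1, a1, a0. The two samples that share a coefficient can be added first, so only two multipliers are needed instead of four. The sketch below handles one sample per clock and uses assumed widths and names; whether the synthesis tool actually maps the addition onto the DSP slice's pre-adder depends on the tool and the device.

```verilog
// Sketch only: symmetric 4-tap FIR, y[n] = a0*(x[n]+x[n-3]) + a1*(x[n-1]+x[n-2]).
module fir4_sym #(
    parameter DW = 16,  // sample width (assumed)
    parameter CW = 16   // coefficient width (assumed)
) (
    input  wire                    clk,
    input  wire signed [DW-1:0]    x,
    input  wire signed [CW-1:0]    a0,
    input  wire signed [CW-1:0]    a1,
    output reg  signed [DW+CW+1:0] y
);
    reg signed [DW-1:0] x1 = 0;  // x[n-1]
    reg signed [DW-1:0] x2 = 0;  // x[n-2]
    reg signed [DW-1:0] x3 = 0;  // x[n-3]

    // Pre-adders: one extra bit to hold the sum of two samples
    wire signed [DW:0] pre0 = x  + x3;
    wire signed [DW:0] pre1 = x1 + x2;

    always @(posedge clk) begin
        x1 <= x;
        x2 <= x1;
        x3 <= x2;
        y  <= a0 * pre0 + a1 * pre1;  // two multipliers instead of four
    end
endmodule
```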
Edit: Here's another option that's a little crazy, but it might be worth looking at. It is possible to build an FIR filter without using any DSP slices that's still reasonably fast. It's called a distributed arithmetic (DA) filter. The tradeoff is that, in its bit-serial form, for a sample width of M bits it requires M clock cycles to compute the next sample. Since you're already doing 16 samples in parallel, it might be worth trying a distributed arithmetic implementation that's 16*M in parallel. 16 bit samples * 16 samples would only be 256 parallel DA filter implementations. I have not done much with distributed arithmetic so I'm not sure exactly how well it scales, but it's another possible way to implement your filter. I'm not sure what FPGA you're using, but it's possible that you won't have enough multipliers to build a more standard design with DSP slices, in which case DA may be the only option.
Best Answer
You say you have a frequency divider, but that is just the beginning. Indeed, you have to add a synchroniser for the serial input. I looked at the datasheet and you need an SPI interface without the transmit part. That means you also need a chip select and a serial/parallel converter. I am not going to write that for you (after all, that is what I earn my money with), so I am going to give you the most important snippets:
The Maxim datasheet says the data changes at most 40 ns after the falling clock edge, so pick it up just before that edge.