A couple of approaches that may be useful for some styles of display are to divide the display panel into tiles, and
- restrict each tile to using a small set of colors, allowing the use of fewer than 8 bits per pixel, or
- use a byte or two from each tile to select a location from which to read bitmap data.
The first approach can reduce the rate at which data must be read from display memory. For example, if one used 16x16 tiles that could each have four colors chosen from a set of 256, then without using any extra RAM in the FPGA one could reduce the number of memory reads per 16 pixels to eight (four color values, plus four bytes for the bitmap). If one added 160 bytes' worth of buffering/RAM(*) to the FPGA, one could reduce the number of memory reads per 16 pixels to four, at the cost of an extra 160 reads every 16 scan lines to fetch the next set of tile colors. If one wanted 16 colors per tile, this buffered variant would require an extra 640 bytes of RAM unless one placed some restrictions on the number of different palettes that could exist on a line.
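To make that arithmetic concrete, here is a small Python sketch of the read counts above. Python is just being used as a calculator here; the 640-pixel line width is an assumption implied by the 160-byte figure (40 tiles of 4 color bytes each).

```python
# Read-count arithmetic for the tiled-color approach described above.
# Assumed: 16x16 tiles, 4 colors per tile from a 256-entry set, and a
# 640-pixel-wide line (implied by the 160-byte buffer figure).
tile_width = 16
colors_per_tile = 4
bits_per_pixel = 2                                 # 4 colors -> 2 bits/pixel

bitmap_bytes = tile_width * bits_per_pixel // 8    # 4 bitmap bytes per tile row
reads_unbuffered = colors_per_tile + bitmap_bytes  # 8 reads per 16 pixels
reads_buffered = bitmap_bytes                      # 4 reads per 16 pixels

tiles_per_line = 640 // tile_width                       # 40 tiles across
color_buffer_bytes = tiles_per_line * colors_per_tile    # 160 bytes of buffering
buffer_16_colors = tiles_per_line * 16                   # 640 bytes for 16 colors/tile

print(reads_unbuffered, reads_buffered, color_buffer_bytes, buffer_16_colors)
```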
The second approach would probably increase rather than reduce the total memory bandwidth required to produce a display, but it would reduce the amount of memory that has to be updated to change the display--one could change a byte or two to update an 8x8 or 16x16 area of the screen. Depending upon what you're trying to display, it may be helpful with this style of approach to use one memory device to hold the tile shapes and another to hold the tile selection. One might, for example, use a fast 32Kx8 RAM to hold a couple of 80x60 tile maps with two bytes per tile. If the FPGA didn't have any buffering, it would have to read one byte every four pixels; even with a 40ns static RAM, that would leave plenty of time for the CPU to update the display (an entire screen would only be 9600 bytes). The memory bandwidth for reading out the tile shapes would be no better than it is now, but that part of memory wouldn't have to be updated.
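The figures in that paragraph can be checked with a quick Python sketch. The 8-pixel tile width is inferred from the one-byte-per-four-pixels figure (two map bytes per 8-pixel tile); only the map dimensions, byte counts, and RAM size come from the text.

```python
# Tile-map sizing for the second approach (Python as a calculator).
# Assumed: 8x8 tiles, so an 80x60 map covers a 640x480 display.
tiles = 80 * 60
bytes_per_tile = 2

map_bytes = tiles * bytes_per_tile         # 9600 bytes for a whole screen
maps_in_32k = 32768 // map_bytes           # a 32Kx8 RAM holds a few full maps

tile_width = 8
pixels_per_map_byte = tile_width // bytes_per_tile   # one map byte every 4 pixels

print(map_bytes, maps_in_32k, pixels_per_map_byte)
```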
Incidentally, if one didn't want to add a 32Kx8 RAM but could add 320 bytes of buffering/RAM(**) to the FPGA, one could use a tile-map approach but have the CPU or DMA feed 160 bytes to the display every 8 scan lines. That would burden the controller somewhat even when nothing on the display was changing, but it could simplify the circuitry.
(*) The buffer could be implemented as RAM, or as a sequence of 32 40-bit-long shift registers plus a little control logic.
(**) The buffer could be implemented as two 160-byte RAMs, or as two groups of sixteen 80-bit shift registers.
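The feed rate for that buffered tile-map variant works out as follows (a Python sketch; the 60Hz refresh and 480 visible lines are assumptions--only the 160- and 320-byte figures come from the text).

```python
# Feed-rate arithmetic for the 320-byte buffered tile-map variant.
# Assumed: 8-scan-line tile rows, 60 Hz refresh, 480 visible lines.
tiles_per_row = 80
bytes_per_tile = 2

row_bytes = tiles_per_row * bytes_per_tile   # 160 bytes per tile-map row
buffer_bytes = 2 * row_bytes                 # 320 bytes: one row displayed,
                                             # one row being filled

rows_per_frame = 480 // 8                    # one transfer every 8 scan lines
feed_bytes_per_second = row_bytes * rows_per_frame * 60

print(row_bytes, buffer_bytes, feed_bytes_per_second)
```

So the standing cost of this scheme is a steady feed of well under a megabyte per second, which is the "burden on the controller even when nothing is changing" mentioned above.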
You can create as many clocks as you want, and you can use PLLs or DCMs to create arbitrary clocks. The question is whether you need to, or if you should be doing it a different way.
I find that I end up running as much logic as possible at a common or "core" clock frequency, say the 54MHz that you are using, but I need to trigger certain processes to run periodically: say, a 100ms debounce, a 10kHz PWM update, a 1s timer tick for a wall clock--you get the idea. Instead of generating these clocks, I run everything at the core clock frequency and generate arbitrary clock enable signals.
You generally don't want to create divided clocks, for several reasons: logic-generated clocks are jittery, and the tools may end up routing these "clock" signals along routing paths intended for logic (since they're generated from logic). As mentioned above and by others, PLLs and DCMs are much better options if you really need to generate a different clock.
A clock enable is what you want. The device primitives have an additional clock enable signal which "gates" the clock signal, allowing it to propagate into the primitive or not. When the clock enable is negated, the FF doesn't see the clock and effectively holds its state as if the clock pulse never occurred. When the clock enable is asserted, the FF sees the clock normally and things proceed as expected. Clock enables are designed specifically to control an FF's access to its clock and as such don't have issues with generating runt clocks. They also don't take up any additional resources, so use them.
For example, here is a clock generated in logic. This is bad; don't do this:
gen_100ms_clk : process (clk, rst)
    constant CTR_MAX : integer := 5399999;  -- 54 MHz * 100 ms - 1
    variable ctr : integer range 0 to CTR_MAX;
begin
    if rst = '1' then
        ctr := 0;
        slow_clk <= '0';
    elsif rising_edge(clk) then
        if ctr = CTR_MAX then
            slow_clk <= not slow_clk;
            ctr := 0;
        else
            ctr := ctr + 1;
        end if;
    end if;
end process gen_100ms_clk;
This code has the slow_clk signal toggle state every 100ms; this signal would be a poor choice to use as the clock of a new process, such as here:
do_100ms : process (slow_clk, rst)
begin
    if rising_edge(slow_clk) then
        ...
    end if;
end process do_100ms;
This is bad because the FFs in the do_100ms() process are clocked by a signal created through the logic in the gen_100ms_clk() process.
Instead, use a clock enable, as shown here:
gen_100ms_ce : process (clk, rst)
    constant CTR_MAX : integer := 5399999;  -- 54 MHz * 100 ms - 1
    variable ctr : integer range 0 to CTR_MAX;
begin
    if rst = '1' then
        ctr := 0;
        ce_100ms <= '0';
    elsif rising_edge(clk) then
        if ctr = CTR_MAX then
            ce_100ms <= '1';
            ctr := 0;
        else
            ce_100ms <= '0';
            ctr := ctr + 1;
        end if;
    end if;
end process gen_100ms_ce;
Now gen_100ms_ce() creates a ce_100ms signal that is high for 1T every 100ms. This is a great way to signal to your code that it's time to do something:
do_100ms : process (clk, rst)
begin
    if rising_edge(clk) then
        if ce_100ms = '1' then
            ...
        end if;
    end if;
end process do_100ms;
Now your do_100ms() process is running at the same 54MHz clock as everything else, and it uses a proper clock enable to trigger whatever you want to happen every 100ms.
Take a look at the RTL output of your toolset; you'll see that the primitive used in your do_100ms()
process will use its clock enable signal.
This method also achieves power savings, since there will be large swaths of logic that stay static for long stretches even though the global clock net is wiggling away at 54MHz in your case. Once every 100ms in my example above, all the FFs gated with the 100ms enable become active for 1T and then sit static again for another 99.9999815ms. :-) CMOS consumes very little power when it's not changing state, so the only power consumed in the logic with the gated-off clock is its leakage current.
You can extend this into a full-out means of power management: create clock enables for all the subsystems, and your power manager negates the clock enable for whichever subsections you don't want powered.
One possible strategy could be to use stale bits. Dunno if that's standard terminology, but it's similar to a dirty bit. Writing a new entry clears the corresponding stale bit in the unlocked buffer and sets the bit in the locked buffer. After switching buffers, have an internal copy routine transfer every entry marked stale in the newly unlocked buffer from the locked buffer to the unlocked buffer. In this way, new data written while the copy is in progress will not be overwritten, and all the old updates are retained. You only need to ensure that there is enough time for the copy operation to complete between buffer switches, or add some bookkeeping that tracks which entries are stale so the copy doesn't have to iterate over all of them.
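A minimal Python sketch of this scheme, with all names invented; as a simplification the copy pass runs synchronously at the swap here, whereas in the real design it would proceed in the background while new writes arrive (the stale-bit clearing on write is what makes that safe).

```python
class StaleBitDoubleBuffer:
    """Stale-bit double buffer: writers touch only the unlocked buffer;
    a copy pass after each swap refreshes entries that went stale."""

    def __init__(self, size):
        self.bufs = [[0] * size, [0] * size]
        # stale[i][j] == True means bufs[i][j] is outdated
        self.stale = [[False] * size, [False] * size]
        self.unlocked = 0  # index of the buffer writers may touch

    def write(self, idx, value):
        u, l = self.unlocked, 1 - self.unlocked
        self.bufs[u][idx] = value
        self.stale[u][idx] = False  # this copy is now fresh
        self.stale[l][idx] = True   # the locked copy is now outdated

    def swap(self):
        # Lock the current buffer, unlock the other, then run the copy
        # pass on the newly unlocked buffer (synchronous simplification).
        self.unlocked = 1 - self.unlocked
        u, l = self.unlocked, 1 - self.unlocked
        for j, is_stale in enumerate(self.stale[u]):
            if is_stale:
                self.bufs[u][j] = self.bufs[l][j]
                self.stale[u][j] = False

buf = StaleBitDoubleBuffer(4)
buf.write(1, 42)   # lands in buffer 0
buf.swap()         # buffer 0 locked; copy pass brings buffer 1 up to date
buf.write(2, 7)    # lands in buffer 1
buf.swap()         # copy pass refreshes index 2 in buffer 0
print(buf.bufs[buf.unlocked])
```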
Another possible strategy could be to store the updated entries while a single buffer is locked, then apply only those updates when it is unlocked. If only a handful of entries are updated, then this might be more efficient. The updates could be stored as a linked list or similar data structure so that the list can be traversed efficiently while multiple updates to the same location can be coalesced.
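A small Python sketch of this second strategy, with hypothetical names; a dict stands in for the linked list mentioned above, since it naturally coalesces repeated writes to the same location.

```python
class UpdateLogBuffer:
    """Single buffer plus a log of updates made while it is locked;
    the log is applied (coalesced) when the buffer is unlocked."""

    def __init__(self, size):
        self.buf = [0] * size
        self.locked = False
        self.pending = {}  # index -> latest value written while locked

    def write(self, idx, value):
        if self.locked:
            self.pending[idx] = value  # coalesce: the latest write wins
        else:
            self.buf[idx] = value

    def lock(self):
        self.locked = True

    def unlock(self):
        # Apply only the handful of stored updates, then clear the log.
        for idx, value in self.pending.items():
            self.buf[idx] = value
        self.pending.clear()
        self.locked = False

b = UpdateLogBuffer(4)
b.lock()
b.write(0, 1)
b.write(0, 5)   # coalesces with the earlier write to index 0
b.write(3, 9)
b.unlock()      # only two distinct updates are applied
print(b.buf)
```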