Electronic – vhdl synthesis optimization: counters in statemachines

fpga, vhdl, xilinx

I have a general question about the efficiency of a synthesizable state machine.

The first version uses the same counter for every state.
The second uses a separate counter for each state.
Which of the two versions is more efficient (logic area, speed, ...)?

How much FPGA area is taken up by routing the count1 signal when I use the same counter for every state?
Is it better to use one counter per state?

I hope somebody with more experience can explain which solution is best (maybe a third version) and why.

Thank you!

Kind Regards,

Oliver

-- 1. Version ===============================================================

signal count1: integer range 0 to 1000 := 1000;  
type mystates is (s1, s2, s3, s4);  
signal mymode: mystates := s1;  

BEGIN  

MyProcess: process(clk)  

BEGIN  

    IF (clk'event and clk = '1') THEN  

        case mymode is  
        when s1 =>   
            If (count1 = 0) then  
            mymode <= s2;
            count1 <= 555;
            -- (stuff)
            else  
            count1 <= count1 - 1;  
            end if;  
        when s2 => 
            If (count1 = 0) then  
            mymode <= s3;
            count1 <= 666;
            -- (stuff)
            else  
            count1 <= count1 - 1;  
            end if; 
        when s3 => 
            If (count1 = 0) then  
            mymode <= s4;
            count1 <= 784;
            -- (stuff)
            else  
            count1 <= count1 - 1;  
            end if; 
        when s4 =>  
            If (count1 = 0) then  
            mymode <= s1;
            count1 <= 1000;
            -- (stuff)
            else  
            count1 <= count1 - 1;  
            end if; 
        when others =>  
            Null;  
        end case;  

    END IF;

end process;

-- 2. Version ===============================================================

signal count1, count2, count3, count4: integer range 0 to 1000 := 1000;  
type mystates is (s1, s2, s3, s4);  
signal mymode: mystates := s1;  

BEGIN  

MyProcess: process(clk)  

BEGIN  

    IF (clk'event and clk = '1') THEN  

        case mymode is  
        when s1 =>   
            If (count1 = 0) then  
            mymode <= s2;
            count1 <= 555;
            -- (stuff)
            else  
            count1 <= count1 - 1;  
            end if;  
        when s2 => 
            If (count2 = 0) then  
            mymode <= s3;
            count2 <= 666;
            -- (stuff)
            else  
            count2 <= count2 - 1;  
            end if; 
        when s3 => 
            If (count3 = 0) then  
            mymode <= s4;
            count3 <= 784;
            -- (stuff)
            else  
            count3 <= count3 - 1;  
            end if; 
        when s4 =>  
            If (count4 = 0) then  
            mymode <= s1;
            count4 <= 1000;
            -- (stuff)
            else  
            count4 <= count4 - 1;  
            end if; 
        when others =>  
            Null;  
        end case;  

    END IF;

end process;

Best Answer

The first version is going to be more efficient in terms of area and speed, but neither one is very good, IMHO.

A very quick way to estimate the size/speed of something is to think about each output signal and how many inputs the output is derived from. For example: a <= b xor c. We say that 'a' is a function of 2 signals (b and c). The more "inputs" there are, the more logic is required and the slower it will run. Keep in mind that this is a super rough estimate, but is useful for, well, making super rough estimates.

In version 1, you have your mymode "output", which is a function of many inputs. The important input is count1, which is a 10-bit signal (its range is 0 to 1000). So, without considering the other signals, you can say that mymode is a function of at least 10 inputs. Version 2, on the other hand, has 4 count inputs (each 10 bits), so mymode is a function of at least 40 inputs. That's a lot of inputs, and it will create a lot of logic that runs slowly.

Now, here's a super easy way to make the logic for both versions smaller and faster. For this example, I'm just going to do a simple counter that "does something" when it finishes its count. You can adapt the same techniques to your state machine. First, here's your version:

signal count :integer range 0 to 1000;  

process (clk)
begin
  if rising_edge(clk) then
    if load='1' then
      count <= some_constant_value;
    elsif count=0 then  -- This line is important
      do_something_here;  -- placeholder for the real action
    else
      count <= count - 1;
    end if;
  end if;
end process;

And here is my version:

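-- Needs ieee.numeric_std for the signed/to_signed conversions below.
signal count :std_logic_vector (10 downto 0);  -- Note:  1 extra bit

process (clk)
begin
  if rising_edge(clk) then
    if load='1' then
      count <= std_logic_vector(to_signed(some_constant_value - 1, count'length));
    elsif count(count'high)='1' then  -- This line is important
      do_something_here;  -- placeholder for the real action
    else
      count <= std_logic_vector(signed(count) - 1);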
    end if;
  end if;
end process;

Your version counts from N down to 0, while mine counts from N-1 down to -1. To make this work, I made the count signal 1 bit larger and also a std_logic_vector (SLV) instead of an integer. But where this really makes things faster is that your version is doing an N-bit comparison, while mine is just checking a single bit: the extra MSB acts as a sign bit and is only set once the count has gone past 0 to -1. In essence, I am using the carry-chain logic from the line "count <= count - 1" to also do my comparison. The carry-chain logic is already there for the counter; I'm just making it one bit longer. Since the carry-chain logic is super fast in an FPGA, and you're already using it, the resulting logic is super small and super fast.

For our 10 bit counter, the line "elsif count=0 then" would require three 4-input LUTs and 2 levels of logic in a Xilinx Spartan-3. My version requires 1 Flip-Flop (and associated carry-chain that would have otherwise gone unused) and essentially 0 levels of logic.

But let's say that the counter was 32 bits. Your version would require 11 LUTs and 3 levels of logic. Mine would stay the same at 1 FF and 0 levels.
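Those LUT counts follow from treating the "count=0" test as a tree of k-input LUTs that reduces N bits to a single bit. As a rough sketch, assuming plain LUTs only (a real synthesizer may also use wide-function muxes or the carry chain, so actual numbers can differ):

```latex
\[
\text{LUTs} = \left\lceil \frac{N-1}{k-1} \right\rceil ,
\qquad
\text{levels} = \left\lceil \log_k N \right\rceil
\]
% k = 4, N = 10:  ceil(9/3)  = 3 LUTs,  ceil(log_4 10) = 2 levels
% k = 4, N = 32:  ceil(31/3) = 11 LUTs, ceil(log_4 32) = 3 levels
```

Each k-input LUT replaces k signals with one, so it removes k-1 signals from the total; getting from N signals down to 1 therefore takes at least (N-1)/(k-1) LUTs, rounded up.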

When you apply my method to the two versions of your state machine, each version will work. Version 1 is still the better approach, but in some circumstances you can't have a single counter and so you must use Version 2. With my method, instead of having a function of at least 40 inputs, you have a function of at least 4 inputs. 4 is much better than 40!
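As a rough illustration of how the MSB test might be applied to Version 1, here is a minimal sketch. The entity name shared_counter_fsm and the in_s1 output are made up for the example, and hoisting the "done" test out of the case statement is my own restructuring, not something from the question. Each reload value is one less than in the original, so every state still lasts the same number of clocks.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical wrapper entity; only the process body matters.
entity shared_counter_fsm is
  port (
    clk   : in  std_logic;
    in_s1 : out std_logic   -- assumed output so the logic is not trimmed
  );
end entity shared_counter_fsm;

architecture rtl of shared_counter_fsm is
  type mystates is (s1, s2, s3, s4);
  signal mymode : mystates := s1;
  -- 11 bits: 10 magnitude bits plus a sign bit that doubles as the "done" flag
  signal count1 : signed(10 downto 0) := to_signed(1000 - 1, 11);
begin

  in_s1 <= '1' when mymode = s1 else '0';

  process (clk)
  begin
    if rising_edge(clk) then
      if count1(count1'high) = '1' then      -- count has stepped from 0 to -1
        case mymode is
          when s1 =>
            mymode <= s2;
            count1 <= to_signed(555 - 1, count1'length);
          when s2 =>
            mymode <= s3;
            count1 <= to_signed(666 - 1, count1'length);
          when s3 =>
            mymode <= s4;
            count1 <= to_signed(784 - 1, count1'length);
          when s4 =>
            mymode <= s1;
            count1 <= to_signed(1000 - 1, count1'length);
        end case;
      else
        count1 <= count1 - 1;
      end if;
    end if;
  end process;

end architecture rtl;
```

With this shape, the next-state logic for mymode depends on the current state plus one counter bit rather than on all ten.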

Related Solutions

Electronic – VHDL – How to reduce signal’s dependencies and optimize speed

In general, the usual answer to this sort of problem is to pipeline. You might consider adding pipeline registers immediately after the 10-bit comparators, before the logic that combines them into the enable signal for the next stage. To keep the resulting enable signal aligned with the correct data in the data path, you'll probably also need a pipeline register for the data, too.

But yes, you can also use the technique described in the other question. For your specific 10-bit counter example, instead of counting from 0 to 1003 and using a comparator to identify state 999 to turn off the enable signal, you could make it an 11-bit counter that counts from -1000 to 3. The MSB of this counter is your enable signal, and when the count gets to 3^[1], you reload the counter with -1000 ... and also load an auxiliary 9-bit count-down counter with the value 249. Each time this auxiliary counter reaches -1 (MSB set) is the start of another subframe (in addition to the one that starts at the beginning of the main frame).

^[1]Note that detecting "3" is a function of just 3 bits — the MSB and the two LSBs — not a function of 11 bits.
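A minimal sketch of the main counter described above, leaving out the auxiliary subframe counter. The entity name frame_counter and its ports are assumptions for illustration; only the -1000 to 3 range, the MSB-as-enable idea, and the 3-bit terminal detect come from the answer.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: counts -1000 .. 3 and exposes the sign bit as enable.
entity frame_counter is
  port (
    clk    : in  std_logic;
    enable : out std_logic
  );
end entity frame_counter;

architecture rtl of frame_counter is
  signal count : signed(10 downto 0) := to_signed(-1000, 11);
begin

  -- The MSB is '1' for every negative value, i.e. for the first 1000 counts.
  enable <= count(count'high);

  process (clk)
  begin
    if rising_edge(clk) then
      -- "count = 3" detected from 3 bits only: MSB clear, both LSBs set.
      -- This is safe because the counter never gets past 3.
      if count(count'high) = '0' and count(1) = '1' and count(0) = '1' then
        count <= to_signed(-1000, count'length);
      else
        count <= count + 1;
      end if;
    end if;
  end process;

end architecture rtl;
```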

Electronic – VHDL: receive module randomly fails when counting bits

I don't see a synchronizer on the rx data line.

All asynchronous inputs must be synchronized to the sampling clock. There are a couple of reasons for this: metastability and routing. These are different problems but are inter-related.

It takes time for signals to propagate through the FPGA fabric. The clock network inside the FPGA is designed to compensate for these "travel" delays so that all flip flops within the FPGA see the clock at the exact same moment. The normal routing network does not have this, and instead relies on the rule that all signals must be stable for a little bit of time before the clock changes and remain stable for a little bit of time after the clock changes. These little bits of time are known as the setup and hold times for a given flip flop. The place and route component of the toolchain has a very good understanding of the routing delays for the specific device and makes a basic assumption that a signal does not violate the setup and hold times of the flip flops in the FPGA. With that assumption and knowledge (and a timing constraints file) it can properly place the logic within the FPGA and ensure that all the logic that looks at a given signal sees the same value at every clock tick.

When you have signals that are not synchronized to the sampling clock you can end up in the situation where one flip flop sees the "old" value of a signal since the new value has not had time to propagate over. Now you're in the undesirable situation where logic looking at the same signal sees two different values. This can cause wrong operation, crashed state machines and all kinds of hard to diagnose havoc.

The other reason why you must synchronize all your input signals is something called metastability. There are volumes written on this subject but in a nutshell, digital logic circuitry is at its most basic level an analog circuit. When your clock line rises the state of the input line is captured and if that input is not a stable high or low level at that time, an unknown "in-between" value can be captured by the sampling flip flop.

As you know, FPGAs are digital beasts and do not react well to a signal that is neither high nor low. Worse, if that indeterminate value makes its way past the sampling flip flop and into the FPGA it can cause all kinds of weirdness as larger portions of the logic now see an indeterminate value and try to make sense of it.

The solution is to synchronize the signal. At its most basic level this means you use a chain of flip flops to capture the input. Any metastable level that might have been captured by the first flip flop and managed to make it out gets another chance to be resolved before it hits your complex logic. Two flip flops are usually more than sufficient to synchronize inputs.

A basic synchronizer looks like this:

library ieee;
use ieee.std_logic_1164.all;

entity sync_2ff is
port (
    async_in : in std_logic;
    clk : in std_logic;
    rst : in std_logic;
    sync_out : out std_logic
);
end;

architecture a of sync_2ff is

-- Declarations belong before the architecture's "begin".
signal ff1, ff2: std_logic;

-- It's nice to let the synthesizer know what you're doing. Altera's way of doing it is as follows:
ATTRIBUTE altera_attribute : string;
ATTRIBUTE altera_attribute OF ff1 : signal is "-name SYNCHRONIZER_IDENTIFICATION ""FORCED IF ASYNCHRONOUS""";
ATTRIBUTE altera_attribute OF a : architecture is "-name SDC_STATEMENT ""set_false_path -to *|sync_2ff:*|ff1 """;

-- also set the 'preserve' attribute to ff1 and ff2 so the synthesis tool doesn't optimize them away
ATTRIBUTE preserve: boolean;
ATTRIBUTE preserve OF ff1: signal IS true;
ATTRIBUTE preserve OF ff2: signal IS true;

begin

synchronizer: process(clk, rst)
begin
if rst = '1' then
    ff1 <= '0';
    ff2 <= '0';
elsif rising_edge(clk) then
    ff1 <= async_in;
    ff2 <= ff1;
end if;
end process synchronizer;

-- Drive the output straight from the second flip-flop, so the chain is
-- exactly two registers and every register in the process is reset.
sync_out <= ff2;

end a;

Connect the physical pin for the N64 controller's rx data line to the async_in input of the synchronizer, and connect the sync_out signal to your UART's rxd input.
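That hookup might look like the sketch below. The wrapper name rx_top and the port names rx_pin and rxd_sync are placeholders, not names from the original design, and it assumes sync_2ff has been compiled into the work library; rxd_sync is the signal you would then feed to the UART's rxd input.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical top-level fragment showing only the synchronizer hookup.
entity rx_top is
  port (
    clk      : in  std_logic;
    rst      : in  std_logic;
    rx_pin   : in  std_logic;   -- raw, asynchronous rx data from the pin
    rxd_sync : out std_logic    -- synchronized copy for the UART's rxd input
  );
end entity rx_top;

architecture rtl of rx_top is
begin

  rx_sync : entity work.sync_2ff
    port map (
      async_in => rx_pin,
      clk      => clk,
      rst      => rst,
      sync_out => rxd_sync
    );

end architecture rtl;
```

Nothing else in the design should read rx_pin directly; all logic should use the synchronized copy.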

Unsynchronized signals can cause weird issues. Make sure any input connected to an FPGA element that isn't synchronized to the clock of the process reading the signal is synchronized. This includes pushbuttons, UART 'rx' and 'cts' signals... anything that is not synchronized to the clock that the FPGA is using to sample the signal.

(An aside: I wrote the page at www.mixdown.ca/n64dev many years ago. I just realized that I broke the link when I last updated the site and will fix it in the morning when I'm back at a computer. I had no idea so many people used that page!)
