The first version is going to be more efficient in terms of area and speed, but neither one is very good, IMHO.
A very quick way to estimate the size/speed of something is to think about each output signal and how many inputs the output is derived from. For example: a <= b xor c. We say that 'a' is a function of 2 signals (b and c). The more "inputs" there are, the more logic is required and the slower it will run. Keep in mind that this is a super rough estimate, but is useful for, well, making super rough estimates.
In version 1, you have your mymode "output" which is a function of many inputs. The important input is count1, which is a 10-bit signal. So, without considering the other signals, you can say that mymode is at least a function of 10 inputs. On the other hand, version 2 has 4 count inputs (each 10 bits), so it is a function of at least 40 inputs. That's a lot of inputs, and it will create a lot of logic that runs slowly.
Now, here's a super easy way to make the logic for both versions smaller and faster. For this example, I'm just going to do a simple counter that "does something" when it finishes its count. You can adapt the same techniques to your state machine. First, here's your version:
    signal count : integer range 0 to 1000;

    process (clk)
    begin
        if rising_edge(clk) then
            if load = '1' then
                count <= some_constant_value;
            elsif count = 0 then  -- This line is important
                do_something_here;  -- placeholder for your own logic
            else
                count <= count - 1;
            end if;
        end if;
    end process;
And here is my version:
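    -- Note: arithmetic on std_logic_vector assumes a package such as
    -- ieee.std_logic_unsigned (or ieee.numeric_std_unsigned in VHDL-2008).
    signal count : std_logic_vector (10 downto 0);  -- Note: 1 extra bit

    process (clk)
    begin
        if rising_edge(clk) then
            if load = '1' then
                count <= some_constant_value - 1;
            elsif count(count'high) = '1' then  -- This line is important
                do_something_here;  -- placeholder for your own logic
            else
                count <= count - 1;
            end if;
        end if;
    end process;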
Your version counts from N downto 0, while mine counts from N-1 downto -1. To make this work, I made the count signal 1 bit larger and also a SLV instead of an integer. When the counter decrements past 0 it wraps to all ones (-1 in two's complement), so the extra top bit goes to '1' exactly when the count is finished. Where this really makes things faster is that your version is doing an N-bit comparison where mine is just checking a single bit. In essence, I am using the carry-chain logic from the line "count <= count - 1" to also do my comparison. The carry-chain logic is already there for the counter, I'm just making it one bit longer. Since the carry-chain logic is super fast in an FPGA, and you're already using it, the resulting logic is super small and super fast.
For our 10-bit counter, the line "elsif count=0 then" would require three 4-input LUTs and 2 levels of logic in a Xilinx Spartan-3. My version requires 1 flip-flop (and the associated carry-chain that would have otherwise gone unused) and essentially 0 levels of logic.
But let's say that the counter was 32 bits. Your version would require 11 LUTs and 3 levels of logic. Mine would stay the same at 1 FF and 0 levels.
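The LUT counts above can be reproduced with a rough rule of thumb. This is only a sketch: it assumes the zero-compare is built as a plain tree of k-input LUTs, with N the counter width, and it ignores any tricks the synthesis tool might play (such as using the carry chain for wide compares). The symbols N and k are introduced here for illustration.

```latex
% Reducing N signals to 1 with k-input LUTs: each LUT removes (k-1) signals.
\[
  \text{LUTs} \approx \left\lceil \frac{N-1}{k-1} \right\rceil ,
  \qquad
  \text{levels} \approx \left\lceil \log_k N \right\rceil
\]
% N = 10, k = 4:
\[
  \left\lceil \tfrac{9}{3} \right\rceil = 3 \text{ LUTs}, \qquad
  \left\lceil \log_4 10 \right\rceil = \lceil 1.66 \rceil = 2 \text{ levels}
\]
% N = 32, k = 4:
\[
  \left\lceil \tfrac{31}{3} \right\rceil = 11 \text{ LUTs}, \qquad
  \left\lceil \log_4 32 \right\rceil = \lceil 2.5 \rceil = 3 \text{ levels}
\]
```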
When you apply my method to the two versions of your state machine, each version will work. Version 1 is still the better approach, but in some circumstances you can't have a single counter and so you must use Version 2. With my method, instead of having a function of at least 40 inputs, you have a function of at least 4 inputs. 4 is much better than 40!
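Here is a minimal sketch of what the "4 inputs instead of 40" idea could look like for the multi-counter case. I don't know what your state machine actually does with the counters, so the entity name, the per-counter load strobes, the RELOAD value and the done outputs are all made up for illustration. I've also used signed from numeric_std rather than SLV so the example compiles without non-standard packages.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: four independent down-counters. Each "done" flag
-- is just the sign bit of its counter, so downstream logic (e.g. a mode
-- decision) sees 4 signals instead of 4 x 10 count bits.
entity four_timers is
  port (
    clk  : in  std_logic;
    load : in  std_logic_vector(3 downto 0);  -- assumed: one load strobe per counter
    done : out std_logic_vector(3 downto 0)   -- sign bit of each counter
  );
end entity four_timers;

architecture rtl of four_timers is
  constant RELOAD : natural := 1000;          -- assumed count length
  -- 10 count bits plus 1 extra (sign) bit, as in the single-counter version
  type count_array is array (0 to 3) of signed(10 downto 0);
  -- start at -1 (all ones), i.e. every counter idle with done = '1'
  signal count : count_array := (others => (others => '1'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      for i in 0 to 3 loop
        if load(i) = '1' then
          count(i) <= to_signed(RELOAD - 1, 11);  -- count from N-1 ...
        elsif count(i)(10) = '0' then             -- sign bit clear: still counting
          count(i) <= count(i) - 1;               -- ... down to -1
        end if;                                   -- otherwise hold at -1
      end loop;
    end if;
  end process;

  -- Terminal-count detection costs no compare logic: it is the top bit.
  gen_done : for i in 0 to 3 generate
    done(i) <= count(i)(10);
  end generate gen_done;
end architecture rtl;
```

Your mode logic would then look only at done(3 downto 0) rather than comparing the full count values.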
Actually, your first guess is not as far off as some are claiming.
A CPU is built around something called an "Arithmetic Logic Unit" (ALU) and a simplistic implementation of that is to have the logic gates implementing all basic operations wired up to the inputs in parallel. All of the possible elementary computations are thus performed in parallel, with the output of the actually desired one selected by a multiplexor.
In an extremely simple (chalk-board-model) CPU, a few bits of the currently executing instruction opcode are wired to that multiplexor to tell it which logic function result to use. (The other, undesired results are simply wasted.)
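As a rough illustration of that chalk-board model, here is a sketch of a tiny ALU in VHDL. The entity name, the 8-bit width, the choice of four operations and the 2-bit opcode encoding are all assumptions made up for this example, not taken from any real CPU.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical chalk-board ALU: every operation is computed all the time,
-- and the opcode bits only steer the output multiplexer.
entity tiny_alu is
  port (
    a, b   : in  unsigned(7 downto 0);
    op     : in  std_logic_vector(1 downto 0);  -- assumed 2-bit opcode field
    result : out unsigned(7 downto 0)
  );
end entity tiny_alu;

architecture rtl of tiny_alu is
  signal r_add, r_sub, r_and, r_xor : unsigned(7 downto 0);
begin
  -- All elementary results are produced in parallel.
  r_add <= a + b;
  r_sub <= a - b;
  r_and <= a and b;
  r_xor <= a xor b;

  -- The multiplexer picks the one the instruction asked for;
  -- the other three results are simply discarded.
  with op select
    result <= r_add when "00",
              r_sub when "01",
              r_and when "10",
              r_xor when others;
end architecture rtl;
```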
The actual technology used to implement the computations in the ALU varies - it could be "real" logic gates, or it could be LUTs if the CPU is implemented inside a LUT-based FPGA (one very good way to understand the essentials of stored-program computing is to design a simple processor and build it in a logic simulator and perhaps then an FPGA).
A LUT is a memory (Look. Up. Table). It implements logic truth tables by using the memory address as the input bits, and the memory data output as the output bits.
By way of example, for a 4-bit up counter: at address 4'b0000 is stored 4'b0001; at address 4'b0001 is stored 4'b0010; ... ; at address 4'b1011 is stored 4'b1100; and so forth. The outputs are registered and fed back to the inputs, so on every clock the cycle repeats and the counter output increments.
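The counter described above can be sketched in VHDL as a small table plus a register. This is only an illustrative model: the entity and constant names are made up, the wrap from 4'b1111 back to 4'b0000 is assumed as the usual behaviour of a 4-bit up counter, and a real FPGA would typically split the table across several single-output LUTs rather than one 4-bit-wide memory.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical LUT-as-memory counter: the current count is the address,
-- and the stored data at that address is the next count.
entity lut_counter is
  port (
    clk : in  std_logic;
    q   : out std_logic_vector(3 downto 0)
  );
end entity lut_counter;

architecture rtl of lut_counter is
  type rom_t is array (0 to 15) of std_logic_vector(3 downto 0);
  -- Entry at address n holds n+1 (assumed to wrap to 0 after 15).
  constant NEXT_VALUE : rom_t := (
    "0001", "0010", "0011", "0100",
    "0101", "0110", "0111", "1000",
    "1001", "1010", "1011", "1100",
    "1101", "1110", "1111", "0000");
  signal state : std_logic_vector(3 downto 0) := "0000";
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Registered table output is fed back as the next address.
      state <= NEXT_VALUE(to_integer(unsigned(state)));
    end if;
  end process;

  q <= state;
end architecture rtl;
```

Changing only the table contents turns the same structure into any other 4-bit sequence, such as a down counter or a Gray-code counter.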