The control signals are generated by the Decode stage. If I remember correctly, they are then passed through the pipeline registers between stages, so they propagate down the pipeline along with the rest of the instruction's data.
I think the question is actually asking how many cycles it takes to generate the control signals. For single-cycle and pipelined implementations, it takes only one cycle to decode the instruction (in the pipelined case, the decoded control signals are then handed from stage to stage along with the instruction).
A multi-cycle processor, however, generates different control signals during each cycle of an instruction's execution.
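To make the contrast concrete, here is a minimal sketch of a multicycle control unit as a step-indexed table. The signal names, opcodes, and step counts are illustrative, not from any particular textbook machine; the point is only that the signals asserted depend on which cycle of the instruction we are in:

```python
def control_signals(opcode, step):
    """Return the control signals asserted on a given cycle (step)
    of an instruction. Names and steps are illustrative only."""
    # Steps 0-1 are common to all instructions (fetch, then decode).
    common = [{"mem_read": 1, "ir_write": 1},   # step 0: instruction fetch
              {"reg_read": 1}]                  # step 1: decode / register read
    # Later steps differ per opcode (e.g., lw needs a memory-access cycle).
    per_op = {
        "lw":  [{"alu_op": "add"},  # step 2: compute effective address
                {"mem_read": 1},    # step 3: read data memory
                {"reg_write": 1}],  # step 4: write back loaded value
        "add": [{"alu_op": "add"},  # step 2: ALU operation
                {"reg_write": 1}],  # step 3: write back result
    }
    return (common + per_op[opcode])[step]
```

Note that `lw` takes five steps while `add` takes four, which is exactly the "different instructions take different numbers of cycles" property of a multicycle machine.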
Instructions before the data cache miss (in program order) flow down the pipeline as normal. (An unusual exception would be a push-based pipeline, as used by some early VLIWs, in which subsequent operations pushed previous operations down the pipeline.)
For a cache miss on a store, the stored value can be placed into a buffer, allowing the store to complete despite the cache miss. (This is possible because the buffer does not require any data from memory; it is typically accomplished by keeping a valid bit for each storable unit [typically a byte].)
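A rough sketch of such a buffer entry, with one valid bit per byte (class and method names are my own invention, not from any real design): the store writes its bytes into the entry immediately, and when the missing cache line finally arrives, buffered bytes win and the remaining bytes come from memory.

```python
class StoreBufferEntry:
    """Illustrative store-buffer entry with a valid bit per byte."""

    def __init__(self, line_size=8):
        self.data = bytearray(line_size)      # bytes written by stores
        self.valid = [False] * line_size      # which bytes the buffer holds

    def write(self, offset, value_bytes):
        """A store completes by depositing its bytes here, miss or not."""
        for i, b in enumerate(value_bytes):
            self.data[offset + i] = b
            self.valid[offset + i] = True

    def merge_fill(self, memory_line):
        """When the miss returns, keep buffered bytes, fill the rest
        from the line fetched from memory."""
        return bytes(self.data[i] if self.valid[i] else memory_line[i]
                     for i in range(len(self.data)))
```

The store itself never waits on memory: only the eventual merge does, which is why the store can "complete" while the miss is still outstanding.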
Many in-order processors allow instructions after a load to execute and complete, even another load, as long as they are not data dependent on the load (nor, of course, on an instruction that is itself dependent on the load). This can be accomplished with a scoreboard that marks the availability of each register.
For an out-of-order processor, even instructions that follow an instruction dependent on the missing load can be fully executed, with their results held in rename registers (or, for stores to memory, in a store queue), but not committed.
Control flow instructions such as branches and indirect jumps are special in that all following instructions depend on their result, but prediction can often hide this dependency effectively. Value prediction for load misses has also been studied, but its benefit is relatively limited given the cost.
In theory it would sometimes also be possible to speculatively partially execute dependent instructions. E.g.:
lw   r3, [r5]       // load word
add  r3, r3, #50    // r3 = r3 + 50
slt  r6, r3, #1000  // (r3 < 1000) ? r6 = 1 : r6 = 0
bez  r6, LABEL      // if r6 == 0 goto LABEL
addi r3, r3, #10    // r3 = r3 + 10
LABEL:
Theoretically, hardware could speculate that the branch is not taken and combine the add of 50 with the add of 10, so that a single add of 60 is applied to the value when it becomes available. This kind of optimization has been proposed for (instruction) trace caches.
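The arithmetic of that speculation can be spelled out in a few lines. This is purely a sketch of the bookkeeping, assuming the branch is predicted not-taken so both adds are pending against the missing load (the value 900 is an arbitrary example fill):

```python
# Pending constant adds against the missing load's result:
# +50 from the add after the load, +10 from the add past the
# (predicted not-taken) branch.
pending_adds = [50, 10]

# Hardware folds the dependent chain into a single constant...
folded = sum(pending_adds)          # 60

# ...and applies it in one step when the miss data arrives.
loaded_value = 900                  # example value returned by memory
r3 = loaded_value + folded
```

If the branch prediction turns out wrong, the folded +10 must of course be squashed along with the rest of the misspeculated path.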
Some instructions may also be broken into component operations that are not dependent on the not-yet-available value, allowing partial execution of the instruction. E.g., division using a Newton-Raphson mechanism can generate the reciprocal of the divisor while the dividend is unavailable.
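For the curious, the Newton-Raphson reciprocal iteration mentioned here is x_{n+1} = x_n * (2 - d * x_n), which converges quadratically to 1/d. A small sketch (the seed value and iteration count are illustrative; real hardware seeds from a lookup table):

```python
def reciprocal(d, iterations=6):
    """Newton-Raphson reciprocal: x <- x * (2 - d*x) converges to 1/d.
    Requires a seed in (0, 2/d); hardware would use a table lookup."""
    x = 0.1  # illustrative seed (valid for d = 4 since 0 < 0.1 < 0.5)
    for _ in range(iterations):
        x = x * (2 - d * x)
    return x
```

All of this work depends only on the divisor, so it can proceed while the dividend is still being loaded; the final multiply by the dividend is the only step that has to wait.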
Best Answer
Each step of a multicycle machine should be shorter than the single step of a single-cycle machine. Even so, the total execution time, summed over all of a multicycle machine's cycles, may be longer. The only performance advantage of a non-pipelined multicycle machine is that instruction execution times do not have to be equal: in a single-cycle machine, every instruction takes as long as the worst-case instruction, whereas a multicycle machine can execute some instructions in fewer cycles than others. On average, however, you don't win much.
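A quick back-of-the-envelope comparison shows why. All the numbers below (step latency, per-instruction step counts, instruction mix) are made up for illustration:

```python
# Illustrative timings: one multicycle step vs. the single-cycle clock
# (which is set by the worst-case instruction).
stage_ns = 2.0
single_cycle_ns = 8.0

# Hypothetical steps per instruction class and dynamic instruction mix.
steps = {"lw": 5, "sw": 4, "R-type": 4, "beq": 3}
mix   = {"lw": 0.25, "sw": 0.10, "R-type": 0.50, "beq": 0.15}

# Average time per instruction on the multicycle machine:
# 2.0 * (0.25*5 + 0.10*4 + 0.50*4 + 0.15*3) = 2.0 * 4.10 = 8.2 ns
multi_avg = sum(mix[i] * steps[i] * stage_ns for i in steps)
```

With these (invented) numbers the multicycle machine averages 8.2 ns per instruction versus 8.0 ns for the single-cycle machine: the shorter steps are eaten up by the extra cycles, which is exactly the "you don't win much" point above.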
So you may wonder: why bother with multicycle machines at all?
The answer is: in teaching (and learning) computer architecture, multicycle machines are introduced as a preparation for pipelined multicycle machines, which do bring the performance improvement you expect.
Those non-pipelined multicycle machines are mainly an instrument of teaching. Nobody would build one and try to sell it, as they are more complex and in most cases not much more performant than single-cycle machines. They do, however, help in understanding pipelined machines.