I'm trying to solve two questions about a 5-stage RISC pipeline that is not exactly like MIPS (everything needed is included in this post).
Consider the non-pipelined implementation of a simple processor that executes only
ALU instructions in the figure. The simple microprocessor has to perform several tasks. First, it computes the address of the next instruction to fetch by incrementing the PC. Second, it uses the PC to access the I-cache. Then the instruction is decoded. The instruction decoder itself is divided into smaller tasks. First, it has to decode the instruction type. Once the opcode is decoded, it has to decode which functional units are needed to execute the instruction. Concurrently, it also decodes which source registers or immediate operands are used by the instruction and which destination register is written to. Once the decode process is complete, the register file is accessed (the immediate data are taken from the instruction itself) to get the source data. Then the appropriate ALU function is activated to compute the results, which are then written back to the destination register. Note that the delay of every block is shown in the figure. For instance, it takes 6 ns to access the I-cache, 4 ns to access the register file, etc.
a. Generate a 5-stage (IF, ID1, ID2, EX, WB) pipelined implementation of the processor that balances each pipeline stage, ignoring all data hazards. Each sub-block in the diagram is a primitive unit that cannot be further partitioned into smaller ones. The original functionality must be maintained in the pipelined implementation. In other words, there should be no difference to a programmer writing code whether this machine is pipelined or otherwise. Show the diagram of your pipelined implementation.
c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?
I'm thrown off by ID1 and ID2, as MIPS usually has Fetch, Decode, Execute, Memory, Write-back.
But here's my try
a) diagram (five instructions, one stage offset per instruction):

I1: F  D1 D2 E  WB
I2:    F  D1 D2 E  WB
I3:       F  D1 D2 E  WB
I4:          F  D1 D2 E  WB
I5:             F  D1 D2 E  WB
This would mean sequential progression, so:
1) compute the address of the next instruction to fetch by incrementing the PC: 2 + 1 = 3 ns
2) use the PC to access the I-cache: 6 ns
3) instruction-type decoder: 3.5 ns
4) function decoder: 3 ns
5) source, immediate, and destination decoders: 2.5 + 3.5 + 4 = 10 ns
6) register file read: 4 ns
7) ALU: 6 ns
8) register file write: 4 ns
Total = 39.5 ns
Well, I'm really not sure. This CPU description is incomplete; for example, there is no branch handling or data-memory access.
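The fully sequential total in the attempt above can be checked mechanically. A small sketch, using the per-block delays quoted from the figure in the question:

```python
# Per-block delays in ns, as quoted in the question (from the figure).
delays = {
    "next-PC adder + mux":   2 + 1,          # 3 ns
    "I-cache access":        6,
    "type decoder":          3.5,
    "function decoder":      3,
    "src/imm/dst decoders":  2.5 + 3.5 + 4,  # 10 ns if run one after another
    "register file read":    4,
    "ALU":                   6,
    "register file write":   4,
}

# Strictly sequential execution: every block waits for the previous one.
total = sum(delays.values())
print(total)  # 39.5
```

This reproduces the 39.5 ns figure, but it assumes no block ever runs in parallel with another, which is the point the reply below disputes.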
For the non-pipelined version, there is only one visible clocked register: the program counter. The next PC address can be computed while the ALU is operating, so the propagation time is 6 + 3.5 + 4 + 4 + 1 + 6 = 24.5 ns. All the decoders (source, operand, ...) work in parallel, so only the longest delay contributes to the timing "critical path". There is no clear indication of the delay needed for writing into the register file; maybe 4 ns more.
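As a sketch of that critical-path argument: the parallel decoders contribute only their slowest member, and the next-PC increment is overlapped with the ALU, so it drops out of the path entirely.

```python
# Non-pipelined critical path, under the assumptions above: decoders run
# in parallel (only the slowest, 4 ns, counts) and the next-PC increment
# overlaps with the ALU operation.
path = [
    6,    # I-cache access
    3.5,  # instruction-type decoder
    4,    # slowest parallel decoder
    4,    # register file read
    1,    # operand mux
    6,    # ALU
]

critical = sum(path)
print(critical)  # 24.5
```

That gives 24.5 ns, plus whatever the (unspecified) register-file write time is.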
For the pipelined version :
F : I-cache : 6 ns (in parallel with the program counter update)
ID1 : Instruction decoder : 3.5 ns
ID2 : Destination decoder : 4 ns (in parallel with the other decoders, which are faster) + register file read : 4 ns
EX : ALU + MUX : 7 ns
WB : register update : ??? ns
The maximum stage delay is around 8 ns (ID2), which sets the pipelined cycle time.
Alternatively, put all decoding in ID1 (so 3.5 + 4 = 7.5 ns) and the register file access in ID2 (4 ns). Traditionally, the ALU is part of the EXECUTE stage.
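The cycle time of the pipelined version is just the maximum stage delay. A sketch of the first split proposed above; note the 4 ns write-back time is an assumption, since the figure gives no value for it:

```python
# Stage delays in ns for the proposed 5-stage split.
stages = {
    "IF":  6,       # I-cache (next-PC update happens in parallel)
    "ID1": 3.5,     # instruction-type decoder
    "ID2": 4 + 4,   # slowest parallel decoder (destination) + register read
    "EX":  6 + 1,   # ALU + result mux
    "WB":  4,       # register write -- assumed, not given in the figure
}

# The clock must accommodate the slowest stage.
cycle = max(stages.values())
print(cycle)  # 8
```

So this split yields an 8 ns cycle, limited by ID2; moving the register read into its own stage would rebalance it, at the cost of an extra pipeline stage.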
Anyway, I think that this exercise is really poorly written.