The only way I thought to counteract this is to AND the flip flop's clock input with a clock that is faster than the main clock... that way data will be guaranteed to be clocked in at the end of that cycle.
This sounds to me like an architecture choice that will eventually limit the performance (maximum clock speed) you can achieve with your design. If your registers are able to function at the faster clock speed, you'll eventually want to try to get the whole system running as close to that clock speed as you can, but then you won't be able to have a "slow" clock and a "fast" clock to do this with.
In order to do this, I'm fetching data from memory, placing it on the data bus, then clocking it into a register all in a single operation. I'm worried that the rising edge of the main clock will happen at the register before the data is fetched from memory.... a sort of propagation delay / race condition.
First solution
One way that leaps to mind to solve this is to clock data out of the memory on the rising edge of the clock, and clock it into the register on the falling edge. Since your register doesn't have a configuration bit for which edge it responds to (like it would if you were designing in an FPGA), you would have to generate the appropriate signal by placing an inverter (NOT gate) between the "main" clock signal and the register's clock input.
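To make the timing budget of this half-cycle scheme concrete, here is a minimal Python sketch. It assumes a 50% duty cycle by default, ignores hold time, and treats the inverter delay as simply pushing the capture edge later. The function name, parameters, and the numbers in the usage line are all hypothetical, not taken from any datasheet.

```python
def falling_edge_capture_ok(period_ns, mem_access_ns, setup_ns,
                            inverter_ns=0.0, duty=0.5):
    """Check setup timing when data is launched on the rising edge
    and captured on the falling edge (via an inverted clock).

    All arguments are illustrative assumptions:
      period_ns     - main clock period
      mem_access_ns - delay from rising edge until memory data is valid
      setup_ns      - register setup time
      inverter_ns   - delay of the NOT gate in the register's clock path
      duty          - fraction of the period the clock is high
    """
    # The register sees its capture edge at the falling edge of the
    # main clock, delayed by the inverter.
    capture_time_ns = duty * period_ns + inverter_ns
    # Data must be valid at least setup_ns before that edge.
    slack_ns = capture_time_ns - mem_access_ns - setup_ns
    return slack_ns >= 0.0


# Hypothetical numbers: 100 ns clock, 40 ns memory access, 5 ns setup.
print(falling_edge_capture_ok(100.0, 40.0, 5.0))
```

If the check fails, either the clock has to slow down or the memory access has to fit in less than half a period, which is the performance cost of this approach.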
More generally, it's possible to distribute several phases of your clock (e.g., 0, 90, 180, and 270 degrees) instead of just clock and inverted clock, and use these different phases to execute different actions at different times. Of course, you have to do a fairly careful analysis of each interface where data is transferred from one phase to another to be sure setup and hold times are met.
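As a rough illustration of that per-interface analysis, the sketch below computes how much time data has to travel from a register clocked on one phase to a register clocked on another. It assumes ideal phases with no skew or jitter; the function name and the example values are hypothetical.

```python
def phase_to_phase_window_ns(period_ns, launch_deg, capture_deg):
    """Time from the active edge of the launch phase to the next
    active edge of the capture phase, for ideal (skew-free) clocks."""
    delta_deg = (capture_deg - launch_deg) % 360
    if delta_deg == 0:
        # Same phase: the next capture edge is a full period away.
        delta_deg = 360
    return delta_deg / 360 * period_ns


# Hypothetical 100 ns clock: data launched on the 0 degree phase and
# captured on the 90 degree phase has a quarter period to settle.
print(phase_to_phase_window_ns(100.0, 0, 90))
```

The path delay plus the capturing register's setup time has to fit inside this window, and the shortest path has to be long enough to satisfy hold time, for every pair of phases that exchange data.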
To the best of my understanding (possibly out-of-date) multiphase clock designs were fairly common in the discrete logic design era, and were also common (and may still be common) in ASICs and custom chip designs. But they are fairly uncommon in FPGA design due to the complexity of the timing analysis.
Second solution
Another option is to create a controller state machine that enables and disables different elements on different clock cycles as needed. For example, you'd enable the memory output on cycle 1 and enable the register to latch in the data on cycle 2. Since your register apparently doesn't have a clock enable input, you might need to do this by ANDing a state machine output with the clock input to the register.
This type of design was fairly common in the era of discrete logic CPUs, and it's what was taught in undergraduate digital logic courses in the early '90s. An elaborate version of this scheme is called a microcoded architecture.
Of course this architecture means that you need more than one clock cycle to complete each instruction. But it would be multiple cycles of your fast clock, not your original "slow" clock that would be used, and you are already using more than one cycle of the fast clock per instruction in your design.
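The two-cycle controller described in this second solution can be sketched as a tiny state machine. This is a behavioural Python model, not hardware description code; the state names and signal names are hypothetical. It also assumes the memory keeps driving the bus during the latch cycle so the data is still valid when the register captures it.

```python
from enum import Enum


class State(Enum):
    FETCH = 1  # cycle 1: memory output enabled, register clock gated off
    LATCH = 2  # cycle 2: register clock gated on, data is captured


def controller_step(state):
    """Return (mem_output_enable, reg_clock_enable, next_state).

    reg_clock_enable is the signal that would be ANDed with the clock
    feeding the register (an assumption about how the gating is wired).
    """
    if state is State.FETCH:
        return True, False, State.LATCH
    # LATCH: memory still drives the bus (assumption) while the
    # register is allowed to clock the data in.
    return True, True, State.FETCH


def run(cycles, state=State.FETCH):
    """Collect (state name, mem_output_enable, reg_clock_enable) per cycle."""
    trace = []
    for _ in range(cycles):
        mem_oe, reg_en, next_state = controller_step(state)
        trace.append((state.name, mem_oe, reg_en))
        state = next_state
    return trace


for row in run(4):
    print(row)
```

A microcoded design extends the same idea: instead of two hard-wired states, a table of control words decides which enables are active on each cycle.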
You mention Altera so I assume this is an FPGA design.
If this really is the ideal case (you're ignoring the wire delays due to physical location of the elements) then yes, ideal tCOmin is 3ns. This is the best case (shortest delay, fastest clock) under ideal conditions.
However, post-synthesis simulation will reveal a more meaningful result once the circuit elements have been mapped to specific flip-flops at specific locations on the FPGA. This constrains the available wiring routes, which is where the wiring propagation delay is determined. For a large FPGA chip with high utilization, it becomes more likely that placement is not ideal. Since the worst (longest) tCOmin total delay is what constrains how fast the system can be clocked, the synthesis tools analyze the whole FPGA system and report on whichever circuit has the worst tCOmin, so that the designer can focus their efforts to make a large system perform as required.
Best Answer
The propagation delay will be 2ns because that is the time between the changing input (clock) and the outputs (Q0-Q3). The propagation delay of the AND gates is not relevant because their outputs should be stable when the flip-flops are clocked. However, they may limit the maximum frequency at which the counter can be clocked, because their outputs need to be correct when the clock edge occurs.
The JK input to the fourth flip flop (FF3) is determined by the states of Q0-Q2. These outputs are stable 2ns after the clock, but then pass through up to two AND gates. If the next clock occurs too early then FF3's JK input will not have had enough time to stabilize and the counter will malfunction.
The total delay between clock input and FF3 JK input is up to 2+3+3 = 8ns. Therefore the minimum acceptable time between clocks is 8ns + JK setup time. So long as this timing is met the counter should work correctly, and the propagation delay will be 2ns because Q outputs only change in response to clock inputs.
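The arithmetic above can be written out as a short calculation. The 2ns clock-to-Q delay and the two 3ns AND gate delays come from the discussion; the JK setup time is not given, so the 1ns value below is only a placeholder assumption, and the function name is hypothetical.

```python
def min_clock_period_ns(clk_to_q_ns, gate_delays_ns, setup_ns):
    """Minimum time between clock edges: clock-to-Q delay, plus the
    combinational gates on the longest path, plus the setup time of
    the receiving flip-flop."""
    return clk_to_q_ns + sum(gate_delays_ns) + setup_ns


# Path to FF3's JK input: 2 ns clock-to-Q, then up to two 3 ns AND gates.
path_delay_ns = min_clock_period_ns(2.0, [3.0, 3.0], setup_ns=0.0)

# ASSUMPTION: 1 ns JK setup time, chosen only to complete the example.
period_ns = min_clock_period_ns(2.0, [3.0, 3.0], setup_ns=1.0)
max_freq_mhz = 1000.0 / period_ns

print(path_delay_ns, period_ns, max_freq_mhz)
```

Whatever the real setup time is, the observable propagation delay of the counter stays at 2ns; the gate delays only affect how fast it can be clocked.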