EDIT: This question led to a long discussion. It is crucial to understand that the fact that CPU speeds haven't increased over the last years is driven by commercial considerations, not directly by any engineering or physical limitation. You can check this link for the highest frequencies achieved with existing CPUs through overclocking and supercooling.
From the introduction of the first PCs until the early 2000s, the headline parameter of each CPU was its frequency (maximum frequency of operation). Manufacturers tried to come up with new technologies that would allow higher frequencies, and chip designers worked very hard to develop micro-architectures that would let the chip run at a higher frequency.
However, as chips became smaller and faster, the problem of heat dissipation arose: when the heat generated by the switching transistors couldn't all be dissipated, the chips got damaged. Engineers started attaching heat sinks to processors, then fans, but eventually they concluded that raising CPU frequency was no longer practical in terms of added performance per added cost.
In other words: CPU frequencies can be raised, but this makes CPUs (in fact, not the CPUs but the cooling mechanisms) too expensive. Consumers won't buy expensive computers if there is an alternative.
In general, current process technologies allow operation at very high frequencies (well above the ~3GHz Intel usually uses; even AMD's 5GHz is not the ceiling). However, the cost of the cooling devices required at these high frequencies is too high.
I'd like to emphasize this: there is no physical effect that prevents development of 8-10GHz processors with current technology. However, you'll have to provide a very expensive cooling mechanism in order to prevent such a processor from burning out.
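The cooling argument can be sketched with the classic CMOS dynamic-power relation \$P = \alpha C V^2 f\$: higher frequency usually also needs higher supply voltage, so power grows faster than linearly with clock speed. All numbers below are made-up, illustrative values, not measurements of any real chip:

```python
def dynamic_power(alpha, c_farads, v_volts, f_hz):
    """Classic CMOS dynamic (switching) power estimate: P = alpha * C * V^2 * f."""
    return alpha * c_farads * v_volts**2 * f_hz

# Assumed, illustrative parameters for a desktop-class chip:
ALPHA = 0.2      # activity factor (fraction of transistors switching per cycle)
C = 100e-9       # effective switched capacitance, farads

p_3ghz = dynamic_power(ALPHA, C, 1.2, 3e9)  # ~3 GHz at 1.2 V
p_8ghz = dynamic_power(ALPHA, C, 1.6, 8e9)  # ~8 GHz, assuming it needs 1.6 V

print(f"3 GHz: {p_3ghz:.0f} W, 8 GHz: {p_8ghz:.0f} W, "
      f"ratio: {p_8ghz / p_3ghz:.1f}x")
```

With these assumed figures, less than 3x the clock costs almost 5x the power to dissipate, which is exactly why the cooling, not the silicon, becomes the expensive part.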
Moreover, processors usually work in bursts: they have very long idle periods, followed by short but very intensive (and therefore power-hungry) periods. Engineers could build a 10GHz processor that runs at its top frequency only for short bursts (no additional cooling would be required because the bursts are short), but this approach was also rejected as not worthwhile (high development investment compared to questionable gains). However, following future micro-architectural improvements, this approach may be reconsidered. I believe that this 5GHz AMD processor does not run constantly at 5GHz, but raises its internal clock to the maximum only during short bursts.
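A quick duty-cycle calculation shows why such a burst mode could stay within an ordinary cooler's thermal budget. The figures below are assumptions for illustration, not specs of any real processor:

```python
# Average power under bursty operation; all figures are assumed,
# illustrative values, not specs of any real processor.
burst_power = 400.0   # W while sprinting at a hypothetical 10 GHz
idle_power = 10.0     # W while idle
duty_cycle = 0.05     # active 5% of the time

avg_power = duty_cycle * burst_power + (1 - duty_cycle) * idle_power
print(f"average: {avg_power} W")  # far below the 400 W peak
```

The cooler only has to track the average (plus some thermal mass for the peaks), which is roughly what modern boost clocks exploit.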
PHYSICAL LIMIT:
There is a physical limit to the maximum achievable clock rate for each process technology (it depends on the technology's minimum feature size), but I think the last Intel processor that was really pushed to this limit was the Pentium 4. This means that today, as the technology advances and the minimum feature size shrinks (still in accordance with Moore's law), the only benefit taken from this reduction is that more logic fits into the same area (engineers no longer push CPU frequency to the limits of the technology).
BTW, the above limit can't increase forever. Read about Moore's law and the problems associated with its continued application.
Don't forget that the ARM processor runs at a much faster clock than the programmable logic: somewhere between 666MHz and 1GHz, while your logic runs at 100MHz. 100MHz seems pretty slow; you can probably ramp it up to 150-200MHz. Multiplying two matrices requires more operations, more data dependencies, more memory accesses, etc. In those cases it's easier to take advantage of the FPGA's parallelism; multiplying by a constant is simply not complex enough. That said, you should get better results.
The 1343 cycles to transfer 4096 bytes seems a little slow, but not too far off if your design is under stress. You would get better rates if you used a 64-bit AXI bus (I'm guessing you used 32 bits) and configured the AXI DMA to use a larger burst length.
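The beat counts behind that estimate, assuming one data beat per clock cycle once a burst is streaming (the bus widths are the assumption to check against your design):

```python
# Minimum data beats to move 4096 bytes over an AXI stream,
# assuming one beat per clock cycle within a burst.
nbytes = 4096
beats_32bit = nbytes // 4   # 32-bit (4-byte) bus
beats_64bit = nbytes // 8   # 64-bit (8-byte) bus

measured = 1343
overhead = measured - beats_32bit  # cycles lost to burst setup/handshaking
print(beats_32bit, beats_64bit, overhead)
```

On a 32-bit bus the hard floor is 1024 beats, so 1343 measured cycles is about 30% overhead; a 64-bit bus halves the floor to 512 beats.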
What worries me in your results is the 3654 cycles it took to perform the matrix-constant multiplication. I would expect something closer to the 1343 cycles the DMA transfer took, which is what you would get if you pipelined your operations properly. It seems you transfer the data from RAM to your IP, then multiply the matrix, then transfer from your IP back to RAM, taking around 1024 cycles for each operation.
It should all be done at the same time: transfer from RAM to the IP, multiply the incoming data (without storing it), and send it out the S2MM port. In that case, it would take 1024 cycles plus the latency through the cores.
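A rough cycle budget for the two schemes; the core latency below is an assumed, illustrative placeholder, not a figure from your design:

```python
# Sequential vs fully pipelined cycle estimates for the matrix-constant
# multiply; the core latency is an assumed, illustrative value.
beats = 4096 // 4       # one 32-bit beat per cycle
core_latency = 8        # assumed pipeline latency through the multiplier core

sequential = 3 * beats            # load, multiply, store as separate passes
pipelined = beats + core_latency  # multiply on the fly between MM2S and S2MM
print(sequential, pipelined)
```

The sequential estimate (~3072 cycles) lands close to your measured 3654, while the pipelined version collapses to roughly one transfer's worth of cycles.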
Since you have synchronous registers between the combinatorial blocks, the minimum clock period is set by the slowest individual block, not by the sum of all blocks.
The S block processes the data the T block generated during the previous clock period, while the T block works on the next data item.
So you can increase \$f_{max}\$ by shrinking your combinatorial blocks and putting registers between them, but the results will then arrive on a later clock cycle.
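As a numeric sketch of that rule (the delays are assumed values, not from any real timing report):

```python
# With a register after each combinatorial block, the minimum clock period
# is set by the slowest single block, not the sum of all of them.
# Delays below are assumed, illustrative values in nanoseconds.
block_delays_ns = [4.0, 7.5, 3.2]   # e.g. the T block, the S block, one more stage
t_setup_ns = 0.5                    # register setup time (assumed)

t_clk_min_ns = max(block_delays_ns) + t_setup_ns   # slowest block dominates
f_max_mhz = 1000.0 / t_clk_min_ns                  # ns period -> MHz
print(f"f_max = {f_max_mhz:.0f} MHz")
```

Splitting the 7.5ns block into two registered 3.75ns halves would roughly double \$f_{max}\$, at the cost of one extra cycle of latency.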
It is quite possible that the synthesis identified the multipliers in your design and mapped them to dedicated multiplier blocks, reducing the settling time for S significantly.
It is also possible that you have an error in your design that allows the compiler to optimize functionality away: by not routing output signals to pins, for example, you allow the compiler to remove the entire design, since it has no externally visible effects.