Zedboard clock cycles analysis – Valuable Tech Notes

Based on the example here, I tried a very similar one: instead of multiplying two matrices, I just multiply every element of a matrix by 2.0.

However, when I compare multiplying a 32×32 matrix by 2.0 on the ARM (compiled with -O3) against the same operation in hardware (i.e. the FPGA side), the ARM takes 1425 clock cycles whereas the FPGA takes 3654 clock cycles. So the FPGA is roughly 2.6 times slower (acceleration factor = 0.389).

See this to check the acceleration factors that I'm talking about in the matrix multiplication example.

I already tried changing the port that connects the ARM and the AXI DMA block to HP instead of ACP and the results are the same.

I'm also using the AXI DMA to transfer data from and to the DDR. I measured the MM2S (Memory-Mapped to Stream) transfer at 1343 clock cycles for 4096 bytes, which gives a transfer speed of 290.8 MBytes/s. The S2MM (Stream to Memory-Mapped) transfer in turn runs at 167.2 MBytes/s, because it transferred 4096 bytes in 2336 clock cycles.
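As a quick cross-check of the figures above, the conversion from cycle counts to throughput can be sketched as follows. This is only a minimal sketch: it assumes the 100 MHz clock stated in the note below, and the function name `throughput` is made up for illustration. It suggests that the quoted 290.8 and 167.2 figures count a megabyte as 2^20 bytes; in decimal megabytes the same transfers come out at roughly 305 and 175 MB/s.

```python
CLOCK_HZ = 100_000_000  # assumed: 100 MHz fabric clock, i.e. 10 ns per cycle


def throughput(num_bytes, cycles, clock_hz=CLOCK_HZ):
    """Return (MB/s, MiB/s) for num_bytes moved in `cycles` clock cycles."""
    seconds = cycles / clock_hz
    bytes_per_second = num_bytes / seconds
    # Decimal megabytes (10^6 bytes) and binary megabytes (2^20 bytes).
    return bytes_per_second / 1e6, bytes_per_second / 2**20


# Cycle counts taken from the measurements quoted in the question.
for name, cycles in (("MM2S", 1343), ("S2MM", 2336)):
    mb, mib = throughput(4096, cycles)
    print(f"{name}: {mb:.1f} MB/s ({mib:.1f} MiB/s)")
```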

I have several questions that I hope you can help with:

Why is my FPGA design slower than the ARM when multiplying a matrix by 2.0, but not when multiplying two matrices?
Do these AXI DMA transfer rates look okay to you? Compared to Sadri's video, it seems that I should be able to transfer much faster. What can I do to improve these transfer speeds?
I saw somewhere that S2MM transfers are expected to be slower than MM2S transfers on the Zedboard. Can you tell me why, and whether a difference this big makes sense?
I measured the time my PC takes to multiply a 32×32 matrix by 2.0, and it is 3.84×10⁻⁶ seconds. Knowing that the same multiplication takes 1.42×10⁻⁵ seconds on the ARM and 3.85×10⁻⁵ seconds on the FPGA, the PC's CPU is almost 4 times faster than the ARM and almost 10 times faster than the FPGA. If my objective was to design an FPGA model that accelerates software, why am I so far off when I'm following an example?

Note: My clock frequency is 100 MHz, so each clock cycle is 10 ns.

Best Answer

Don't forget that the ARM processor runs at a much higher clock speed than the programmable logic: somewhere between 666 MHz and 1 GHz, while your logic runs at 100 MHz. 100 MHz seems pretty slow; you can probably ramp it up to 150-200 MHz. Multiplying two matrices requires more operations, more data dependencies, more memory accesses, etc. In that case it's easier to take advantage of the FPGA's parallelism; multiplying by a constant is simply not complex enough. That said, you should be getting better results.

The 1343 cycles to transfer 4096 bytes seem a little slow, but not too far off if your design is under stress. You would get better rates if you used a 64-bit AXI interface (I'm guessing you used 32 bits) and configured the AXI DMA to use a larger burst length.
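To see why bus width and burst length matter, here is a rough back-of-the-envelope sketch. It only counts the ideal number of data beats and bursts for a 4096-byte transfer; it does not model the per-burst overhead, which was not measured here. The function names and the burst lengths of 16 and 256 are illustrative assumptions, not values taken from the design in question.

```python
def data_beats(num_bytes, bus_width_bits):
    """Minimum number of data beats to move num_bytes over the bus."""
    bytes_per_beat = bus_width_bits // 8
    return -(-num_bytes // bytes_per_beat)  # ceiling division


def burst_count(num_bytes, bus_width_bits, burst_len):
    """Number of bursts (address transactions) for the same transfer."""
    return -(-data_beats(num_bytes, bus_width_bits) // burst_len)


# Illustrative configurations: (bus width in bits, beats per burst).
for width, burst_len in ((32, 16), (64, 16), (64, 256)):
    beats = data_beats(4096, width)
    bursts = burst_count(4096, width, burst_len)
    print(f"{width}-bit bus, burst length {burst_len}: "
          f"{beats} beats in {bursts} bursts")
```

Going from 32 to 64 bits halves the number of beats, and a longer burst means fewer address transactions, so any fixed cost paid per burst is paid less often.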

The thing that worries me in your results is the 3654 cycles it took to perform the matrix-constant multiplication. I would expect something closer to the 1343 cycles the DMA transfer took, which is what you would get if you pipelined your operations properly. It seems you transfer the data from RAM to your IP, then multiply the matrix, then transfer from your IP back to RAM, taking around 1024 cycles for each step.

It should all be done at the same time: transfer from RAM to the IP, multiply the incoming data (without storing it), and send the results straight out through the S2MM port. In that case, it would take 1024 cycles plus the latency through the cores.
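The difference between the two schedules can be sketched with a simple cycle-count model. It assumes one 32-bit word moves per clock cycle and lumps everything else into a `latency` parameter whose value is a hypothetical placeholder. The generator `scale_stream` is only a software analogy for a core that multiplies each word as it arrives, without buffering the whole matrix; it is not the actual IP.

```python
N_WORDS = 32 * 32  # 1024 elements, i.e. 4096 bytes of 4-byte values


def sequential_cycles(n_words, latency=0):
    """Read everything, then multiply everything, then write everything."""
    return 3 * n_words + latency


def streamed_cycles(n_words, latency=0):
    """Read, multiply and write overlap, one word per cycle."""
    return n_words + latency


def scale_stream(words, factor=2.0):
    """Multiply each incoming word as it arrives; nothing is stored."""
    for word in words:
        yield word * factor


print("sequential:", sequential_cycles(N_WORDS), "cycles + latency")
print("streamed:  ", streamed_cycles(N_WORDS), "cycles + latency")
print(list(scale_stream([1.0, 2.5, -4.0])))
```

Under these assumptions the sequential schedule costs about three passes over the data, which is in the same range as the 3654 cycles measured, while the streamed schedule costs about one pass.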
