Do NVIDIA Tegra SoCs and AMD Fusion APUs bypass the GPU<->CPU memory transfer bottleneck?

gpu

The major limitation in heterogeneous CPU+GPU programming seems to be the slow memory transfer across the PCIe bus when data must be passed back and forth between the device and the host. I have read that AMD's Fusion APUs aim to solve this problem. Does the Fusion APU attempt to solve it by having the GPU and CPU share a common physical memory region? And what I am really wondering is whether the Tegra K1 (or X1) also attempts to solve this problem by having the CPU and GPU share a common physical memory region, so that no cudaMemcpy is needed.
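For reference, this is roughly what skipping the copy looks like in CUDA using zero-copy mapped host memory — a minimal sketch, not specific to any one board. On an integrated SoC like the Tegra K1/X1, the mapped pointer refers to the same physical DRAM the GPU uses, so the kernel operates in place; on a discrete GPU the same API works, but each access crosses PCIe. (cudaMallocManaged behaves similarly on these parts.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;

    // Required on older CUDA versions (e.g. the Tegra K1 era) before
    // allocating mapped host memory.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *buf = nullptr;
    // Pinned host memory, mapped into the GPU's address space.
    cudaHostAlloc((void **)&buf, n * sizeof(float), cudaHostAllocMapped);

    for (int i = 0; i < n; ++i) buf[i] = 1.0f;        // CPU writes

    float *dbuf = nullptr;
    cudaHostGetDevicePointer((void **)&dbuf, buf, 0);  // GPU alias of the same allocation

    // Kernel reads and writes the buffer in place -- no cudaMemcpy
    // in either direction.
    scale<<<(n + 255) / 256, 256>>>(dbuf, 2.0f, n);
    cudaDeviceSynchronize();

    printf("buf[0] = %f\n", buf[0]);                   // prints 2.000000
    cudaFreeHost(buf);
    return 0;
}
```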

Best Answer

Well, I know that Intel somehow solved this problem by placing the GPU into the same NUMA switching complex as the CPU cores. This is implied by the Intel Xeon E3-12x0v3/12x1v3 specs: for the Xeon E3-1200 parts that have i7-compatible sockets but no GPU, the memory bandwidth that the GPU consumes on the i7s and on the E3-12x5v3/12x6v3 parts is made available to the CPU cores instead.

This does suggest that the shared last-level (L3) cache on Intel processors is available to both the CPU cores and the built-in GPU cores. So as long as the data fits in cache, no main-memory access is needed.