How does a CPU load multiple bytes at once if memory is byte addressed

assembly, cpu, x86

I've been reading about CPUs and how they are implemented, and some big complex architectures (looking at you x86) have instructions that load from memory during one clock cycle. Since one address points to a single byte, how is it possible that I can write:

mov eax, DWORD PTR ds:[esi]

where I'm loading a double word (4 bytes!) from memory and chucking it into eax. How does this work with only one clock cycle? Wouldn't it have to access 4 addresses? The DWORD starts at ds:[esi] and ends at ds:[esi + 3], meaning it has to compute 4 effective addresses, yet it does it in one cycle.

How?

Thanks

Best Answer

Because the width of the data bus and the size of the smallest addressable unit are two separate things.

Just because you can specify addresses at the byte level, does not mean you have to have an 8 bit data bus. Most (possibly all) modern x86 processors use a 64 bit data bus and every time they read from memory, they read 64 bits. If you only requested 8 bits, the excess is simply discarded.
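To make the "read 64 bits, discard the excess" idea concrete, here is a minimal sketch in Python. It is only a toy model, not how any particular CPU is wired: the names `BUS_BYTES`, `bus_read` and `load` are made up for illustration, and the 8-byte bus width is simply the assumption stated above.

```python
BUS_BYTES = 8  # assumed 64-bit data bus, as described in the answer


def bus_read(memory: bytes, addr: int) -> bytes:
    """Return the whole aligned 8-byte word that contains addr."""
    base = addr - (addr % BUS_BYTES)
    return memory[base:base + BUS_BYTES]


def load(memory: bytes, addr: int, size: int) -> int:
    """Load `size` bytes at `addr` from a single bus word (little-endian)."""
    offset = addr % BUS_BYTES
    if offset + size > BUS_BYTES:
        raise ValueError("access crosses a bus-word boundary")
    word = bus_read(memory, addr)        # one transfer brings in all 8 bytes
    wanted = word[offset:offset + size]  # the other bytes are simply discarded
    return int.from_bytes(wanted, "little")


memory = bytes(range(16))       # toy memory: byte value equals its address
print(hex(load(memory, 4, 4)))  # dword at address 4 -> 0x7060504
print(hex(load(memory, 4, 1)))  # single byte at address 4 -> 0x4
```

Both loads cost exactly one `bus_read`; the only difference is how many of the eight returned bytes are kept.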

If you request more than 64 bits (for example, if loading 128 bit SSE registers), then there will be multiple memory accesses.

Many processors also have a concept of alignment, which basically means that every memory access is on an address divisible by the data bus width. Most can still fetch unaligned memory, but if it crosses an alignment boundary (for example, requesting 32 bits at address 0xFE on a 64 bit aligned system, which spans 0xFE through 0x101), you'll get multiple memory accesses, even if it would otherwise fit in the data bus.
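As a rough illustration of the alignment rule, the number of bus transfers can be estimated by counting how many aligned bus-width words the requested bytes touch. This is a simplified model under the 64 bit bus assumption; `bus_accesses` is a hypothetical helper, and real hardware (caches, wider internal paths) is more involved.

```python
def bus_accesses(addr: int, size: int, bus_bytes: int = 8) -> int:
    """Count the aligned bus words touched by `size` bytes starting at `addr`."""
    first_word = addr // bus_bytes
    last_word = (addr + size - 1) // bus_bytes
    return last_word - first_word + 1


print(bus_accesses(0xF8, 4))    # aligned dword, one access -> 1
print(bus_accesses(0xFE, 4))    # 0xFE..0x101 crosses the 0x100 boundary -> 2
print(bus_accesses(0x100, 16))  # 128 bits on a 64 bit bus -> 2
```

The second call shows an access that would fit in the bus width but still needs two transfers because it straddles a boundary; the third shows the "more than 64 bits" case.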

Here are a few other notes regarding some aspects of your question:

A single memory access takes longer than one CPU clock cycle. Much, MUCH longer if it's not in L1 cache. See this post for rough orders of magnitude, and keep in mind that 1 nanosecond = 1 clock cycle at 1 GHz. Many desktop and laptop CPUs these days can run upwards of 3 GHz, or less than 0.333... nanoseconds per cycle.
One clock cycle does not equal one instruction. Instructions (even those that stay entirely within the CPU, not accessing any memory or peripherals) can take multiple cycles to complete. Additionally, multiple instructions can be executing at the same time (and I'm not referring to multiple cores or hyperthreading here, I mean multiple instructions simultaneously executing on a single core, without hyperthreading).
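The nanoseconds-versus-cycles arithmetic in the first note can be checked with a couple of one-liners. The 100 ns latency below is an illustrative assumption, not a measured figure from the linked post, and the function names are made up for this sketch.

```python
def ns_per_cycle(freq_ghz: float) -> float:
    """Duration of one clock cycle in nanoseconds."""
    return 1.0 / freq_ghz


def cycles_for(latency_ns: float, freq_ghz: float) -> float:
    """How many clock cycles elapse during a given latency."""
    return latency_ns * freq_ghz


ASSUMED_LATENCY_NS = 100.0  # hypothetical main-memory latency, for illustration

print(ns_per_cycle(1.0))                    # 1.0 ns per cycle at 1 GHz
print(round(ns_per_cycle(3.0), 3))          # 0.333 ns per cycle at 3 GHz
print(cycles_for(ASSUMED_LATENCY_NS, 3.0))  # 300.0 cycles spent waiting
```

Under that assumed latency, a 3 GHz core would sit through hundreds of cycles for a single access that misses the caches, which is why "one load = one cycle" does not hold in general.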
