Because the width of the data bus and the size of the smallest addressable unit are two separate things.
Just because you can specify addresses at the byte level does not mean you have to have an 8-bit data bus. Most (possibly all) modern x86 processors use a 64-bit data bus, and every time they read from memory, they read 64 bits. If you only requested 8 bits, the excess is simply discarded.
If you request more than 64 bits (for example, when loading 128-bit SSE registers), then there will be multiple memory accesses.
Many processors also have a concept of alignment, which basically means that every memory access is on an address divisible by the data bus width in bytes. Most can still fetch unaligned memory, but if the request crosses an alignment boundary (for example, requesting 32 bits at address 0xFE on a 64-bit aligned system, which spans 0xFE through 0x101), you'll get multiple memory accesses, even if it would otherwise fit in the data bus.
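As a rough illustration of the alignment rule, here is a small Python sketch that counts how many bus-width transfers a request needs. It assumes a simplified model: an 8-byte (64-bit) bus that only transfers naturally aligned blocks, with no caches. The name `bus_transfers` is made up for this example.

```python
BUS_BYTES = 8  # assumed 64-bit data bus


def bus_transfers(address, size, bus_bytes=BUS_BYTES):
    """Count the aligned bus-width blocks touched by `size` bytes at `address`."""
    if size <= 0:
        raise ValueError("size must be positive")
    first_block = address // bus_bytes
    last_block = (address + size - 1) // bus_bytes
    return last_block - first_block + 1


# One byte still costs a full transfer; the other 7 bytes are discarded.
print(bus_transfers(0x103, 1))
# An aligned 32-bit read fits in one transfer.
print(bus_transfers(0x100, 4))
# A 32-bit read that straddles the boundary at 0x108 needs two transfers.
print(bus_transfers(0x106, 4))
# A 128-bit read needs two transfers even when aligned.
print(bus_transfers(0x100, 16))
```

Real hardware is more complicated (caches, wider internal paths), so treat this only as a model of the idea.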
Here are a few other notes regarding some aspects of your question:
- A single memory access takes longer than one CPU clock cycle. Much, MUCH longer if it's not in L1 cache. See this post for rough orders of magnitude, and keep in mind that 1 nanosecond = 1 clock cycle at 1 GHz. Many desktop and laptop CPUs these days can run upwards of 3 GHz, or less than 0.333... nanoseconds per cycle.
- One clock cycle does not equal one instruction. Instructions (even those that stay entirely within the CPU, not accessing any memory or peripherals) can take multiple cycles to complete. Additionally, multiple instructions can be executing at the same time (and I'm not referring to multiple cores or hyperthreading here, I mean multiple instructions simultaneously executing on a single core, without hyperthreading).
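To make the cycle arithmetic concrete, here is a short Python sketch that converts between clock frequency, cycle time, and latency measured in cycles. The function names are invented for this example, and the 100 ns latency is an assumed round figure, not a measurement.

```python
def cycle_time_ns(freq_ghz):
    """Duration of one clock cycle in nanoseconds."""
    return 1.0 / freq_ghz


def latency_in_cycles(latency_ns, freq_ghz):
    """How many clock cycles elapse during a delay of `latency_ns`."""
    return latency_ns * freq_ghz


# At 1 GHz one cycle lasts exactly one nanosecond.
print(cycle_time_ns(1.0))
# At 3 GHz a cycle lasts about a third of a nanosecond.
print(cycle_time_ns(3.0))
# Assumption: a memory access that takes 100 ns costs this many cycles at 3 GHz.
print(latency_in_cycles(100, 3.0))
```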
Unlike with Java or C# I can't just use google as well, since Assembly just isn't used by many anymore.
I don't think this is accurate: I found dozens of helpful articles and presentations by searching "understanding assembly language".
Further, you will find the search terms x64 and "instruction set" helpful. The following describes additional search terms you might use to dig deeper.
There are many different kinds of CPUs. Each uses an instruction set architecture.
The instruction set architecture describes the various instructions that the CPU can execute. These instructions have encodings and so are stored as bit patterns. A program consists of sequences of instructions that a given CPU can execute. The language of a program using these bit patterns is called machine code.
Assembly language refers to a human-readable version of machine code. Instructions are specified using mnemonic instruction names and operands that can be read and edited as text. Assembly language is compiled (assembled) into machine code by a program called an assembler. Assembly language has many features that make the source code more readable and maintainable than machine code. For example, assembly language uses labels, whereas machine code uses offsets. Inserting a new instruction into a program in assembly language is easy, but doing so in machine code is hard because it throws off other offsets being used nearby. So, assembly language is much preferred.
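The labels-versus-offsets point can be sketched with a toy two-pass "assembler" in Python. Everything here is hypothetical: the mnemonics (`load`, `dec`, `jnz`, `halt`, `nop`) are made up, each instruction is assumed to occupy one slot, and `resolve_labels` is an illustrative name, not part of any real assembler.

```python
def resolve_labels(lines):
    """Replace label operands with instruction indices (toy model)."""
    labels = {}
    instructions = []
    # Pass 1: record the index each label refers to.
    for line in lines:
        line = line.strip()
        if line.endswith(":"):
            labels[line[:-1]] = len(instructions)
        elif line:
            instructions.append(line)
    # Pass 2: substitute label operands with the recorded indices.
    resolved = []
    for instr in instructions:
        mnemonic, *operands = instr.split()
        operands = [str(labels.get(op, op)) for op in operands]
        resolved.append(" ".join([mnemonic] + operands))
    return resolved


program = ["load 3", "loop:", "dec", "jnz loop", "halt"]
print(resolve_labels(program))

# Inserting an instruction moves the jump target; the label absorbs the change.
print(resolve_labels(["nop"] + program))
```

In the first program the jump resolves to slot 1, in the second to slot 2, without anyone editing the `jnz loop` line by hand.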
The Application Binary Interface (ABI) for a given ISA determines conventions of register and stack usage, such that a function written by one author can "call" a function written by another author (provided they both adhere to the convention). Another useful term here is "calling convention", which is the part of an ABI that specifically describes how parameters are passed from one function to another. Also relevant is the term stack frame.
It is useful to understand the difference between what is allowed/supported by the hardware ISA and what is allowed/supported by the software convention of the ABI.
This Wikipedia article lists the registers on x64 (https://en.wikipedia.org/wiki/X86-64#Architectural_features), and this article illustrates the x86's overlapping register names (https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/x64-architecture).
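The overlapping names can be modeled with plain bit masks. This Python sketch takes a 64-bit value for RAX and shows which bits the narrower names refer to; `register_views` is a name made up for this example.

```python
def register_views(rax):
    """Return the values seen through the overlapping names of RAX."""
    rax &= 0xFFFFFFFFFFFFFFFF  # keep 64 bits
    return {
        "RAX": rax,                 # all 64 bits
        "EAX": rax & 0xFFFFFFFF,    # low 32 bits
        "AX": rax & 0xFFFF,         # low 16 bits
        "AL": rax & 0xFF,           # low 8 bits
        "AH": (rax >> 8) & 0xFF,    # bits 8 through 15
    }


for name, value in register_views(0x1122334455667788).items():
    print(f"{name:>3} = {value:#x}")
```

This only models reading the names. Writing to them has extra rules on real hardware (for example, writing EAX in 64-bit mode zeroes the upper half of RAX).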
If you don't want to adhere to the standard conventions (ABI), you can create your own, which is often done in simple programs by students learning to write assembly language.
Beyond these search terms, you should consider writing some simple C programs that illustrate your questions, compile them (with and/or without optimization) and look at the compiler output as disassembly or in a debugger to see how the instruction set is used to manipulate data.
NASM is a specific assembler for the x86/x64 architecture, but certainly not the only one. Questions specifically about NASM would concern the different syntaxes and expressions you can write in that assembler.
Your question about registers and the stack should be directed more to the instruction set architecture and the calling convention than toward any specific assembler. While the instruction set allows certain operations regardless of the operating system, the ABI differs somewhat between Linux and Windows, so there are some differences in register usage and stack usage.
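As one example of that difference, here is a Python lookup of where the first few integer or pointer arguments go under the two common x64 conventions: System V AMD64 (used on Linux) and the Microsoft x64 convention (used on Windows). The register lists come from those ABIs; the dictionary keys and the `arg_location` helper are names invented for this sketch, and floating-point arguments, which use XMM registers, are left out.

```python
INTEGER_ARG_REGISTERS = {
    "sysv": ["rdi", "rsi", "rdx", "rcx", "r8", "r9"],  # Linux
    "win64": ["rcx", "rdx", "r8", "r9"],               # Windows
}
RETURN_REGISTER = "rax"  # integer return value in both conventions


def arg_location(abi, index):
    """Where the integer argument at zero-based `index` is passed."""
    registers = INTEGER_ARG_REGISTERS[abi]
    return registers[index] if index < len(registers) else "stack"


for abi in INTEGER_ARG_REGISTERS:
    print(abi, [arg_location(abi, i) for i in range(7)])
```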
A stack frame can use a single stack pointer or both a stack pointer and a frame pointer. The stack pointer can move during the execution of a function, so the offset of stack allocated variables relative to the stack pointer can change. A frame pointer remains fixed during the execution of a function, and thus variables located in the stack can be referred to by a fixed offset from the frame pointer even as the stack pointer moves from pushing and popping.
The frame pointer approach is easier to use and makes debugging easier, and it may also support stack unwinding and exception handling. However, it is somewhat less efficient (as it ties up a second register and needs a few extra instructions to save, establish, and restore the frame pointer).
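The fixed-versus-moving offset idea can be simulated in a few lines of Python. This is a sketch under assumptions: a stack that grows downward, 8-byte slots, and a hypothetical `Frame` class whose layout is chosen only for illustration.

```python
WORD = 8  # assumed slot size in bytes


class Frame:
    """Toy model of one function's stack frame on a downward-growing stack."""

    def __init__(self, sp, locals_bytes):
        sp -= WORD          # slot for the caller's saved frame pointer
        self.fp = sp        # frame pointer stays here for the whole function
        self.sp = sp - locals_bytes  # room for local variables

    def push(self):
        self.sp -= WORD

    def pop(self):
        self.sp += WORD

    def offset_from_fp(self, address):
        return address - self.fp

    def offset_from_sp(self, address):
        return address - self.sp


frame = Frame(sp=0x1000, locals_bytes=16)
local = frame.fp - 8  # address of one local variable

print(frame.offset_from_fp(local), frame.offset_from_sp(local))
frame.push()  # a temporary value is pushed, so the stack pointer moves
print(frame.offset_from_fp(local), frame.offset_from_sp(local))
```

The offset from the frame pointer is the same before and after the push, while the offset from the stack pointer changes.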
The x86 architecture has a long lineage. If you see RAX, RBX, RSP, RBP, these are names of registers in the 64-bit extension of this architecture. EAX, EBX are names of the 32-bit registers, and you may see these in 32-bit code or 64-bit code. Any given program should be intended for either the 64-bit architecture (x64) or the 32-bit architecture (x86), but not both mixed together. Therefore we can look at how registers like SP and BP are used to see which architecture a program targets (RSP/RBP for 64-bit and ESP/EBP for 32-bit).
In the original 16-bit 8086, AX was a favored register since encodings that target that register are shorter than other instructions. Further, multiplies and divides target the AX/DX register pair. Many of these special register uses have been removed in favor of the registers being more general purpose as the architecture has evolved to 32 bits and 64 bits. This evolution is friendlier toward compilers and hence high-level languages. These architectures still have a dedicated stack pointer and instructions that implicitly target this register. However, the other registers today are general-purpose registers. Once again I bring up the calling convention, which will tell you which register is used, for example, to pass the first argument or to return a value.
Best Answer
That's a pretty old and mostly obsolete model for virtual memory layout.
In reality, instructions and global data each start at some separate random location. Linked libraries are mapped separately from the main program, at their own random locations.
The heap is created by asking for blank pages as needed. These may or may not be contiguous to previously returned pages of memory.
The first and last chunks of virtual memory are often reserved and marked non-accessible to catch null-pointer-related bugs.
Even after all that, there is enough free address space left to reserve room (again at random locations) for the stack of each thread created.