A register-based CPU architecture has one or more general purpose registers (where "general purpose register" excludes special purpose registers, like stack pointer and instruction pointer).
An accumulator-based CPU architecture is a register-based CPU architecture that only has one general purpose register (the accumulator).
The main advantages of "more than one general purpose register" are that the compiler doesn't have to "spill" as many temporary values onto the stack, and that it's easier for the CPU to do more independent instructions in parallel.
For an example, imagine you want to do a = (b - c) + (d - f) + 123. For an "apples vs apples" comparison I'll use Intel syntax 32-bit 80x86 assembly for both examples (but only use EAX for the accumulator-based CPU architecture).
For accumulator-based CPU architecture this may be:
mov eax,[b] ;Group 1
sub eax,[c] ;Group 2
add eax,123 ;Group 3
mov [a],eax ;Group 4
mov eax,[d]
sub eax,[f] ;Group 5
add [a],eax ;Group 6
Note that most of these instructions depend on the result from the previous instruction, and therefore can't be done in parallel. The ";Group N" comments are there to indicate which groups of instructions can be done in parallel (and show that, assuming some form of internal "register renaming" ability, "group 4" is the only group where 2 instructions are likely to be done in parallel).
Using multiple registers might give you:
mov eax,[b] ;Group 1
mov ebx,[d]
sub eax,[c] ;Group 2
sub ebx,[f]
lea eax,[eax+ebx+123] ;Group 3
mov [a],eax ;Group 4
In this case, there's one less instruction, and two fewer groups of instructions (more instructions likely to be done in parallel). That might mean "25% faster" in practice.
Of course in practice code does more than a relatively simple calculation, so there's even more chance of "more instructions in parallel". For example, with only 2 more registers (e.g. ECX and EDX) it should be easy to see that you could do a = (b - c) + (d - f) + 123 and g = (h - i) + (j - k) + 456 in the same amount of time (by doing both calculations in parallel with different registers); and it should also be easy to see that for an accumulator-based CPU architecture you can't do the calculations in parallel (two calculations would take twice as long as one calculation).
Note: There is at least one "potential technical inaccuracy" in what I've written here (mostly involving the theoretical capabilities of register renaming and its application to accumulator-based CPU architectures). This is deliberate. I find that going into too much detail (in an attempt to be "100% technically correct" and cover all the little corner cases) makes it significantly harder for people to understand the relevant parts.
To understand how alignment affects things, let's look at a larger context.
First, as you note, 2600 bytes of UTF-8 (or any kind of data) will indeed take 2600 bytes.
If you allocate 2600 bytes from the heap using malloc(2600), e.g. in C, then since malloc does not accept alignment information, it will not know that your intent is to store only individual bytes; it assumes the worst case, which is that you're using the memory for the largest native type that the processor supports. On a 64-bit processor that is going to be 16 bytes, which is rather large.
So, the memory allocator locates free memory that matches the 16-byte alignment (and is at least 2600 bytes in length). A later memory allocation via malloc will also be rounded up to 16-byte alignment, so there will be a small gap between the 2600 byte chunk and the next memory block returned by malloc, because 2600 is an exact multiple of 8 but not of 16. (There are also potentially other overheads associated with each malloc block as well.)
Both Linux & Windows offer an aligned malloc; however, Linux explicitly states that the minimum alignment is the pointer size. Even on Windows, which doesn't say, it is clear from the documentation that the authors expect larger alignments to be requested, not smaller ones.
C will create structures with proper field alignment for the target platform, meaning that it will insert unused pad bytes within a struct if the preceding fields do not lay out such that the proper alignment can be had. For example:
struct S {
    char c;
    int i;
};
Struct S declares c as a 1-byte item, and it will be at offset 0 in the struct. The field i is, let's say, a 4-byte item. After c the next available offset is 1, but that is not suitably aligned for a 4-byte value, so the compiler will insert 3 padding bytes and use offset 4 for i, making sizeof(struct S) equal to 8, even though it only stores 5 bytes of information.
Let's also talk about endianness. If you have a string of bytes, then each successive byte is stored at the next higher byte address (just add 1 to the address to get to the next one). However, big-endian machines store the 4 bytes that make up a 4-byte word in the reverse order from little-endian machines. So, if you wanted to use word-sized access on your string "abcd", you would see a difference between a big-endian and a little-endian machine: the big-endian machine would give you 'abcd' whereas the little-endian machine would give you 'dcba'.
In summary, it is generally best not to use the same memory both as bytes and as words (at the same time): if it holds bytes, then use byte-sized access, and if it holds words, then use word-sized access. Note this will happen naturally unless you do "bad things" like cast pointers to a type other than what they originally pointed to. (There are times when you might need to, and authors of routines like memcpy and memmove play some tricks for performance.) We can also note that it is not even possible to mix byte-sized and word-sized accesses (for the same data/object/array) in a language like Java (without resorting to serialization), since it doesn't offer the low-level feature of casting pointers.
The compiler and runtime (e.g. malloc) cooperate to make sure your data is properly aligned (perhaps even over-aligned). For example, the stack, before main, should be initially aligned (by the runtime) to at least a 16-byte boundary, and then the compiler can create stack frames that are rounded up to a 16-byte size so the stack and all local variables remain aligned during function calls. Global variables get similar treatment, and we have already discussed heap allocations.
Best Answer
It means register, and it isn't all for historical reasons.
The historical part is that Intel got itself into the habit of enumerating registers with letters with the 8008 (A through E plus H and L). That scheme was more than adequate at the time because microprocessors had very few registers and weren't likely to get more, and most designs did it. The prevailing sentiment then was that software would be rewritten for new CPUs as they appeared, so changing the register naming scheme between models wouldn't have been a big deal. Nobody foresaw the 8088 evolving into a "family" after being incorporated into the IBM PC, and the yoke of backward compatibility pretty much forced Intel into having to adopt schemes like the "E" on 32-bit registers to maintain it.
The non-historical part is all practical. Using letters for general-purpose registers limits you to 26, fewer if you weed out those that might cause confusion with the names of special-purpose registers like the program counter, flags or the stack pointer.
I don't have a source to confirm it, but I suspect the choice of R as a prefix and the introduction of R8 through R15 on 64-bit CPUs signals a transition to numbered registers, which have been the norm among 32-bit-and-larger architectures not derived from the 8008 for almost half a century. IBM did it in the 1960s with the System/360 and has been followed by the PowerPC, DEC Alpha, MIPS, SPARC, ARM, Intel's i860 and i960 and a bunch of others that are long forgotten. You'll note that the existing registers would fit nicely into R0 through R7 if those names existed, and it wouldn't surprise me a bit if they're treated that way internally. The existing long registers (RAX/EAX/AX/AL, RBX/EBX/BX/BL, etc.) will probably stay around until the sun burns out.