How are “registers” implemented in VMs

implementations · performance · virtual-machine

Process VMs (such as the Oracle JVM, CPython, .NET CLR etc.) are usually stack-based or register-based.

Are the "registers" in a register-based VM actually the registers of the underlying physical CPU? Or are they simply allocated memory bytes in RAM? How is this usually implemented?

And what about a stack-based VM: Is there direct utilization of specific physical CPU registers, or are the VM memory slots simply allocated memory buffers in RAM?

How does this relate to register-based VMs supposedly being faster than stack-based ones?

Best Answer

The JVM and .NET both employ JITs (just-in-time compilers).

These JITs take their respective byte code (both stack-based) and translate it into machine code for the processor they're running on.  Thus, the same hardware that is available to C and C++ is at the disposal of the JIT, and it is generally used well.  Some of these JITs lack the code-generation power of LLVM, but they have access to and use the full CPU register set, whether x86, x64 (which has more registers), or another architecture.  When JITing, the stack is analyzed and some stack locations are assigned CPU registers, while others are given memory locations (on the thread stack).

This mapping is facilitated by a weakening of these VMs' stack model, such that the stack can be removed entirely from the translated machine code.  The way this works is that their stacks are restricted to providing arguments for operations, and cannot be used like a real stack for dynamically pushing and popping.  For example, you cannot write one loop that pushes items onto the stack and a later loop that pops them off: pushes and pops must be (locally) balanced.  This enables total elimination of the VM's stack at runtime in the translated machine code, because there can be a 1:1 mapping between the VM stack locations and the physical resources of CPU registers and (machine code) thread-stack locations; there is otherwise no need to manifest any notion of the VM stack.  Put another way, the stack in these VMs is not a full-blown, traditional data structure like the "stack" you might find in high-level language programming, which can hold an arbitrary amount of data for an arbitrary lifetime.
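To make this concrete, here is a minimal sketch (with invented opcodes and register names) of why local balance matters: because stack depth at every point is statically known, each depth can be renamed to a fixed "virtual register" at translation time, and no runtime stack survives:

```python
def rename_stack_slots(bytecode):
    """Translate balanced stack bytecode to three-address form,
    assigning one virtual register per static stack depth.
    Opcodes and the r0, r1, ... names are invented for illustration."""
    depth = 0
    out = []
    for op, *args in bytecode:
        if op == "push":                # push local variable args[0]
            out.append(("mov", f"r{depth}", args[0]))
            depth += 1
        elif op == "add":               # pop two operands, push their sum
            depth -= 1
            out.append(("add", f"r{depth-1}", f"r{depth-1}", f"r{depth}"))
        elif op == "pop":               # pop into local variable args[0]
            depth -= 1
            out.append(("mov", args[0], f"r{depth}"))
    assert depth == 0                   # balanced: the stack is fully gone
    return out

prog = [("push", "a"), ("push", "b"), ("add",), ("pop", "c")]
translated = rename_stack_slots(prog)
```

A JIT would then allocate each of these virtual registers to either a real CPU register or a thread-stack slot.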

A JIT for a register-based VM would convert register assignments into machine code that accesses the CPU registers directly, similar to stack-based JITs.  Likewise, some VM registers would be mapped directly and exclusively to CPU registers, while others would be mapped to memory locations (on the thread stack).

There are some VM implementations that interpret rather than JIT.  For them, a hand-optimized primary interpreter loop is probably the best they can do.  With hand-optimized assembly we can make some improvements in jump tables, in register usage spanning a large interpreter switch statement, and the like.
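The interpreter loop in question is the classic fetch/decode/execute cycle.  A toy sketch (invented opcodes; real VMs hand-tune this in C or assembly, often with computed-goto dispatch):

```python
def interpret(code, regs):
    """Minimal register-based interpreter loop: fetch, decode,
    execute, advance.  Opcodes here are invented for illustration."""
    pc = 0
    while pc < len(code):
        op, *args = code[pc]            # fetch + decode
        if op == "li":                  # load immediate: regs[d] = imm
            d, imm = args
            regs[d] = imm
        elif op == "add":               # regs[d] = regs[s1] + regs[s2]
            d, s1, s2 = args
            regs[d] = regs[s1] + regs[s2]
        elif op == "halt":
            break
        pc += 1                         # advance to next instruction
    return regs

regs = interpret([("li", 0, 2), ("li", 1, 3), ("add", 2, 0, 1), ("halt",)],
                 [0] * 4)
```

Every trip around that `while` loop is dispatch overhead that JIT-compiled code simply doesn't pay.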

Broadly speaking, it is difficult to use a large number of CPU registers as a stack because, unlike memory, the CPU registers are not addressable: they cannot be indexed like memory (or an array); they can only be named in instructions.  You can cache the top few stack items in CPU registers, but while that removes some overhead it adds other overhead, leading to diminishing returns as more CPU registers are devoted to the VM stack.  (Under the covers of the hardware, the CPU's highly parallelized hardware interpreter loop can and does address the registers as an array (sometimes called the register file) using an index number, but this capability is not exposed to machine code instructions.)
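A sketch of the top-of-stack caching idea (the `tos` field stands in for a CPU register; the list stands in for RAM): pushes spill the old cached top to memory, pops reload from it, which is exactly the extra bookkeeping that causes the diminishing returns mentioned above:

```python
class CachedStack:
    """Illustrative only: keeps the top stack item in a 'register'
    (the tos attribute) instead of memory (the mem list)."""
    def __init__(self):
        self.tos = None     # cached top of stack (would be a CPU register)
        self.mem = []       # spilled slots (would be RAM)

    def push(self, v):
        if self.tos is not None:
            self.mem.append(self.tos)   # spill old top to memory
        self.tos = v

    def pop(self):
        v = self.tos
        # refill the cache from memory, if anything was spilled
        self.tos = self.mem.pop() if self.mem else None
        return v

s = CachedStack()
s.push(1)
s.push(2)
```

Caching more than a few items this way multiplies the spill/refill cases, which is why interpreters usually stop at one or two cached slots.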

Given JIT technology, there's no reason for a register-based VM bytecode to be faster than a stack-based bytecode, as the objective of both is to remove the bytecode and replace it with native machine code for the processor.

An interpreter that runs register-based byte code would in some cases execute a smaller number of larger instructions, which could equate to faster interpreter execution, since the basic instruction fetch, decode, and execute loop is one of the main bottlenecks in software interpretation.  For simple code sequences, register-based machines typically use larger instructions: for example, instead of the sequence Push a; Push b; Add; Pop c, you might have a single Add c,a,b instruction.  However, such an advantage is not consistent across different workloads, as stacks have advantages in complex expression evaluation and in parameter passing (which is a big deal, as function calls are numerous).
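The dispatch-count difference can be shown with a toy comparison (invented encodings): the same statement c = a + b costs four dispatches in a stack encoding but one in a register encoding:

```python
def run_stack(code, env):
    """Toy stack interpreter; counts trips through the dispatch loop."""
    stack, dispatches = [], 0
    for op, *args in code:
        dispatches += 1                     # one dispatch per instruction
        if op == "push":
            stack.append(env[args[0]])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "pop":
            env[args[0]] = stack.pop()
    return dispatches

def run_reg(code, env):
    """Toy register interpreter for three-address instructions."""
    dispatches = 0
    for op, d, s1, s2 in code:
        dispatches += 1
        if op == "add":
            env[d] = env[s1] + env[s2]
    return dispatches

env = {"a": 2, "b": 3}
n_stack = run_stack([("push", "a"), ("push", "b"), ("add",), ("pop", "c")], env)
n_reg = run_reg([("add", "c", "a", "b")], {"a": 2, "b": 3})
```

Both compute the same result; the register form pays the fetch/decode overhead once instead of four times, at the cost of a wider instruction.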

(Sophisticated hardware can see a sequence like Push a; Push b; Add; Pop c, and execute that as a single fused instruction, so the apparent disadvantage can also dissipate at the hardware level.)
