Writing a JIT compiler in assembly

assemblyccode generationjitself-modifying

I've written a virtual machine in C which has decent performance for a non-JIT VM, but I want to learn something new, and improve performance. My current implementation simply uses a switch to translate from VM bytecode to instructions, which is compiled to a jump table. Like I said, decent performance for what it is, but I've hit a barrier that can only be overcome with a JIT compiler.

I've already asked a similar question not long ago about self-modifying code, but I came to realize that I wasn't asking the right question.

So my goal is to write a JIT compiler for this C virtual machine, and I want to do it in x86 assembly. (I'm using NASM as my assembler) I'm not quite sure how to go about doing this. I'm comfortable with assembly, and I've looked over some self-modifying code examples, but I haven't come to figure out how to do code generation just yet.

My main block so far is copying instructions to an executable piece of memory, with my arguments. I'm aware that I can label a certain line in NASM, and copy the entire line from that address with the static arguments, but that's not very dynamic, and doesn't work for a JIT compiler. I need to be able to interpret the instruction from bytecode, copy it to executable memory, interpret the first argument, copy it to memory, then interpret the second argument, and copy it to memory.

I've been informed about several libraries that would make this task easier, such as GNU lightning, and even LLVM. However, I'd like to write this by hand first, to understand how it works, before using external resources.

Are there any resources or examples this community could provide to help me get started on this task? A simple example showing two or three instructions like "add" and "mov" being used to generate executable code, with arguments, dynamically, in memory, would do wonders.

Best Answer

I wouldn't recommend writing a JIT in assembly at all. There are good arguments for writing the most frequently executed bits of the interpreter in assembly. For an example of how this looks like see this comment from Mike Pall, the author of LuaJIT.

As for the JIT, there are many different levels with varying complexity:

Compile a basic block (a sequence of non-branching instructions) by simply copying the interpreter's code. For example, the implementations of a few (register-based) bytecode instructions might look like this:
```
; ebp points to virtual register 0 on the stack
instr_ADD:
    <decode instruction>
    mov eax, [ebp + ecx * 4]  ; load first operand from stack
    add eax, [ebp + edx * 4]  ; add second operand from stack
    mov [ebp + ebx * 4], eax  ; write back result
    <dispatch next instruction>
instr_SUB:
    ... ; similar
```
So, given the instruction sequence ADD R3, R1, R2, SUB R3, R3, R4 a simple JIT could copy the relevant parts of the interpreters implementation into a new machine code chunk:
```
    mov ecx, 1
    mov edx, 2
    mov ebx, 3
    mov eax, [ebp + ecx * 4]  ; load first operand from stack
    add eax, [ebp + edx * 4]  ; add second operand from stack
    mov [ebp + ebx * 4], eax  ; write back result
    mov ecx, 3
    mov edx, 4
    mov ebx, 3
    mov eax, [ebp + ecx * 4]  ; load first operand from stack
    sub eax, [ebp + edx * 4]  ; add second operand from stack
    mov [ebp + ebx * 4], eax  ; write back result
```
This simply copies the relevant code, so we need to initialise the registers used accordingly. A better solution would be to translate this directly into machine instructions mov eax, [ebp + 4], but now you already have to manually encode the requested instructions.

This technique removes the overheads of interpretation, but otherwise does not improve efficiency much. If the code is executed for only one or two times, then it may not worth it to first translate it to machine code (which requires flushing at least parts of the I-cache).
While some JITs use the above technique instead of an interpreter, they then employ a more complicated optimisation mechanism for frequently executed code. This involves translating the executed bytecode into an intermediate representation (IR) on which additional optimisations are performed.

Depending on the source language and the type of JIT, this can be very complex (which is why many JITs delegate this task to LLVM). A method-based JIT needs to deal with joining control-flow graphs, so they use SSA form and run various analyses on that (e.g., Hotspot).

A tracing JIT (like LuaJIT 2) only compiles straight line code which makes many things easier to implement, but you have to be very careful how you pick traces and how you link multiple traces together efficiently. Gal and Franz describe one method in this paper (PDF). For another method see the LuaJIT source code. Both JITs are written in C (or perhaps C++).

Related Solutions

What does a just-in-time (JIT) compiler do

A JIT compiler runs after the program has started and compiles the code (usually bytecode or some kind of VM instructions) on the fly (or just-in-time, as it's called) into a form that's usually faster, typically the host CPU's native instruction set. A JIT has access to dynamic runtime information whereas a standard compiler doesn't and can make better optimizations like inlining functions that are used frequently.

This is in contrast to a traditional compiler that compiles all the code to machine language before the program is first run.

To paraphrase, conventional compilers build the whole program as an EXE file BEFORE the first time you run it. For newer style programs, an assembly is generated with pseudocode (p-code). Only AFTER you execute the program on the OS (e.g., by double-clicking on its icon) will the (JIT) compiler kick in and generate machine code (m-code) that the Intel-based processor or whatever will understand.

When is assembly faster than C?

Here is a real world example: Fixed point multiplies on old compilers.

These don't only come handy on devices without floating point, they shine when it comes to precision as they give you 32 bits of precision with a predictable error (float only has 23 bit and it's harder to predict precision loss). i.e. uniform absolute precision over the entire range, instead of close-to-uniform relative precision (float).

Modern compilers optimize this fixed-point example nicely, so for more modern examples that still need compiler-specific code, see

Getting the high part of 64 bit integer multiplication: A portable version using uint64_t for 32x32 => 64-bit multiplies fails to optimize on a 64-bit CPU, so you need intrinsics or __int128 for efficient code on 64-bit systems.
_umul128 on Windows 32 bits: MSVC doesn't always do a good job when multiplying 32-bit integers cast to 64, so intrinsics helped a lot.

C doesn't have a full-multiplication operator (2N-bit result from N-bit inputs). The usual way to express it in C is to cast the inputs to the wider type and hope the compiler recognizes that the upper bits of the inputs aren't interesting:

// on a 32-bit machine, int can hold 32-bit fixed-point integers.
int inline FixedPointMul (int a, int b)
{
  long long a_long = a; // cast to 64 bit.

  long long product = a_long * b; // perform multiplication

  return (int) (product >> 16);  // shift by the fixed point bias
}

The problem with this code is that we do something that can't be directly expressed in the C-language. We want to multiply two 32 bit numbers and get a 64 bit result of which we return the middle 32 bit. However, in C this multiply does not exist. All you can do is to promote the integers to 64 bit and do a 64*64 = 64 multiply.

x86 (and ARM, MIPS and others) can however do the multiply in a single instruction. Some compilers used to ignore this fact and generate code that calls a runtime library function to do the multiply. The shift by 16 is also often done by a library routine (also the x86 can do such shifts).

So we're left with one or two library calls just for a multiply. This has serious consequences. Not only is the shift slower, registers must be preserved across the function calls and it does not help inlining and code-unrolling either.

If you rewrite the same code in (inline) assembler you can gain a significant speed boost.

In addition to this: using ASM is not the best way to solve the problem. Most compilers allow you to use some assembler instructions in intrinsic form if you can't express them in C. The VS.NET2008 compiler for example exposes the 32*32=64 bit mul as __emul and the 64 bit shift as __ll_rshift.

Using intrinsics you can rewrite the function in a way that the C-compiler has a chance to understand what's going on. This allows the code to be inlined, register allocated, common subexpression elimination and constant propagation can be done as well. You'll get a huge performance improvement over the hand-written assembler code that way.

For reference: The end-result for the fixed-point mul for the VS.NET compiler is:

int inline FixedPointMul (int a, int b)
{
    return (int) __ll_rshift(__emul(a,b),16);
}

The performance difference of fixed point divides is even bigger. I had improvements up to factor 10 for division heavy fixed point code by writing a couple of asm-lines.

Using Visual C++ 2013 gives the same assembly code for both ways.

gcc4.1 from 2007 also optimizes the pure C version nicely. (The Godbolt compiler explorer doesn't have any earlier versions of gcc installed, but presumably even older GCC versions could do this without intrinsics.)

See source + asm for x86 (32-bit) and ARM on the Godbolt compiler explorer. (Unfortunately it doesn't have any compilers old enough to produce bad code from the simple pure C version.)

Modern CPUs can do things C doesn't have operators for at all, like popcnt or bit-scan to find the first or last set bit. (POSIX has a ffs() function, but its semantics don't match x86 bsf / bsr. See https://en.wikipedia.org/wiki/Find_first_set).

Some compilers can sometimes recognize a loop that counts the number of set bits in an integer and compile it to a popcnt instruction (if enabled at compile time), but it's much more reliable to use __builtin_popcnt in GNU C, or on x86 if you're only targeting hardware with SSE4.2: _mm_popcnt_u32 from <immintrin.h>.

Or in C++, assign to a std::bitset<32> and use .count(). (This is a case where the language has found a way to portably expose an optimized implementation of popcount through the standard library, in a way that will always compile to something correct, and can take advantage of whatever the target supports.) See also https://en.wikipedia.org/wiki/Hamming_weight#Language_support.

Similarly, ntohl can compile to bswap (x86 32-bit byte swap for endian conversion) on some C implementations that have it.

Another major area for intrinsics or hand-written asm is manual vectorization with SIMD instructions. Compilers are not bad with simple loops like dst[i] += src[i] * 10.0;, but often do badly or don't auto-vectorize at all when things get more complicated. For example, you're unlikely to get anything like How to implement atoi using SIMD? generated automatically by the compiler from scalar code.

Best Answer

Related Solutions

What does a just-in-time (JIT) compiler do

When is assembly faster than C?

Related Topic