Compiler Design – Implementing Naive Register Allocation for x86 Machines

compiler

I'm writing a toy compiler targeting on x86-64 machines. But I confront several problems when implementing register allocation with linear intermediate representation. I use NASM and GCC to generating executable file.

In intermediate representation of the origin code, we assume there are infinite registers(memory is regarded as a kind of virtual register). I divided them into two types:

Single registers: This kind of virtual register represent a typical variable.
Offset registers: This kind of virtual register is used to represent the access to an element of an array or a member of a struct. For example, when translating instructions such like a.x = 3 and b[i] = 3 into intermediate representation, we may represent it as mov [$a + 0], 3 and mov [$b + $i * 4], 3(4 is the size of an integer), where $a, $b and $i are single registers and [$a + 0] and [$b + $i * 4] are offset registers.

However, in x86-64 machine, memory access can only be represented as [$reg1 + $reg2 * imm1 + imm2], where $reg1 and $reg2 are physical registers and imm1 and imm2 are immediate number. I cannot understand how register allocation algorithms deal with the case that the algorithm marks $va and $vb as spilled node with instruction mov [$va + $vb * 4], 3. In other words, $va and $vb must be a physical register rather than memory access, but if a virtual register is marked as spilled node, it will be regarded as a memory access to the stack frame.

For example, I get the following origin C-like code:

int main() {
    int [] a = new int[10];
    int i = 3;
    a[i] = 4;
}

And I translate it into following intermediate representation:

call malloc, 10
mov $a, $retval       ; $retval is the return value of malloc
mov $i, 3
mov [$a + $i * 4], 4

However, after I have allocated registers, I find the final code becomes:

push rbp
mov rbp, rsp
sub rsp, 16
mov rdi, 10
call malloc
mov qword [rbp-8], rax
mov qword [rbp-16], 3
mov [qword [rbp-8] + qword [rbp-16] * 4], 3  <- Compliation error occurs here

I wonder if there is a nice implementation to solve this problem. Thanks in advance.

UPD: Basile-Starynkevitch gave me a paper showing how GCC solve this problem. And I'm looking for some (probably) simpler methods.

Best Answer

The primary problem with your approach is that by lowering your code to the level of assembly instructions before you've allocated values to registers, you lose the ability to vary the assembly you need to produce on the basis of where a value needs to come from. If, however, you defined a more abstract representation that could then be translated down to assembly language after register allocation, that would make matters easier.

Consider a toy expression-calculation language, and the expression a + 2*b + c[a+d]. This might be reduced to an intermediate representation that looks something like this:

v1 <= int32[a]       ; meaning: load the 32-bit int value at address "a"
v2 <= int32[b]
v3 <= 2*v2  (~v2)    ; (~nnn) means we no longer need to track this value. we could
                     ; determine this automatically, but for simplicity it's
                     ; easier to have it defined statically
v4 <= v1 + v3 (~v1,v3)
v5 <= int32[a]
v6 <= int32[d]
v7 <= v5 + v6 (~v5,v6)
v8 <= int32[c + 4*v7] (~v7)
v9 <= v4 + v8 (~v4,v8)

Using a very simple (and somewhat naive) register allocator that can only allocate two registers, EAX and EBX (a restriction put in place so that we can show what happens when you run out of registers), you may start by annotating the locations of each live value:

v1 <= int32[a]              [EAX=v1, EBX=-]
v2 <= int32[b]              [EAX=v1, EBX=v2]
v3 <= 2*v2  (~v2)           [EAX=v1, EBX=v3]
v4 <= v1 + v3 (~v1,v3)      [EAX=v4, EBX=-]
v5 <= int32[a]              [EAX=v4, EBX=v5]
v6 <= int32[d]

... OK, so we don't have a register available here. So we need to temporarily store a value. Because this register allocator is hopelessly naive, it doesn't realise that there are better ways than storing a value that came directly from memory into a temporary memory location. It just picks the value that will not be needed for the longest and adds an extra instruction that stores that into temporary storage:

t1 <= v4                     [EAX=-, EBX=v5, t1=v4] 
v6 <= int32[d]               [EAX=v6, EBX=v5, t1=v4]
v7 <= v5 + v6 (~v5,v6)       [EAX=v7, EBX=-, t1=v4]
v8 <= int32[c + 4*v7] (~v7)  [EAX=v8, EBX=-, t1=v4]
v9 <= v4 + v8 (~v4,v8)

So now it needs to use a value from temporary storage, so it inserts another instruction to pull it back in:

v4 <= t1                     [EAX=v8, EBX=v4, t1=v4]
v9 <= v4 + v8 (~v4,v8)       [EAX=v9]

So now the registers are allocated, we translate the intermediate code into assembly language:

; local variables: a = [ebp+4], b=[ebp+8], c=[ebp+12], d=[ebp+40]
; temporary storage available at [ebp-8]
;v1 <= int32[a]              [EAX=v1, EBX=-]
mov eax, dword [ebp+4]
;v2 <= int32[b]              [EAX=v1, EBX=v2]
mov ebx, dword [ebp+8]
;v3 <= 2*v2  (~v2)           [EAX=v1, EBX=v3]
shl ebx, 1
;v4 <= v1 + v3 (~v1,v3)      [EAX=v4, EBX=-]
add eax, ebx
;v5 <= int32[a]              [EAX=v4, EBX=v5]
mov ebx, dword [ebp + 4]
;t1 <= v4                     [EAX=-, EBX=v5, t1=v4] 
mov dword [ebp-8], eax    
;v6 <= int32[d]               [EAX=v6, EBX=v5, t1=v4]
mov eax, dword [ebp + 40]
;v7 <= v5 + v6 (~v5,v6)       [EAX=v7, EBX=-, t1=v4]
add eax, ebx
;v8 <= int32[c + 4*v7] (~v7)  [EAX=v8, EBX=-, t1=v4]
mov eax, dword [ebp + eax*4 + 12] 
;v4 <= t1                     [EAX=v8, EBX=v4, t1=v4]
mov ebx, dword [ebp-8]
;v9 <= v4 + v8 (~v4,v8)       [EAX=v9]
add eax, ebx

The one remaining question is what would happen if the operation for calculating v8 were instead:

v8 <= int32[c + 33*v7 + 4] (~v7)  [EAX=v8, EBX=-, t1=v4]

Then, we'd need to generate a series of instructions, but that wouldn't be a problem. We know that EAX is available, because it's being clobbered by the result of this computation, so we can use it as temporary storage without needing to go through allocation:

imul eax, 33
mov eax, dword [ebp + eax + 16]