How to Handle Forward References in a Compiler

algorithmscompilercomputer science

I'm creating a very simple compiler PoC that only has

  • Procedures
  • Calls to procedures.

The syntax of the language is really simple:

Main 
{
    Call("Proc1");
}

Proc1 
{
    // Empty
}

I want to create an algorithm to get THIS STRUCTURE (that is, in fact, a directed graph)

Script
{
    Procedures:
    [#1] (Procedure)
    { 
        Name: "Main"
        Statements: 
        {
            (Call): 
            {
                Procedure: #2
            }
                
        }
    }
    [#2] (Procedure) { Name: "Proc1" }      
}

This is, a procedure "Main" that contains a call that holds a reference to the procedure "Proc1" (notice the # symbol to make this reference explicit).

Trying to do this in a simple pass end up with the popular "egg and chicken problem".

My second approach was to first do 2 passes:

  • 1st pass. Scan only the procedure definitions to create a list with them.
  • 2nd pass. Bind the actual procedures (with their call statements). When I find a call, I lookup the previously calculated list of procedure definitions, and link the call to it.

But doing so, what I end up with is calls that have a weak reference (by some indentifier) to the called procedure. But I need hard references to the procedure that is to be called. How can I achieve this?

Extra point

If possible, I would like to use immutable structures

Best Answer

We need to separate the nature of the source language. If cyclic references are forbidden, you can use a topological sort to bring your AST into an order that can be processed without encountering forward references [that haven't been materialized yet]. So i'll focus on cyclic references.

This should be the minimal example that illustrates the point in question:

SomeProc {
  Call("SomeProc")
}

For illustration purposes, this is the AST that this should yield:

Procedures:
  - Name: SomeProc
    Statements:
      - Type: Invocation
        Procedure: SomeProc

This AST already contains a cyclic reference; its just that its a logical one, not one that you can use in the compiler source code. If you want to turn that into an actual reference (one usable in the compiler program), you first need a place to store that reference. And since you cannot create both at the same time, some mutability is necessary. That's just the nature of forward references.

As to practical advise: i've written some interpreters for DSLs, and i've always taken this approach:

You have one set of classes that cleanly represents the AST, and just that. You have another set of classes that represents the semantical structure of the program (an intermediate representation/IR in compiler lingo). Instances of the IR classes created from the instances of the AST classes. Then, once you have IR instances, you resolve all references, mutably. E.g.:

class Context {
  Map<String, Procedure> procedures;
}

class IRProcedure {
  List<Statement> statements;
  
  void resolveReferences(Context context) {
    foreach (statement in statements) {
      statement.resolveReferences(context)
    }
  }
}
class Statement {
  String procedureName;
  Procedure procedure;

  void resolveReferences(Context context) {
    Procedure resolved = context.procedures.get(prcedureName);
    if (resolved == null) {
      // trigger compiler error: unknown function
    }
    this.procedure = resolved;
  }
}

The main flow of the compiler would then:

  1. parse the AST
  2. create Procedure instances for every procedure it finds.
  3. Create the context where procedures are indexed by name
  4. Call resolveReferences(context) on every procedure

And voilá, references set correctly.

Unfortunately, i don't see a way to have forward references be immutable if cyclic references are a possibility. If things get more involved (type checking, data structures, control flow, yada yada), then you can gain a lot in terms of immutability by introducing a third set of classes / another IR. But that goes beyond the scope of the question.

The number of passes

I have no clue how the number of passes are counted. I think two is reasonable if you reason like so:

  • Parsing the code into AST: Pass 1
  • Walking over the entire AST to resolve references: Pass 2

For each IR you introduce to help you deal with complexity the pass count arguably goes up by 1. But one could also argue that one walk over the AST/IR tree doesn't count as a pass, since you're only modifying the existing data structures already in memory. So if you have to restrain yourself to two passes, you can maybe use another IR with that argument.

Related Topic