Why does LLVM have an assembly-like IR rather than a tree-like IR? Or: why do projects target LLVM IR instead of clang’s AST?

clang compiler llvm programming-languages

Why is LLVM's intermediate representation (LLVM IR) assembly-like rather than tree-like?

Alternatively, why do language implementations target LLVM IR rather than clang's AST?

I'm not trying to ask two different questions at once if it seems that way. To me, it simply seems like both client and library programmers have come to the consensus that LLVM's API, nothing more and nothing less, is obviously good software design and my question is "why?".

The reason I ask is that it seems like LLVM could provide more functionality to frontends if its IR were AST-like, because then clang's AST-based tools could be used for any frontend. Alternatively, languages that target LLVM IR could get more functionality if they targeted clang's AST.

Clang has classes and functions for creating and working with ASTs, and it's the only frontend project that's strongly tied to the LLVM project, so why is clang's AST functionality external to LLVM?

Off the top of my head, I know that Rust (rustc), D (ldc), and Haskell (GHC) can all use LLVM as a backend but they don't use the Clang AST (as far as I know, I could be wrong). I don't know all the internal details of these compilers but at least Rust and D certainly seem like they could be compiled to clang's AST. Maybe Haskell could too, but I'm much less certain about that.

Is this because of historical reasons (LLVM originally being a "low-level virtual machine" and clang coming along later)? Is this because other frontends want to have as much control as possible over what they feed to LLVM? Are there fundamental reasons that clang's AST is inappropriate for "non-C-like" languages?

I don't intend this question to be an exercise in mindreading. I just want it to be helpful to those of us who are curious about, but not already fluent in, compiler design. Since the LLVM and clang projects are developed in public, I'm hoping that someone familiar with the development of these projects can answer, or that the answer is obvious enough to some compiler nerds that they feel confident enough to answer.


To pre-empt some obvious but unsatisfactory answers:

Yes, having an assembly-like IR gives more control to whoever crafts the IR (perhaps X lang has a better codebase and AST format than clang) but if that's the only answer, then the question becomes "why does LLVM only have an assembly-like IR instead of a high level tree-like IR and a low-level assembly-like IR?".

Yes, it's not that hard to parse a programming language into an AST (at least compared to the other steps of compiling). Even so, why use separate ASTs? If nothing else, using the same AST allows you to use tools that operate on ASTs (even just simple things like AST printers).

Yes, I strongly agree that being more modular is a good thing, but if that's the only reason, then why do other language implementations tend to target LLVM IR instead of clang's AST?

These pre-emptions might be erroneous or overlook details, so do feel free to give these answers if you have more details or my assumptions are mistaken.


For anyone wanting to answer a more definitively answerable question: what are the advantages and disadvantages of an assembly-like IR vs a tree-like IR?

Best Answer

There are a number of inter-related questions here; I'll try to separate them as best I can.

Why do other languages build on LLVM IR and not clang AST?

This is simply because clang is a C/C++ front end and the AST it produces is tightly coupled to C/C++. Another language could use it, but it would need near-identical semantics to some subset of C/C++, which is very limiting. As you point out, parsing to an AST is fairly straightforward, so restricting your semantic choices is unlikely to be worth the small saving.

However, if you're writing tooling for C/C++ (e.g. static analysers), then re-using the AST makes a lot of sense: it's much easier to work with the AST than with the raw text, provided you're working with C/C++.

Why is LLVM IR the form it is?

LLVM IR was chosen as an appropriate form for writing compiler optimisations. As such, its primary feature is that it's in SSA form. It's quite a low-level IR, so that it is applicable to a wide range of languages; e.g. it doesn't type memory, as this varies a lot across languages.
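To make the tree-versus-SSA distinction concrete, here is a minimal sketch in Python. The node classes (`Num`, `Var`, `BinOp`) and the `lower` helper are made up for illustration and are not clang's or LLVM's actual API; the printed text only imitates the look of LLVM IR. It flattens a small expression tree into a linear list of instructions in which every temporary is assigned exactly once, which is the defining property of SSA.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical tree-like IR: nested expression nodes (not clang's AST classes).
@dataclass
class Num:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class BinOp:
    op: str        # e.g. "add" or "mul"
    left: object
    right: object

def lower(expr, out, fresh):
    """Append flat instructions for expr to out; return the operand holding its value."""
    if isinstance(expr, Num):
        return str(expr.value)
    if isinstance(expr, Var):
        return "%" + expr.name
    lhs = lower(expr.left, out, fresh)
    rhs = lower(expr.right, out, fresh)
    dest = f"%t{next(fresh)}"          # a brand-new name: assigned exactly once
    out.append(f"{dest} = {expr.op} i32 {lhs}, {rhs}")
    return dest

def lower_expr(expr):
    out = []
    result = lower(expr, out, count())
    return out, result

# The tree for (a + 2) * (a + 2)
tree = BinOp("mul",
             BinOp("add", Var("a"), Num(2)),
             BinOp("add", Var("a"), Num(2)))

lines, result = lower_expr(tree)
print("\n".join(lines))
```

The nesting of the tree disappears: what remains is a sequence of simple operations connected only by the names they define and use, which is the shape optimisation passes tend to work on.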

Now, it happens to be the case that writing compiler optimisations is quite a specialist task and is often orthogonal to language feature design. However, having a compiled language run fast is a fairly general requirement. Also, the conversion from LLVM IR to ASM is fairly mechanical and not generally interesting to language designers either.

Therefore, lowering a language to LLVM IR gives a language designer a lot of "free stuff" that is very useful in practice, leaving them to concentrate on the language itself.

Would a different IR be useful (OK, not asked but sort of implied)?

Absolutely! ASTs are quite good for certain transformations on the program structure but are very hard to use if you want to transform program flow. An SSA form is generally better. However, LLVM IR is very low level, so a lot of the high-level structure is lost (on purpose, so that it's more generally applicable). Having an IR between the AST and the low-level IR can be beneficial here. Both Rust (MIR) and Swift (SIL) take this approach and have a high-level IR between the two.
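As a rough illustration of why a flat single-assignment form is convenient to optimise, here is a sketch of common-subexpression elimination over straight-line code, written in Python. The tuple encoding `(dest, op, lhs, rhs)` and the function name are assumptions for the example; this is not how LLVM's own passes are implemented, and it ignores control flow and side effects entirely.

```python
def eliminate_common_subexpressions(instrs):
    """instrs: list of (dest, op, lhs, rhs) tuples where each dest is assigned once."""
    seen = {}     # (op, lhs, rhs) -> name that already holds this value
    rename = {}   # removed name -> surviving name that replaces it
    out = []
    for dest, op, lhs, rhs in instrs:
        # Rewrite operands that referred to an instruction we removed earlier.
        lhs = rename.get(lhs, lhs)
        rhs = rename.get(rhs, rhs)
        key = (op, lhs, rhs)
        if key in seen:
            # Same operation on the same operands: reuse the earlier result.
            rename[dest] = seen[key]
        else:
            seen[key] = dest
            out.append((dest, op, lhs, rhs))
    return out

# Flat form of (a + 2) * (a + 2)
program = [
    ("%t0", "add", "%a", "2"),
    ("%t1", "add", "%a", "2"),
    ("%t2", "mul", "%t0", "%t1"),
]

optimised = eliminate_common_subexpressions(program)
for dest, op, lhs, rhs in optimised:
    print(f"{dest} = {op} {lhs}, {rhs}")
```

Because no name is ever reassigned, two instructions with the same operation and the same operand names must compute the same value, so a single dictionary lookup is enough. Doing the same on a tree means comparing whole subtrees and reasoning about whether any variable changed in between.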
