Why does LLVM have an assembly-like IR rather than a tree-like IR? Or: why do projects target LLVM IR instead of clang’s AST

clang compiler llvm programming-languages

Why is LLVM's intermediate representation (LLVM IR) assembly-like rather than tree-like?

Alternatively, why do language implementations target LLVM IR rather than clang's AST?

I'm not trying to ask two different questions at once if it seems that way. To me, it simply seems like both client and library programmers have come to the consensus that LLVM's API, nothing more and nothing less, is obviously good software design and my question is "why?".

The reason I ask is that it seems like LLVM could provide more functionality to frontends if its IR were AST-like, because then clang's AST-based tools could be used for any frontend. Alternatively, languages that target LLVM IR could get more functionality if they targeted clang's AST.

Clang has classes and functions for creating and working with ASTs and it's the only frontend project that's strongly tied to the LLVM project so why is clang's AST-functionality external to LLVM?

Off the top of my head, I know that Rust (rustc), D (ldc), and Haskell (GHC) can all use LLVM as a backend but they don't use the Clang AST (as far as I know, I could be wrong). I don't know all the internal details of these compilers but at least Rust and D certainly seem like they could be compiled to clang's AST. Maybe Haskell could too, but I'm much less certain about that.

Is this because of historical reasons (LLVM originally being a "low-level virtual machine" and clang coming along later)? Is this because other frontends want to have as much control as possible over what they feed to LLVM? Are there fundamental reasons that clang's AST is inappropriate for "non-C-like" languages?

I don't intend this question to be an exercise in mindreading. I just want it to be helpful to those of us who are curious about, but not already fluent in, compiler design. Since the LLVM and clang projects are developed in public, I'm hoping that someone familiar with the development of these projects can answer, or that the answer is obvious enough to some compiler nerds that they feel confident enough to answer.

To pre-empt some obvious but unsatisfactory answers:

Yes, having an assembly-like IR gives more control to whoever crafts the IR (perhaps X lang has a better codebase and AST format than clang) but if that's the only answer, then the question becomes "why does LLVM only have an assembly-like IR instead of a high level tree-like IR and a low-level assembly-like IR?".

Yes, it's not that hard to parse a programming language into an AST (at least compared to the other steps of compiling). Even so, why use separate ASTs? If nothing else, using the same AST allows you to use tools that operate on ASTs (even just simple things like AST printers).

Yes, I strongly agree that being more modular is a good thing, but if that's the only reason, then why do other language implementations tend to target LLVM IR instead of clang's AST?

These pre-emptions might be erroneous or overlook details, so do feel free to give these answers if you have more details or my assumptions are mistaken.

For anyone wanting to answer a more definitively answerable question: what are the advantages and disadvantages of an assembly-like IR vs a tree-like IR?

Best Answer

There are a number of inter-related questions here; I'll try to separate them as best I can.

Why do other languages build on LLVM IR and not clang AST?

This is simply because clang is a C/C++ front end and the AST it produces is tightly coupled to C/C++. Another language could use it but it would need near identical semantics to some subset of C/C++ which is very limiting. As you point out, parsing to an AST is fairly straightforward so restricting your semantic choices is unlikely to be worth the small saving.

However, if you're writing tooling for C/C++ e.g. static analysers, then re-using the AST makes a lot of sense as it's a lot easier to work with the AST than the raw text iff you're working with C/C++.

Why is LLVM IR the form it is?

LLVM IR was chosen as an appropriate form in which to write compiler optimisations. As such, its primary feature is that it's in SSA form. It's quite a low-level IR so that it is applicable to a wide range of languages, e.g. it doesn't type memory, as this varies a lot across languages.

Now, it happens to be the case that writing compiler optimisations is quite a specialist task and is often orthogonal to language feature design. However, having a compiled language run fast is a fairly general requirement. Also, the conversion from LLVM IR to ASM is fairly mechanical and not generally interesting to language designers either.

Therefore, lowering a language to LLVM IR gives a language designer a lot of "free stuff" that is very useful in practice leaving them to concentrate on the language itself.

Would a different IR be useful (OK, not asked but sort of implied)?

Absolutely! ASTs are quite good for certain transformations on the program structure but are very hard to use if you want to transform program flow. An SSA form is generally better. However, LLVM IR is very low level so a lot of the high level structure is lost (on purpose so it's more generally applicable). Having an IR between the AST and the low level IR can be beneficial here. Both Rust and Swift take this approach and have a high level IR between the two.
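To make the tree-versus-SSA contrast concrete, here is a minimal sketch of lowering a tiny expression tree into a flat list of single-assignment instructions. The `Expr` type, the `leaf`/`binary`/`lower` helpers and the `%tN` naming are all made up for illustration; the output only imitates the flavour of LLVM IR (real LLVM IR carries types, and this is not how clang lowers its AST).

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tree-like IR: a node is either a leaf (variable or literal)
// or a binary operation with two children.
struct Expr {
    char op = 0;                      // '+' or '*'; 0 marks a leaf
    std::string name;                 // leaf text when op == 0
    std::unique_ptr<Expr> lhs, rhs;   // children when op != 0
};

std::unique_ptr<Expr> leaf(std::string name) {
    auto e = std::make_unique<Expr>();
    e->name = std::move(name);
    return e;
}

std::unique_ptr<Expr> binary(char op, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
    assert(op == '+' || op == '*');   // only two operators in this sketch
    auto e = std::make_unique<Expr>();
    e->op = op;
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}

// Post-order walk: emit one instruction per operation node. Every result
// gets a fresh name that is assigned exactly once, which is the essence of
// SSA for straight-line code. Returns the name holding the node's value.
std::string lower(const Expr& e, std::vector<std::string>& out) {
    if (e.op == 0) return e.name;
    std::string a = lower(*e.lhs, out);
    std::string b = lower(*e.rhs, out);
    std::string dst = "%t" + std::to_string(out.size());
    out.push_back(dst + " = " + (e.op == '+' ? "add " : "mul ") + a + ", " + b);
    return dst;
}
```

Lowering `a + b * c` this way yields `%t0 = mul b, c` followed by `%t1 = add a, %t0`. The nesting of the tree is gone, but every intermediate value now has a name and a single definition, which is what makes flow-oriented optimisations easier to write against this form.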

Related Solutions

Reason to use mingw win32 headers and libs with LLVM/Clang

One reason for the MingW header files to exist is certainly copyright: while the SDK headers are "free as in beer", you are not allowed to redistribute them. Neither could the LLVM authors redistribute them; instead, every LLVM/Clang user would have to download them on their own. So it would be reasonable for Clang to provide a set of header files, and it would then be reasonable to use the MingW code as a starting point.

In addition, another reason for the MingW header files to exist (IIRC) is that the SDK headers use C extensions that were not supported by gcc. They may be supported now, so that reason may have gone.

Clang warning flags for Objective-C development

For context, I'm a Clang developer working at Google. At Google, we've rolled Clang's diagnostics out to (essentially) all of our C++ developers, and we treat Clang's warnings as errors as well. As both a Clang developer and one of the larger users of Clang's diagnostics I'll try to shed some light on these flags and how they can be used. Note that everything I'm describing is generically applicable to Clang, and not specific to C, C++, or Objective-C.

TL;DR Version: Please use -Wall and -Werror at a minimum on any new code you are developing. We (the compiler developers) add warnings here for good reasons: they find bugs. If you find a warning that catches bugs for you, turn it on as well. Try -Wextra for a bunch of good candidates here. If one of them is too noisy for you to use profitably, file a bug. If you write code that contains an "obvious" bug but the compiler didn't warn about it, file a bug.

Now for the long version. First some background on warning flag groupings. There are a lot of "groupings" of warnings in Clang (and to a limited extent in GCC). Some that are relevant to this discussion:

On-by-default: These warnings are always on unless you explicitly disable them.
-Wall: These are warnings that the developers have high confidence in both their value and a low false-positive rate.
-Wextra: These are warnings that are believed to be valuable and sound (i.e., they aren't buggy), but they may have high false-positive rates or common philosophical objections.
-Weverything: This is an insane group that literally enables every warning in Clang. Don't use this on your code. It is intended strictly for Clang developers or for exploring what warnings exist.

There are two primary criteria mentioned above which guide where warnings go in Clang, and let's clarify what these really mean. The first is the potential value of a particular occurrence of the warning. This is the expected benefit to the user (developer) when the warning fires and correctly identifies an issue with the code.

The second criterion is the idea of false-positive reports. These are situations where the warning fires on code, but the potential problem being cited does not in fact occur due to the context or some other constraint of the program. The code warned about is actually behaving correctly. These are especially bad when the warning was never intended to fire on that code pattern. Instead, it is a deficiency in the warning's implementation that causes it to fire there.

For Clang warnings, the value is required to be in terms of correctness, not in terms of style, taste, or coding conventions. This limits the set of warnings available, precluding oft-requested warnings such as warning whenever {}s are not used around the body of an if statement. Clang is also very intolerant of false-positives. Unlike most other compilers it will use an incredible variety of information sources to prune false positives including the exact spelling of the construct, presence or absence of extra '()', casts, or even preprocessor macros!
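As a small illustration of the "extra '()'" point, consider an assignment used as a loop condition. Clang's -Wparentheses diagnostic treats a doubled pair of parentheses as a signal that the assignment is intentional. The function below is a made-up example, not code from Clang or its test suite.

```cpp
#include <cassert>

// Counts the leading non-zero entries of a zero-terminated array.
int count_until_zero(const int* p) {
    assert(p != nullptr);
    int n = 0;
    int v = 0;
    // Written as `while (v = *p++)`, this would draw a warning asking whether
    // `==` was intended. The extra parentheses mark the assignment as
    // deliberate, so the diagnostic stays quiet.
    while ((v = *p++))
        ++n;
    return n;
}
```

Calling it on `{3, 1, 4, 0, 5}` stops at the zero and returns 3. The spelling of the construct, not any semantic difference, is what decides whether the warning fires.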

Now let's take some real-world example warnings from Clang, and look at how they are categorized. First, a default-on warning:

% nl x.cc
     1  class C { const int x; };

% clang -fsyntax-only x.cc
x.cc:1:7: warning: class 'C' does not declare any constructor to initialize its non-modifiable members
class C { const int x; };
      ^
x.cc:1:21: note: const member 'x' will never be initialized
class C { const int x; };
                    ^
1 warning generated.

Here no flag was required to get this warning. The rationale is that this code is never really correct, giving the warning high value, and the warning only fires on code that Clang can prove falls into this bucket, giving it a zero false-positive rate.

% nl x2.cc
     1  int f(int x_) {
     2    int x = x;
     3    return x;
     4  }

% clang -fsyntax-only -Wall x2.cc
x2.cc:2:11: warning: variable 'x' is uninitialized when used within its own initialization [-Wuninitialized]
  int x = x;
      ~   ^
1 warning generated.

Clang requires the -Wall flag for this warning. The reason is that there is a non-trivial amount of code out there which has used (for good or ill) the code pattern we are warning about to intentionally produce an uninitialized value. Philosophically, I see no point in this, but many others disagree and the reality of this difference in opinion is what drives the warning under the -Wall flag. It still has very high value and a very low false-positive rate, but on some codebases it is a non-starter.

% nl x3.cc
     1  void g(int x);
     2  void f(int arr[], unsigned int size) {
     3    for (int i = 0; i < size; ++i)
     4      g(arr[i]);
     5  }

% clang -fsyntax-only -Wextra x3.cc
x3.cc:3:21: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (int i = 0; i < size; ++i)
                  ~ ^ ~~~~
1 warning generated.

This warning requires the -Wextra flag. The reason is that there are very large codebases where mis-matched sign on comparisons is extremely common. While this warning does find some bugs, the probability of the code being a bug when the user writes it is fairly low on average. The result is an extremely high false-positive rate. However, when there is a bug in a program due to the strange promotion rules, it is often extremely subtle, so the warning has relatively high value when it does flag a bug. As a consequence, Clang provides it and exposes it under a flag.
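To show what the "strange promotion rules" can do, here is a minimal sketch. The function names are invented for this example. `less_mixed` spells out the conversion that the compiler applies implicitly when an `int` is compared with an `unsigned int`, `less_safe` is one possible sign-aware alternative, and `sum` is a variant of the loop from x3.cc with an index type that matches `size`.

```cpp
#include <cassert>

// What `a < b` means for (int, unsigned int): `a` is converted to unsigned
// first, so a negative `a` wraps around to a very large value.
bool less_mixed(int a, unsigned int b) {
    return static_cast<unsigned int>(a) < b;
}

// Sign-aware comparison: any negative value is less than any unsigned value;
// otherwise the conversion is value-preserving and the comparison is safe.
bool less_safe(int a, unsigned int b) {
    return a < 0 || static_cast<unsigned int>(a) < b;
}

// Variant of the x3.cc loop: using an unsigned index means both sides of
// `i < size` have the same signedness, so there is nothing to warn about.
int sum(const int arr[], unsigned int size) {
    assert(arr != nullptr || size == 0);
    int total = 0;
    for (unsigned int i = 0; i < size; ++i)
        total += arr[i];
    return total;
}
```

With these definitions, `less_mixed(-1, 1u)` is false while `less_safe(-1, 1u)` is true, which is exactly the kind of subtle bug the warning occasionally catches. Whether changing the index type or adding a sign check is the better fix depends on the surrounding code.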

Typically, warnings don't live long outside of the -Wextra flag. Clang tries very hard to not implement warnings which do not see regular use and testing. The additional warnings turned on by -Weverything are usually warnings under active development or with active bugs. Either they will be fixed and placed under appropriate flags, or they should be removed.

Now that we have an understanding of how these things work with Clang, let's try to get back to the original question: what warnings should you turn on for your development? The answer is, unfortunately, that it depends. Consider the following questions to help determine what warnings work best for your situation.

Do you have control over all of your code, or is some of it external?
What are your goals? Catching bugs, or writing better code?
What is your false-positive tolerance? Are you willing to write extra code to silence warnings on a regular basis?

First and foremost, if you don't control the code, don't try turning extra warnings on there. Be prepared to turn some off. There is a lot of bad code in the world, and you may not be able to fix all of it. That is OK. Work to find a way to focus your efforts on the code you control.
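One way to turn a warning off for just the code you don't control is Clang's diagnostic pragmas. The sketch below is only an illustration: `vendor_in_range` stands in for a function from a hypothetical third-party header, and `checked_at` stands in for your own code. The GCC branch is there only so the same file builds cleanly with either compiler.

```cpp
#include <cassert>

// Save the current diagnostic state and silence one warning for the
// "vendored" region only.
#if defined(__clang__)
#  pragma clang diagnostic push
#  pragma clang diagnostic ignored "-Wsign-compare"
#elif defined(__GNUC__)
#  pragma GCC diagnostic push
#  pragma GCC diagnostic ignored "-Wsign-compare"
#endif

// Hypothetical third-party function that we are not allowed to edit.
inline bool vendor_in_range(int i, unsigned int n) {
    return i >= 0 && i < n;   // mixed-sign comparison, harmless after the i >= 0 check
}

// Restore the previous state: everything below is checked as usual.
#if defined(__clang__)
#  pragma clang diagnostic pop
#elif defined(__GNUC__)
#  pragma GCC diagnostic pop
#endif

// Our own code, compiled with the full set of warnings we chose.
inline int checked_at(const int* data, unsigned int n, int i) {
    assert(vendor_in_range(i, n));
    return data[i];
}
```

In practice the push/pop pair would usually wrap the `#include` of the external header rather than an inline definition. Passing the external include directory with `-isystem` instead of `-I` is another common route, since warnings in system headers are suppressed.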

Next, figure out what you want out of your warnings. This is different for different people. Clang will try to warn without any options on egregious bugs, or code patterns for which we have long historical precedent indicating the bug rate is extremely high. By enabling -Wall you're going to get a much more aggressive set of warnings targeted at catching the most common mistakes that Clang developers have observed in C++ code. But with both of these the false-positive rate should remain quite low.

Finally, if you're perfectly willing to silence false-positives at every turn, go for -Wextra. File bugs if you notice warnings which are catching a lot of real bugs, but which have silly or pointless false positives. We're constantly working to find ways to bring more and more of the bug-finding logic present in -Wextra into -Wall where we can avoid the false-positives.

Many will find that none of these options is just-right for them. At Google, we've turned some warnings in -Wall off due to a lot of existing code that violated the warning. We've also turned some warnings on explicitly, even though they aren't enabled by -Wall, because they have a particularly high value to us. Your mileage will vary, but will likely vary in similar ways. It can often be much better to enable a few key warnings rather than all of -Wextra.

I would encourage everyone to turn on -Wall for any non-legacy code. For new code, the warnings here are almost always valuable, and really make the experience of developing code better. Conversely, I would encourage everyone to not enable flags beyond -Wextra. If you find a Clang warning that -Wextra doesn't include but which proves at all valuable to you, simply file a bug and we can likely put it under -Wextra. Whether you explicitly enable some subset of the warnings in -Wextra will depend heavily on your code, your coding style, and whether maintaining that list is easier than fixing everything uncovered by -Wextra.

Of the OP's list of warnings (which included both -Wall and -Wextra) only the following warnings are not covered by those two groups (or turned on by default). The first group emphasizes why over-reliance on explicit warning flags can be bad: none of these are even implemented in Clang! They're accepted on the command line only for GCC compatibility.

-Wbad-function-cast
-Wdeclaration-after-statement
-Wmissing-format-attribute
-Wmissing-noreturn
-Wnested-externs
-Wnewline-eof
-Wold-style-definition
-Wredundant-decls
-Wsequence-point
-Wstrict-prototypes
-Wswitch-default

The next bucket of unnecessary warnings in the original list are ones which are redundant with others in that list:

-Wformat-nonliteral -- Subset of -Wformat=2
-Wshorten-64-to-32 -- Subset of -Wconversion
-Wsign-conversion -- Subset of -Wconversion

There is also a selection of warnings which are more categorically different. These deal with language dialect variants rather than with buggy or non-buggy code. With the exception of -Wwrite-strings, these all are warnings for language extensions provided by Clang. Whether Clang warns about their use depends on the prevalence of the extension. Clang aims for GCC compatibility, and so in many cases it eases that with implicit language extensions that are in wide use. -Wwrite-strings, as commented on the OP, is a compatibility flag from GCC that actually changes the program semantics. I deeply regret this flag, but we have to support it due to the legacy it has now.

-Wfour-char-constants
-Wpointer-arith
-Wwrite-strings

The remaining options which are actually enabling potentially interesting warnings are these:

-Wcast-align
-Wconversion
-Wfloat-equal
-Wformat=2
-Wimplicit-atomic-properties
-Wmissing-declarations
-Wmissing-prototypes
-Woverlength-strings
-Wshadow
-Wstrict-selector-match
-Wundeclared-selector
-Wunreachable-code

The reason that these aren't in -Wall or -Wextra isn't always clear. Many of these are actually based on GCC warnings (-Wconversion, -Wshadow, etc.) and as such Clang tries to mimic GCC's behavior. We're slowly breaking some of these down into more fine-grained and useful warnings. Those then have a higher probability of making it into one of the top-level warning groups. That said, to pick on one warning, -Wconversion is so broad that it will likely remain its own "top level" category for the foreseeable future. Some other warnings which GCC has but which have low value and high false-positive rates may be relegated to a similar no-man's-land.

Other reasons why these aren't in one of the larger buckets include simple bugs, very significant false-positive problems, and in-development warnings. I'm going to look into filing bugs for the ones I can identify. They should all eventually migrate into a proper large bucket flag or be removed from Clang.

I hope this clarifies the warning situation with Clang and provides some insight for those trying to pick a set of warnings for their use, or their company's use.
