C – Why Are There So Few C Compilers?


C is one of the most widely used languages in the world. It accounts for a huge proportion of existing code and continues to be used for a vast amount of new code. It's beloved by its users, it's so widely ported that being able to run C is, to many, the informal definition of a platform, and it's praised by its fans for being a "small" language with a relatively clean set of features.

So where are all the compilers?

On the desktop, there are (realistically) two: GCC and Clang. Thinking about it for a few seconds you'll probably remember Intel exists as well. There are a handful of others, far too obscure for the average person to name and almost universally not bothering to support a recent language version (or often even a well-defined language subset, just "a subset"). Half of the members of this list are historical footnotes; most of the rest are very specialized and still don't actually implement the full language. Very few actually seem to be open-source.

Scheme and Forth – other small languages that are beloved by their fans for it – probably have more compilers than actual users. Even something like SML has more "serious" implementations to choose between than C. Whereas the announcement of a new (unfinished) C compiler aiming at verification actually sees some pretty negative responses, and veteran implementations struggle to get enough contributors to even catch up to C99.

Why? Is implementing C so hard? It isn't C++. Do users simply have a very skewed idea about what complexity group it falls in (i.e. that it actually is closer to C++ than Scheme)?

Best Answer

Today, a realistic C compiler has to be an optimizing compiler. C is no longer a language close to the hardware, because current processors are incredibly complex (out-of-order, pipelined, superscalar, with complex caches and TLBs), so the compiler has to take care of instruction scheduling and similar concerns. Today's x86 processors are not like the i386 processors of the previous century, even if both are able to run the same machine code. See the paper "C Is Not a Low-level Language (Your computer is not a fast PDP-11)" by David Chisnall.

Few people are using naive non-optimizing C compilers like tinycc or nwcc, since they produce code which is several times slower than what optimizing compilers can give.

Writing an optimizing compiler is difficult. Notice that both GCC and Clang optimize a source-language-neutral code representation (GIMPLE for GCC, LLVM IR for Clang). The complexity of a good C compiler is not in the parsing phase!
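To see those intermediate representations for yourself, here is a minimal sketch. The function name `sum_below` and the file name `sum_below.c` are my own illustrative choices, not something from either compiler. The flags in the comment exist in GCC and Clang, but the exact shape of the dumps varies by version, so treat what you see as indicative only.

```c
/* sum_below.c: a tiny function for inspecting compiler IR dumps.
 *
 * Commands to try:
 *   gcc   -O2 -c -fdump-tree-optimized sum_below.c
 *       writes a GIMPLE dump file alongside the object file
 *   clang -O2 -S -emit-llvm sum_below.c -o sum_below.ll
 *       writes textual LLVM IR to sum_below.ll
 */

/* Sum of the integers 0 .. n-1. Unsigned arithmetic wraps, so this is
 * well defined for every n. */
unsigned sum_below(unsigned n)
{
    unsigned total = 0;
    for (unsigned i = 0; i < n; i++)
        total += i;
    return total;
}
```

Comparing the dump at -O0 with the one at -O2 shows how much work happens after parsing is finished. An optimizer may even replace the loop with straight-line arithmetic, though whether a given compiler version does so is not guaranteed.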

In particular, making a C++ compiler is not much harder than making a C compiler. Parsing C++ and transforming it into some internal code representation is complex (because the C++ specification is complex) but well understood, and the optimization parts are even more complex. Inside GCC, the middle-end optimizations, which are neutral with respect to both source language and target processor, form the majority of the compiler, with the rest split between front-ends for several languages and back-ends for several processors. Hence most optimizing C compilers are also able to compile other languages, like C++, Fortran, D, ... The C++-specific parts of GCC are about 20% of the compiler.

Also, C (or C++) is so widely used that people expect their code to compile even when it does not exactly follow the official standards, and those standards do not define the semantics of the language precisely enough, so each compiler may have its own interpretation. Look also into CompCert, a formally verified C compiler, and the Frama-C static analyzer, both of which care about a more formal semantics of C.
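One well-known instance of this gap, as I understand it, is signed integer overflow, which the C standard leaves undefined. The sketch below uses two hypothetical function names of my own. The first relies on overflow wrapping around, which many programmers expect but the standard does not promise, so an optimizing compiler is allowed to simplify it to "always false". The second asks the same question without overflowing.

```c
#include <limits.h>
#include <stdbool.h>

/* Non-portable: when x == INT_MAX, x + 1 overflows, which is undefined
 * behavior. A compiler may assume that never happens and compile this
 * function as "return false". */
bool next_would_overflow_naive(int x)
{
    return x + 1 < x;
}

/* Portable: no overflow ever occurs, so every conforming compiler must
 * give the same answer. */
bool next_would_overflow(int x)
{
    return x == INT_MAX;
}
```

Code like the first function exists in real programs, and users expect it to keep working, which is part of what a new compiler has to cope with.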

And optimizations are a long-tail phenomenon: implementing a few simple optimizations is easy, but they won't make a compiler competitive! You need to implement a lot of different optimizations, and to organize and combine them cleverly, to get a real-world compiler that is competitive. In other words, a real-world optimizing compiler has to be a complex piece of software. BTW, both GCC and Clang/LLVM have several internal specialized C/C++ code generators. And both are huge beasts (several million lines of source code, growing by several percent each year) with a large developer community (a few hundred people, working mostly full-time, or at least half-time).
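To make the point about combining passes concrete, here is a toy sketch. Everything in it is hypothetical: the three-address IR, the `struct instr` layout, and the function names are invented for illustration and look nothing like GIMPLE or LLVM IR. I assume instruction i defines virtual register i, operands only refer to earlier instructions, and the last instruction is the result.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy IR (hypothetical). Instruction i defines virtual register i. */
enum opcode { OP_PARAM, OP_CONST, OP_ADD, OP_MUL };

struct instr {
    enum opcode op;
    int a; /* OP_CONST: the literal value; OP_ADD/OP_MUL: first operand register */
    int b; /* OP_ADD/OP_MUL: second operand register; unused otherwise */
};

/* Pass 1: constant folding. Rewrites ADD/MUL whose operands are both
 * constants into a single OP_CONST. Returns how many were rewritten.
 * (A real compiler must also respect the language's overflow rules;
 * this sketch ignores them.) */
int fold_constants(struct instr *code, size_t n)
{
    int changed = 0;
    for (size_t i = 0; i < n; i++) {
        struct instr *in = &code[i];
        if (in->op != OP_ADD && in->op != OP_MUL)
            continue;
        const struct instr *x = &code[in->a];
        const struct instr *y = &code[in->b];
        if (x->op != OP_CONST || y->op != OP_CONST)
            continue;
        int value = (in->op == OP_ADD) ? x->a + y->a : x->a * y->a;
        in->op = OP_CONST;
        in->a = value;
        in->b = 0;
        changed++;
    }
    return changed;
}

/* Pass 2: liveness for dead-code elimination. Marks every instruction
 * the final result depends on and returns how many are live. One
 * backward sweep suffices because operands always point backwards. */
size_t count_live(const struct instr *code, size_t n, bool *live)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        live[i] = false;
    if (n == 0)
        return 0;
    live[n - 1] = true;
    for (size_t i = n; i-- > 0; ) {
        if (!live[i])
            continue;
        count++;
        if (code[i].op == OP_ADD || code[i].op == OP_MUL) {
            live[code[i].a] = true;
            live[code[i].b] = true;
        }
    }
    return count;
}
```

For the program "param; 2; 3; r1 * r2; r0 + r3", running the liveness pass first finds nothing dead. Running the folding pass first turns the multiplication into a constant, and only then do the two original constants become dead. Two trivial passes already have an ordering problem; a production compiler has to schedule hundreds of them.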

Notice that there is no (to the best of my knowledge) multi-threaded C compiler, even if some parts of a compiler could be run in parallel (e.g. intra-procedural optimization, register allocation, instruction scheduling, ...). And a parallel build with make -j is not always enough (especially with LTO).

Also, it is difficult to get funding for writing a C compiler from scratch, and such an effort needs to last several years. Finally, most C or C++ compilers are free software today (there is no longer a market for new proprietary compilers sold by startups) or at least are monopolistic commodities (like Microsoft Visual C++), and being free software is nearly a requirement for a compiler (because compilers need contributions from many different organizations).

I'd be delighted to get funding to work on a C compiler from scratch as free software, but I am not naive enough to believe that is possible today!
