C Compiler – How a Compiler Works Without Directly Compiling to Machine Code


I know the compilation process goes with this flow:

source -> parse -> AST -> intermediate code -> assembly -> machine code

and in the case of Java you will have bytecode which is translated by the JVM.

In languages such as C/C++ and Go, we parse the source code and output machine code directly into a compiled file, easy enough;

But what about other languages, such as Java, JavaScript, Python, etc.?

Let's say I want to construct my own simple language and write its just-in-time compiler in C. My language is going to be similar to JavaScript; here is a code snippet:

my language:

var a, b = 5;
print(a+b)

My compiler (which is written in C), after it has constructed an AST, will try to add these two tree nodes together:

int compilerDoAdd(node a, node b) {
    return a + b;
}

Here is my question:

Is the above method the normal way to create a compiler? I'm not creating a heap, I'm not creating a stack, I'm not creating assembly, and I'm not creating machine code or instructions directly anywhere; I'm just letting the C language do all that work for me. My compiler will act just as a parser.

Is this how other languages like Python and JavaScript work? Or do they create their own machine instructions (and basically redo whatever C is designed to do)?

Edit: Is what I just explained how an interpreter works?
I have read in many places that an interpreter steps through the source code line by line, figuring out what it's doing, and does it. But it never explains how the "does it" part works: does it turn into machine code, or does it work the way I explained it?

Best Answer

I know the compilation process goes with this flow:

source -> parse -> AST -> intermediate code -> assembly -> machine code

and in the case of Java you will have bytecode which is translated by the JVM.

This is not true, for many, many reasons.

  • Not all compilers produce "machine code" (except for the trivial interpretation that all code in some language is machine code for an abstract machine induced by that language). For example, lots of compilers produce ECMAScript source code: CoffeeScript, TypeScript, PureScript, Elm, Opal, GWT, Emscripten, Babel, ClojureScript, Scala.js, Kotlin, Clue.
  • Not all compilers use assembly as an intermediate language.
  • Not all compilers use an intermediate language at all; they may generate code directly from the AST, for example.
  • Compilers may have multiple intermediate languages. E.g. in GCC, there are GENERIC, GIMPLE, and RTL, and frontends may have their own intermediate languages as well.
  • Many compilers have a tokenization step (aka lexical analysis, lexing, scanning) before the parsing step; a minimal lexer sketch in C follows this list.
  • Many compilers will first produce a parse tree (aka Concrete Syntax Tree / CST) and then transform that to an Abstract Syntax Tree / AST.
  • You are completely missing any form of semantic analysis, e.g. type checking, type inference, name resolution / binding, definite assignment.
  • Most compilers perform optimizations. In fact, the entirety of what is usually called the mid-end is missing from your description.
  • You are not accounting for things like macros, compile-time metaprogramming, compile-time reflection, etc., which may require embedding an interpreter (or another compiler) inside the compiler.
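
To make the tokenization point from the list above concrete: here is a minimal, hypothetical lexer sketch in C for the var a, b = 5; line from the question. The token names and the next_token function are invented for illustration; a real lexer handles many more token kinds, error reporting, and so on.

#include <ctype.h>
#include <string.h>

/* Hypothetical token kinds for the toy language in the question. */
typedef enum { TOK_VAR, TOK_IDENT, TOK_NUMBER, TOK_COMMA,
               TOK_ASSIGN, TOK_SEMI, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;
    const char *start;   /* points into the source text */
    int length;
} Token;

/* Return the next token starting at *src and advance *src past it. */
Token next_token(const char **src) {
    const char *p = *src;
    while (isspace((unsigned char)*p)) p++;

    Token t = { TOK_EOF, p, 0 };
    if (*p == '\0') { *src = p; return t; }

    if (isalpha((unsigned char)*p)) {
        const char *start = p;
        while (isalnum((unsigned char)*p)) p++;
        t.start = start;
        t.length = (int)(p - start);
        t.kind = (t.length == 3 && strncmp(start, "var", 3) == 0) ? TOK_VAR
                                                                  : TOK_IDENT;
    } else if (isdigit((unsigned char)*p)) {
        const char *start = p;
        while (isdigit((unsigned char)*p)) p++;
        t.start = start;
        t.length = (int)(p - start);
        t.kind = TOK_NUMBER;
    } else {
        t.length = 1;
        switch (*p) {
        case ',': t.kind = TOK_COMMA;  break;
        case '=': t.kind = TOK_ASSIGN; break;
        case ';': t.kind = TOK_SEMI;   break;
        default:  break;  /* a real lexer would report an error here */
        }
        p++;
    }
    *src = p;
    return t;
}

A parser would then consume this token stream to build the parse tree or AST mentioned above.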

In languages such as C/C++ and Go, we parse the source code and output machine code directly into a compiled file, easy enough;

This is also not true. There is nothing in the language specification of C, C++, or Go that says that it has to be implemented by a compiler that produces machine code. In fact, there is nothing in those specifications that says the language has to be implemented by a compiler at all.

For example, CINT is an interpreter for C. Clue is a compiler for C that outputs Lua, JavaScript, Perl, Common Lisp, Java, or C. GCC, one of the most widely used C compilers, does not output machine code either; it outputs assembly to be compiled by an assembler. You just don't see it because, for performance reasons, modern versions don't leave an assembly file lying around for you to look at; the driver hands the assembly straight to the assembler (via a temporary file or a pipe).

But what about other languages, such as Java, JavaScript, Python, etc.?

Still the same. A language is simply a set of abstract mathematical / logical rules and restrictions. That's it. A language is a piece of paper, written in English.

How you implement this piece of paper is completely your choice. You could write an interpreter. You could write a compiler that outputs ECMAScript. You could write a compiler that outputs C. You could write a compiler that outputs JVM bytecode. You could write a compiler that outputs bytecode for a specialized VM for that language.

The language says nothing about how it is going to be implemented. That is purely a choice for the implementor. You can implement C with an interpreter if you want, and indeed, it has been done. You can implement ECMAScript with a compiler that compiles to native machine code if you want, and indeed, it has been done.
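
As a concrete illustration of the "compiler that outputs C" option: given an AST node for the a + b expression from the question, a code generator does not execute anything; it just writes out an equivalent program in the target language. The Node type and emit_expr function below are invented for this sketch.

#include <stdio.h>

/* A deliberately tiny AST: a node is either a variable reference or an addition. */
typedef enum { NODE_VAR, NODE_ADD } NodeKind;

typedef struct Node {
    NodeKind kind;
    const char *name;          /* used when kind == NODE_VAR */
    struct Node *left, *right; /* used when kind == NODE_ADD */
} Node;

/* Compile (not execute!) an expression: write equivalent C source to out. */
void emit_expr(FILE *out, const Node *n) {
    switch (n->kind) {
    case NODE_VAR:
        fprintf(out, "%s", n->name);
        break;
    case NODE_ADD:
        fprintf(out, "(");
        emit_expr(out, n->left);
        fprintf(out, " + ");
        emit_expr(out, n->right);
        fprintf(out, ")");
        break;
    }
}

int main(void) {
    Node a   = { NODE_VAR, "a", NULL, NULL };
    Node b   = { NODE_VAR, "b", NULL, NULL };
    Node add = { NODE_ADD, NULL, &a, &b };

    /* Translate the toy language's print(a+b) into a C statement. */
    printf("printf(\"%%d\\n\", ");
    emit_expr(stdout, &add);
    printf(");\n");
    return 0;
}

Running this prints the text printf("%d\n", (a + b)); which is itself C source. Nothing was evaluated, which is what makes this a (tiny) compiler rather than an interpreter.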

Let's say I want to construct my own simple language and write its just-in-time compiler in C. My language is going to be similar to JavaScript; here is a code snippet:

my language:

var a, b = 5;
print(a+b)

My compiler (which is written in C), after it has constructed an AST, will try to add these two tree nodes together:

int compilerDoAdd(node a, node b) {
    return a + b;
}

Here is my question:

Is the above method the normal way to create a compiler?

It depends on what you mean by "normal way to create a compiler". There are many, many, many different ways of creating a compiler. This could in fact be part of a compiler that produces subroutine-threaded code or indirect-threaded code. If you interpret "normal" as "common in mainstream compilers", then no, subroutine-threaded or indirect-threaded code is not used in mainstream compilers. Threaded code is popular in the Forth community, for example, but I would not consider Forth "mainstream" in the general purpose computing market.

It actually looks much more likely that what you have is a part of an interpreter. But it is impossible to say without knowing how this function is called, and what the rest of the code is doing.

I'm not creating a heap, I'm not creating a stack, I'm not creating assembly, and I'm not creating machine code or instructions directly anywhere; I'm just letting the C language do all that work for me.

Heaps, stacks, assembly, machine code: none of these is a necessary component of a compiler. A compiler is simply a program which takes a program P written in language X as input and produces a semantically equivalent program P′ written in language Y as output, where "semantically equivalent" means that running P on an interpreter for X should have the exact same result as running P′ on an interpreter for Y.

My compiler will act just as a parser.

Well, then it is not a compiler, nor is it an interpreter. It is a parser.

Is this how other languages like Python and JavaScript work? Or do they create their own machine instructions (and basically redo whatever C is designed to do)?

Again, this has nothing to do with the language. It has to do with how the particular version of the particular implementation of the language works. For example, the Nashorn implementation of ECMAScript works very differently from the V8 implementation of ECMAScript. And the current version of the V8 implementation of ECMAScript works very differently from the original version of the V8 implementation of ECMAScript, both of which are again different from some of the versions that came in between.

Edit: Is what I just explained how an interpreter works? I have read in many places that an interpreter steps through the source code line by line,

Only some very simple interpreters do that. In fact, in languages where a single statement can span multiple lines, it isn't even possible to do this. For example, the following snippet of ECMAScript cannot possibly be executed line-by-line, since neither of the lines makes sense on its own:

console.log(
"Hello");

A very simple interpreter may indeed execute code statement-by-statement, but the majority of interpreters create an AST and interpret that. (So-called AST-walking interpreters.) The original interpreter for Ruby was such an interpreter, for example.
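
A hypothetical AST-walking interpreter for the addition in the question could look like the sketch below (the Expr type and eval function are invented names). The point is that eval computes the result immediately instead of emitting any code.

#include <stdio.h>

/* A minimal AST-walking interpreter (all names here are hypothetical). */
typedef enum { EXPR_NUM, EXPR_ADD } ExprKind;

typedef struct Expr {
    ExprKind kind;
    double number;             /* used when kind == EXPR_NUM */
    struct Expr *left, *right; /* used when kind == EXPR_ADD */
} Expr;

/* Walk the tree and compute the value right now; no code is generated,
   the host language (C) does the actual arithmetic. */
double eval(const Expr *e) {
    switch (e->kind) {
    case EXPR_NUM: return e->number;
    case EXPR_ADD: return eval(e->left) + eval(e->right);
    }
    return 0.0; /* unreachable for well-formed trees */
}

int main(void) {
    Expr two   = { EXPR_NUM, 2, NULL, NULL };
    Expr three = { EXPR_NUM, 3, NULL, NULL };
    Expr sum   = { EXPR_ADD, 0, &two, &three };
    printf("%g\n", eval(&sum));  /* prints 5: the tree was executed, not translated */
    return 0;
}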

Many implementations that are called interpreters are actually combinations of interpreters and compilers. For example, the CPython "Python interpreter" actually compiles Python to CPython bytecode, then interprets that bytecode. So, technically calling it a "Python interpreter" is simply wrong, since it never interprets Python, it interprets a different language. However, calling it a "Python compiler", while technically correct, could also be perceived as misleading.
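
In miniature, the "compile to bytecode, then interpret the bytecode" design looks something like the following sketch. The opcodes and the little stack machine are invented for illustration and bear no relation to CPython's real instruction set.

#include <stdio.h>

/* A toy bytecode set (hypothetical; nothing like CPython's real opcodes). */
enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

static void run(const int *code) {
    double stack[64];
    int sp = 0;                         /* stack pointer */

    for (int pc = 0; ; ) {              /* pc = program counter */
        switch (code[pc++]) {
        case OP_PUSH:
            stack[sp++] = code[pc++];   /* operand follows the opcode */
            break;
        case OP_ADD:
            sp--;
            stack[sp - 1] += stack[sp]; /* pop two values, push their sum */
            break;
        case OP_PRINT:
            printf("%g\n", stack[--sp]);
            break;
        case OP_HALT:
            return;
        }
    }
}

int main(void) {
    /* What a front end might emit for an expression like print(2 + 3). */
    int program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
    run(program);                       /* prints 5 */
    return 0;
}

The front end that produces the program array is a compiler; the run function is an interpreter for the bytecode.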

figuring out what it's doing, and does it. But it never explains how the "does it" part works: does it turn into machine code, or does it work the way I explained it?

If it turns it into machine code, then it is not an interpreter, it is a compiler.

Again: a compiler translates from one language to another. An interpreter executes.

To answer also some of your questions from the comments:

Thanks for the great detailed info. So just to confirm, compilerDoAdd in its current state is an interpreter?

It very much looks like an interpreter, but it could also be part of a compiler for subroutine-threaded or indirect-threaded code. It is impossible to tell without seeing the rest of the system. The main question is: do you call this function? If you call this function, then it is an interpreter. If you emit code that calls this function, then it is a subroutine-threaded compiler.
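
The distinction can be sketched around the question's compilerDoAdd, assuming purely for illustration that node is just an int-valued type:

#include <stdio.h>

/* Assume, purely for illustration, that `node` carries a plain int value. */
typedef int node;

int compilerDoAdd(node a, node b) {
    return a + b;
}

/* Interpreter: the implementation CALLS compilerDoAdd while walking the AST;
   the addition happens right now. */
int interpretAdd(node a, node b) {
    return compilerDoAdd(a, b);
}

/* Subroutine-threaded compiler: the implementation EMITS code that will call
   compilerDoAdd later. Here the "emitted code" is just C source text written
   to a stream; a real JIT would emit machine-level call instructions instead. */
void compileAdd(FILE *out, const char *a, const char *b) {
    fprintf(out, "result = compilerDoAdd(%s, %s);\n", a, b);
}

In the first case the work happens while your implementation walks the tree; in the second case nothing is added until the emitted code is itself run later.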

By saying an interpreter executes things, is this how it's done? You let your parent language do the work?

In some sense this is trivially true. Your interpreter is written in some language, so everything this interpreter does must somehow be performed by that language.

However, I interpret your question to mean whether there is usually a simple, direct, 1:1 mapping between the interpreted language and the host language of the interpreter. The answer is most often: "No". There can only be a simple, direct, 1:1 mapping if the semantics of the two languages are very similar. But if the semantics of the two languages are very similar, then why have two languages? Usually, we create new languages because we want them to be semantically different from existing languages. We want them to be in some sense "better".

The only exceptions to this are languages that are specifically designed to be semantically identical (or close) but with different syntax. Vala and Genie are such examples: both are intended to be semantically identical, with Vala having a syntax inspired by C♯ and Genie having a syntax inspired by Python. (Note, however, that in reality, both Vala and Genie are implemented as compilers to C, not as interpreters.)

Languages like TypeScript are also examples of this. TypeScript is designed to be mostly a semantic and syntactic superset of ECMAScript, so for the parts where TypeScript is semantically identical to ECMAScript, a hypothetical TypeScript interpreter implemented in ECMAScript could just use ECMAScript semantics to execute the code. (Note, however, that in reality, TypeScript is implemented as a compiler, not an interpreter, and it is implemented in TypeScript, not ECMAScript.)

Looking at your example, though, you say that your language is "similar to JavaScript". In that case, what you have written is probably not valid, since you are assuming that adding two values in your language behaves the same as adding two ints in C. I can tell you that this is most definitely not the case for JavaScript, and if your language is similar in that aspect, then it is also not the case for yours!

For example, here are some properties of adding ints in C that are not true for adding values in JavaScript:

  • Addition of ints in C is commutative: a + b == b + a. This is not true in JavaScript, e.g. for a = "Hello"; b = "World";.
  • ints in C have a limited range, but they are always precise. Numbers in JavaScript also have a limited range, but a much larger one than typical ints in C. Most importantly, though, they have variable precision: numbers very close to 0 as well as very large numbers are imprecise. For example, for a very large number a it can happen that a + 1 == a, which is never true for ints in C. (A small check in C follows this list.)
  • On the other hand, in JavaScript it is always true that a + 1 >= a (for any non-NaN number), whereas in C it can happen that a + 1 < a.
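
To see the precision point concretely, here is the small check in C mentioned above; JavaScript numbers are IEEE 754 doubles, which C's double models on common platforms:

#include <stdio.h>

int main(void) {
    /* JavaScript numbers behave like C doubles: at 2^53 the gap between
       adjacent representable values reaches 2, so adding 1 changes nothing. */
    double a = 9007199254740992.0;   /* 2^53 */
    printf("%d\n", a + 1.0 == a);    /* prints 1: a + 1 == a, unlike int addition in C */
    return 0;
}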

So it is highly likely that you will have to do more work to implement the + operator than simply delegating to int addition in C. You will have to check whether a and b are strings, arrays, or numbers; when they are numbers, you will have to check whether they are integers or real numbers; and when they are integers, you will have to check whether they and their result fit into a C int, or whether you need something like a C long or even BigInt math. In general, you will have to implement your own logic for adding things, especially if your language is really like JavaScript, where [] + {} is the string "[object Object]" but {} + [] (evaluated as a statement) is 0.
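
Here is a hypothetical sketch of what that extra work can look like in C: every value carries a runtime type tag, and the + implementation dispatches on it. The names (Value, value_add) are invented, and only numbers and strings are covered; a real JavaScript-like + also has to deal with arrays, objects, and the coercion rules hinted at above.

#include <stdlib.h>
#include <string.h>

/* A tagged value: the runtime type travels with the data (hypothetical layout). */
typedef enum { VAL_NUMBER, VAL_STRING } ValueKind;

typedef struct {
    ValueKind kind;
    double number;   /* used when kind == VAL_NUMBER */
    char *string;    /* used when kind == VAL_STRING (heap-allocated) */
} Value;

/* The toy language's `+`: numeric addition OR string concatenation,
   decided at run time, unlike C's int addition. */
Value value_add(Value a, Value b) {
    Value r = { VAL_NUMBER, 0, NULL };
    if (a.kind == VAL_NUMBER && b.kind == VAL_NUMBER) {
        r.number = a.number + b.number;
    } else if (a.kind == VAL_STRING && b.kind == VAL_STRING) {
        size_t la = strlen(a.string), lb = strlen(b.string);
        r.kind = VAL_STRING;
        r.string = malloc(la + lb + 1);   /* error handling omitted in this sketch */
        memcpy(r.string, a.string, la);
        memcpy(r.string + la, b.string, lb + 1);
    } else {
        /* Mixed kinds: a real JavaScript-like `+` would apply coercion rules
           here (number -> string, object -> primitive, ...); omitted here. */
    }
    return r;
}

Even this toy version already shows why delegating straight to C's int addition is not enough once your language's values are dynamically typed.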
