General Rules for Writing a Compiler in Python

compilerpython

Suppose X is the input language, Z is the output language, then f is the compiler, which is written in language Y.

f = X -> Z

Since f is only a program, I think Y can be any language, right? So we can have compilers f1, f2, each written in Y1, Y2.

f1 = f Y1    
f2 = f Y2

g = Z -> M
h = g . f    # We get a compiler X -> M

Take cpython compiler for example, X is Python, Z is the Python VM code, Y is C.

cpython = Python -> PythonVMCode C
interpreter = PythonVMCode -> Nothing
interpreter2 = PythonVMCode -> MachineCode

Python sources are compiled to the Python VM code, the .pyc files, then interpreted by the interpreter. Looks like it's possible there exists a compiler which can directly do Python -> MachineCode, though much hard to implement:

   hardpython = interpreter2 . cpython 

We can also write another compiler do the work Python -> PythonVMCode, in another language, say Python itself.

mypython = Python -> PythonVMCode Python
mypython2 = Python -> PythonVMCode Ruby

Now, here is the complicated example PyPy. I'm just a newbie of PyPy, correct me if I'm wrong:

PyPy doc http://doc.pypy.org/en/latest/architecture.html#pypy-the-translation-framework

Our goal is to provide a possible solution to the problem of language implementers:
having to write l * o * p interpreters for l dynamic languages and p platforms with o
crucial design decisions.

We can think l is X, p is Y. There exists a program which translates all RPython programs to C:

 rpython_compiler = RPython -> C  Python

 pypy = Python -> Nothing RPython

 translate = compile the program pypy written in RPython using rpython_compiler

 py2rpy = Python -> RPython  Python
 py2c = Python -> C Python 
 py2c = rpython_compiler . py2rpy

RPython programs are just like VM instructions, rpython_compiler is the VM.

q1. pypy is the interpreter, a RPython program which can interpret Python code, there is no output language, so we can't consider it as a compiler, right?

Added:

  • I just found that even if after the translating, pypy is still a interpreter, only this time written in C.
  • If we look deep into the interpreter pypy, I believe there must exist some kind of compiler, which compiles the Python sources to some AST, then execute

like this:

compiler_inside_pypy = Python -> AST_or_so

q2. Can compiler py2rpy exist, transforming all Python programs to RPython? In which language it's written is irrelevant. If yes, we get another compiler py2c. What's the difference between pypy and py2rpy in nature? Is py2rpy much harder to write than pypy?

q3. Is there some general rules or theory available about this?

More compilers:

gcc_c = C -> asm? C  # not sure, gimple or rtl?
g++ =   C++ -> asm? C
clang = C -> LLVM_IR  C++
jython = Python -> JVMCode java
ironpython = Python -> CLI C#

q4. Given f = X -> Z, a program P written in X. When we want to speed up P, what can we do? Possbile ways:

  • rewrite P in more efficient algorithm

  • rewrite f to generate better Z

  • if Z is interpreted, write a better Z interpreter (PyPy is at here?)

  • speed up programs written in Z recursively

  • get a better machine

ps. This question is not about the tech stuffs of how to write a compiler, but the feasibility and complexity of write a certain kind compiler.

Best Answer

q1. pypy is the interpreter, a RPython program which can interpret Python code, there is no output language, so we can't consider it as a compiler, right?

PyPy is similar to CPython, both has a compiler+interpreter. CPython has a compiler written in C that compiles Python to Python VM bytecode then executes the bytecode in an interpreter written in C. PyPy has a compiler written in RPython that compiles Python to Python VM bytecode, then executes it in PyPy Interpreter written in RPython.

q2. Can compiler py2rpy exist, transforming all Python programs to RPython? In which language it's written is irrelevant. If yes, we get another compiler py2c. What's the difference between pypy and py2rpy in nature? Is py2rpy much harder to write than pypy?

Can a compiler py2rpy exists? Theoretically yes. Turing completeness guarantees so.

One method to construct py2rpy is to simply include the source code of a Python interpreter written in RPython in the generated source code. An example of py2rpy compiler, written in Bash:

// suppose that /pypy/source/ contains the source code for pypy (i.e. Python -> Nothing RPython)
cp /pypy/source/ /tmp/py2rpy/pypy/

// suppose $inputfile contains an arbitrary Python source code
cp $inputfile /tmp/py2rpy/prog.py

// generate the main.rpy
echo "import pypy; pypy.execfile('prog.py')" > /tmp/py2rpy/main.rpy

cp /tmp/py2rpy/ $outputdir

now whenever you need to translate a Python code to RPython code, you call this script, which produces -- in the $outputdir -- an RPython main.rpy, the RPython's Python Interpreter source code, and a binary blob prog.py. And then you can execute the generated RPython script by calling rpython main.rpy.

(note: since I'm not familiar with rpython project, the syntax for calling the rpython interpreter, the ability to import pypy and do pypy.execfile, and the .rpy extension is purely made up, but I think you get the point)

q3. Is there some general rules or theory available about this?

Yes, any Turing Complete language can theoretically be translated to any Turing Complete language. Some languages may be much more difficult to translate than other languages, but if the question is "is it possible?", the answer is "yes"

q4. ...

There is no question here.