q1. pypy is the interpreter, a RPython program which can interpret Python code, there is no output language, so we can't consider it as a compiler, right?
PyPy is similar to CPython: both have a compiler plus an interpreter. CPython has a compiler written in C that compiles Python to Python VM bytecode, then executes the bytecode in an interpreter written in C. PyPy has a compiler written in RPython that compiles Python to Python VM bytecode, then executes it in the PyPy interpreter written in RPython.
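The two stages described above can be observed from inside Python itself. This is a minimal sketch using only the built-in compile() and exec() and the standard dis module; the variable names are chosen for illustration, and the exact opcodes differ between Python versions and implementations.

```python
import dis

source = "x = 2 + 3 * 4\n"

# Stage 1: the compiler turns source text into a code object holding bytecode.
code = compile(source, "<example>", "exec")

# Stage 2: the interpreter loop executes that bytecode in a namespace.
namespace = {}
exec(code, namespace)

# The opcode names show what the interpreter actually runs
# (they vary by version, so nothing here depends on specific names).
opnames = [instruction.opname for instruction in dis.get_instructions(code)]
print(namespace["x"], opnames)
```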
q2. Can compiler py2rpy exist, transforming all Python programs to RPython? In which language it's written is irrelevant. If yes, we get another compiler py2c. What's the difference between pypy and py2rpy in nature? Is py2rpy much harder to write than pypy?
Can a compiler py2rpy exist? Theoretically yes. Turing completeness guarantees it.
One method to construct py2rpy is to simply include the source code of a Python interpreter written in RPython in the generated source code. An example of a py2rpy compiler, written in Bash:
# suppose that /pypy/source/ contains the source code for pypy (i.e. Python -> Nothing RPython)
mkdir -p /tmp/py2rpy
cp -r /pypy/source/ /tmp/py2rpy/pypy/
# suppose $inputfile contains arbitrary Python source code
cp "$inputfile" /tmp/py2rpy/prog.py
# generate the main.rpy
echo "import pypy; pypy.execfile('prog.py')" > /tmp/py2rpy/main.rpy
cp -r /tmp/py2rpy/ "$outputdir"
Now whenever you need to translate Python code to RPython code, you call this script, which produces, in $outputdir, an RPython main.rpy, the source code of the Python interpreter written in RPython, and a binary blob prog.py. You can then execute the generated RPython script by calling rpython main.rpy.
(Note: since I'm not familiar with the RPython project, the syntax for calling the rpython interpreter, the ability to import pypy and call pypy.execfile, and the .rpy extension are purely made up, but I think you get the point.)
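The same bundling trick can be sketched in runnable form as a trivial Python-to-Python "compiler". This is only an illustration under stated assumptions: Python's own exec() stands in for the bundled interpreter, and the function name bundle is hypothetical. The generated program simply embeds the input as a blob and hands it to the interpreter.

```python
def bundle(program_source: str) -> str:
    """Return a new Python program that embeds program_source and interprets it."""
    return (
        "# generated launcher: this is the 'compiled' output\n"
        f"PROGRAM = {program_source!r}\n"
        "exec(compile(PROGRAM, 'prog.py', 'exec'))\n"
    )


# An arbitrary input program, treated as an opaque blob by the "compiler".
original = "result = sum(range(5))\nprint(result)\n"
generated = bundle(original)
print(generated)
```

Running the generated text behaves like the original program, even though no real translation took place, which is exactly why such a "compiler" is technically valid and practically uninteresting.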
q3. Is there some general rules or theory available about this?
Yes, any Turing-complete language can theoretically be translated to any other Turing-complete language. Some languages may be much more difficult to translate than others, but if the question is "is it possible?", the answer is "yes".
q4. ...
There is no question here.
The general answer is no, C language compilers are not compatible with each other. The C language standard does not define any kind of binary interoperability, and most compiler writers don't even try.
I need to qualify that. The objects emitted by a C compiler have to be linked with runtime libraries to produce either an executable or a runtime linkable library. Although the visible functions provided by the C runtime library should be compatible, there will also be non-visible functions that are unique to the implementation and prevent interoperability.
This lack of compatibility also extends to different versions of the same compiler. In general, programs and libraries compiled with older and newer versions of a compiler cannot be linked together, and those compiled with MSVC cannot be linked with those compiled by GCC.
There is a specific and very useful exception. Every platform provides a dynamic linking ABI (Application Binary Interface) and any program in any language that can conform to that ABI is compatible. Therefore it is generally possible to build a DLL (on Windows) with MSVC (or something else) and call it from a program compiled by a different version of MSVC or by GCC and vice versa.
There are two other ABIs on Windows: COM and .NET assemblies, and they span a wide range of languages. So interoperability is definitely possible, but compatible they are not.
The degree of incompatibility can easily be seen by comparing the linker maps. For GNU use ld -M, for MSVC use link /map. Study the two generated files. Both will have names in them that you recognise, such as printf and main, although (depending on options) the names are likely to be mangled in various ways. They will also have names that are completely different, many of which you won't recognise. In order for object files produced by different compilers to be compatible they have to agree on all those names, and they never do. Not even different versions of the same compiler can always do that.
Best Answer
Intro
A typical compiler does the following steps:
Most modern compilers (for instance, gcc and clang) repeat the last two steps once more. They use an intermediate low-level but platform-independent language for initial code generation. Then that language is converted into platform-specific code (x86, ARM, etc) doing roughly the same thing in a platform-optimized way. This includes e.g. the use of vector instructions when possible, instruction reordering to increase branch prediction efficiency, and so on.
After that, object code is ready for linking. Most native-code compilers know how to call a linker to produce an executable, but it's not a compilation step per se. In languages like Java and C# linking may be totally dynamic, done by the VM at load time.
Remember the basics
This classic sequence applies to all software development, but bears repetition.
Concentrate on the first step of the sequence. Create the simplest thing that could possibly work.
Read the books!
Read the Dragon Book by Aho and Ullman. This is classic and is still quite applicable today.
Modern Compiler Design is also praised.
If this stuff is too hard for you right now, read some intros on parsing first; usually parsing libraries include intros and examples.
Make sure you're comfortable working with graphs, especially trees. These things are the stuff programs are made of on the logical level.
Define your language well
Use whatever notation you want, but make sure you have a complete and consistent description of your language. This includes both syntax and semantics.
It's high time to write snippets of code in your new language as test cases for the future compiler.
Use your favorite language
It's totally OK to write a compiler in Python or Ruby or whatever language is easy for you. Use simple algorithms you understand well. The first version does not have to be fast, or efficient, or feature-complete. It only needs to be correct enough and easy to modify.
It's also OK to write different stages of a compiler in different languages, if needed.
Prepare to write a lot of tests
Your entire language should be covered by test cases; effectively it will be defined by them. Get well-acquainted with your preferred testing framework. Write tests from day one. Concentrate on 'positive' tests that accept correct code, as opposed to detection of incorrect code.
Run all the tests regularly. Fix broken tests before proceeding. It would be a shame to end up with an ill-defined language that cannot accept valid code.
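A table of snippets plus a small harness is often enough to start with. The sketch below uses the standard unittest module; front_end_accepts is a hypothetical stand-in for your own compiler's entry point, and here it merely delegates to Python's built-in compile() so that the example runs as-is.

```python
import unittest


def front_end_accepts(source: str) -> bool:
    """Stand-in for the compiler under test; here it is Python's own compile()."""
    try:
        compile(source, "<test>", "exec")
    except SyntaxError:
        return False
    return True


# Positive cases: snippets of valid code that the compiler must accept.
POSITIVE_CASES = {
    "assignment": "x = 1\n",
    "if_else": "if 1:\n    y = 2\nelse:\n    y = 3\n",
    "function": "def f(a, b):\n    return a + b\n",
}


class PositiveTests(unittest.TestCase):
    def test_valid_programs_are_accepted(self):
        for name, source in POSITIVE_CASES.items():
            with self.subTest(case=name):
                self.assertTrue(front_end_accepts(source))
```

Adding a new language feature then starts with adding a new entry to the table.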
Create a good parser
Parser generators are many. Pick whatever you want. You may also write your own parser from scratch, but it is only worth it if the syntax of your language is dead simple.
The parser should detect and report syntax errors. Write a lot of test cases, both positive and negative; reuse the code you wrote while defining the language.
Output of your parser is an abstract syntax tree.
If your language has modules, the output of the parser may be the simplest representation of 'object code' you generate. There are plenty of simple ways to dump a tree to a file and to quickly load it back.
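For a dead-simple syntax, a hand-written parser can be quite short. The sketch below is an assumption-laden illustration, not a recommendation of any particular design: it parses integer arithmetic with a recursive-descent parser, represents the tree as nested lists such as ["+", 1, ["*", 2, 3]], and dumps it with the standard json module.

```python
import json
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character near position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text):
    """Parse an arithmetic expression into a nested-list syntax tree."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expr():  # expr := term (("+" | "-") term)*
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = [op, node, term()]
        return node

    def term():  # term := factor (("*" | "/") factor)*
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = [op, node, factor()]
        return node

    def factor():  # factor := INTEGER | "(" expr ")"
        token = advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token.isdigit():
            return int(token)
        if token == "(":
            node = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {token!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


tree = parse("1 + 2 * 3")
dumped = json.dumps(tree)    # a simple on-disk form of the tree
restored = json.loads(dumped)
```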
Create a semantic validator
Most probably your language allows for syntactically correct constructions that may make no sense in certain contexts. An example is a duplicate declaration of the same variable or passing a parameter of a wrong type. The validator will detect such errors looking at the tree.
The validator will also resolve references to other modules written in your language, load these other modules and use them in the validation process. For instance, this step will make sure that the number of parameters passed to a function from another module is correct.
Again, write and run a lot of test cases. Trivial cases are as indispensable for troubleshooting as smart and complex ones.
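The two checks mentioned above, duplicate declarations and a wrong number of parameters, can be sketched as a single pass over the tree. Everything here is hypothetical: the program is a flat list of statement tuples, and signatures is a plain dictionary standing in for information loaded from other modules.

```python
def validate(program, signatures):
    """Return a list of error messages; an empty list means the program is valid."""
    errors = []
    declared = set()
    for index, statement in enumerate(program):
        kind = statement[0]
        if kind == "declare":
            name = statement[1]
            if name in declared:
                errors.append(f"{index}: duplicate declaration of {name}")
            declared.add(name)
        elif kind == "call":
            function, arguments = statement[1], statement[2]
            expected = signatures.get(function)
            if expected is None:
                errors.append(f"{index}: unknown function {function}")
            elif expected != len(arguments):
                errors.append(
                    f"{index}: {function} expects {expected} arguments, "
                    f"got {len(arguments)}"
                )
        else:
            errors.append(f"{index}: unknown statement kind {kind}")
    return errors


program = [("declare", "x"), ("declare", "x"), ("call", "max2", ["x"])]
errors = validate(program, {"max2": 2})
```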
Generate code
Use the simplest techniques you know. Often it's OK to directly translate a language construct (like an if statement) to a lightly-parametrized code template, not unlike an HTML template. Again, ignore efficiency and concentrate on correctness.
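A template for an if statement might look like the sketch below. The target instructions (jump_if_false, jump, and the label syntax) are made up for illustration and do not belong to any real VM; the caller supplies a unique number n so that labels from different statements do not clash.

```python
# Template for: if <condition> then <then_body> else <else_body>
IF_TEMPLATE = """\
{condition}
    jump_if_false else_{n}
{then_body}
    jump end_{n}
else_{n}:
{else_body}
end_{n}:
"""


def gen_if(condition, then_body, else_body, n):
    """Fill the template with already-generated code for the three parts."""
    return IF_TEMPLATE.format(
        condition=condition, then_body=then_body, else_body=else_body, n=n
    )


code = gen_if("    load x", "    print 1", "    print 0", n=0)
print(code)
```

The generated code is naive (for example, it always emits the else branch), which is fine at this stage.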
Target a platform-independent low-level VM
I assume you would rather ignore low-level stuff unless you're keenly interested in hardware-specific details. These details are gory and complex.
Your options:
Ignore optimization
Optimization is hard. Almost always optimization is premature. Generate inefficient but correct code. Implement the whole language before you try to optimize the resulting code.
Of course, trivial optimizations are OK to introduce. But avoid any cunning, hairy stuff before your compiler is stable.
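Constant folding is one example of an optimization that is trivial enough to add early. The sketch below assumes a hypothetical tree shape of nested lists [op, left, right] with integer or variable-name leaves; division is deliberately left out so that folding never has to deal with division by zero.

```python
import operator

# Operators that are always safe to evaluate at compile time.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold(node):
    """Replace operator nodes whose operands are both constants by their value."""
    if not isinstance(node, list):
        return node  # leaf: an integer constant or a variable name
    op, left, right = node
    left, right = fold(left), fold(right)
    if op in OPS and isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)
    return [op, left, right]


folded = fold(["+", "x", ["*", 2, 3]])
```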
So what?
If all this stuff is not too intimidating for you, please proceed! For a simple language, each of the steps may be simpler than you might think.
Seeing a 'Hello world' from a program that your compiler created might be worth the effort.