You oversimplified Guido's statement in phrasing your question. The problem isn't writing a compiler for a dynamically-typed language. The problem is writing one that (criterion 1) is always correct, (criterion 2) keeps dynamic typing, and (criterion 3) is noticeably faster for a significant amount of code.
It's easy to implement 90% of Python (failing criterion 1) and be consistently fast at it. Similarly, it's easy to create a faster Python variant with static typing (failing criterion 2). Implementing 100% is also easy (insofar as implementing a language that complex is easy), but so far every easy way to implement it has turned out to be relatively slow (failing criterion 3).
Implementing an interpreter plus JIT that's correct, implements the entire language, and is faster for some code turns out to be feasible, though significantly harder (cf. PyPy), and only if you automate the creation of the JIT compiler (Psyco did without that automation, but was very limited in what code it could speed up). But note that this is explicitly out of scope, as we're talking about static (aka ahead-of-time) compilers. I only mention it to explain why the JIT approach does not carry over to static compilers (or at least there's no existing counterexample): a JIT first has to interpret and observe the program, then generate code for a specific iteration of a loop (or another linear code path), then optimize the hell out of that based on assumptions that only hold for that specific iteration (or at least, not for all possible iterations). The expectation is that many later executions of that code will also satisfy those assumptions and thus benefit from the optimizations. Some (relatively cheap) checks are added to ensure correctness. To do all this, you need an idea of what to specialize for, and a slow but general implementation to fall back to. AOT compilers have neither. They can't specialize at all based on code they can't see (e.g. dynamically loaded code), and specializing carelessly means generating more code, which has a number of problems (icache utilization, binary size, compile time, additional branches).
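To make the guard-and-fallback idea concrete, here is a toy sketch in Python (all names are hypothetical; a real JIT records a trace and emits machine code, not Python source):

```python
def generic_sum(items, start=0):
    # The slow but fully general path: honors operator overloading,
    # mixed types, etc. -- what the interpreter always does.
    total = start
    for x in items:
        total = total + x
    return total

def specialized_sum(items):
    # Pretend the first traced iteration saw only ints. The "compiled"
    # loop then assumes ints and re-checks that assumption cheaply.
    total = 0
    for i, x in enumerate(items):
        if type(x) is not int:                 # guard
            # Assumption invalidated: deoptimize, i.e. finish the job
            # on the general path.
            return generic_sum(items[i:], start=total)
        total = int.__add__(total, x)          # specialized, no dispatch
    return total

print(specialized_sum([1, 2, 3]))    # 6, fast path throughout
print(specialized_sum([1, 2.5, 3]))  # 6.5, guard fails and falls back
```

The point is that the fast path only pays a cheap type check per iteration, while correctness is preserved by falling back to the general path the moment the assumption breaks. An AOT compiler has neither the observed trace to specialize from nor the interpreter to fall back to.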
Implementing an AOT compiler that correctly implements the entire language is also relatively easy: generate code that calls into the runtime to do whatever the interpreter would do when fed this code. Nuitka (mostly) does this. However, it doesn't yield much performance benefit (failing criterion 3), because you still have to do just as much unnecessary work as an interpreter; all you save is the dispatch of each bytecode to the block of C code that implements it. That is a rather small cost -- significant enough to be worth optimizing in an existing interpreter, but not significant enough to justify a whole new implementation with its own problems.
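A minimal sketch of that tradeoff, using a made-up two-opcode bytecode rather than CPython's real one:

```python
import operator

def runtime_add(a, b):
    # Stands in for the runtime's fully generic addition
    # (dynamic lookup, operator overloading, coercion, ...).
    return operator.add(a, b)

def interpret(code):
    # Interpreter: fetch, decode, and dispatch on every instruction.
    stack = []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(runtime_add(a, b))    # the generic work
    return stack.pop()

def compiled_one_plus_two():
    # What a naive AOT compiler emits for "PUSH 1; PUSH 2; ADD":
    # the same generic runtime call, minus the dispatch loop.
    return runtime_add(1, 2)

print(interpret([("PUSH", 1), ("PUSH", 2), ("ADD", None)]))  # 3
print(compiled_one_plus_two())                               # 3
```

Both versions spend nearly all their time inside runtime_add; the compiled one only skips the loop-and-branch overhead around it.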
What would be needed to fulfill all three criteria? We have no idea. There are some static analysis schemes that can extract some information about concrete types, control flow, etc. from Python programs. The ones that yield accurate data beyond the scope of a single basic block are extremely slow, and they need to see the whole program, or at least most of it. Even then, you can't do much with that information, other than perhaps optimize a few operations on builtin types.
Why's that? To put it bluntly, a compiler either removes the ability to execute Python code loaded at runtime (failing criterion 1), or it makes no assumptions that can be invalidated by any Python code at all. Unfortunately, that rules out pretty much everything useful for optimizing programs: globals, including functions, can be rebound; classes can be mutated or replaced entirely; modules can be modified arbitrarily too; importing can be hijacked in several ways; etc. A single string passed to eval, exec, __import__, or numerous other functions may do any of that. In effect, that means almost no big optimizations can be applied, yielding little performance benefit (failing criterion 3). Back to the previous paragraph.
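To illustrate, every one of the following is legal Python, so none of the assumptions they break is safe for an AOT compiler:

```python
def greet():
    return "hello"

def caller():
    return greet()    # can't be inlined: 'greet' may be rebound later

greet = lambda: "goodbye"     # rebinding a module-level function
print(caller())               # goodbye

class Point:
    def norm(self):
        return abs(self.x) + abs(self.y)

Point.norm = lambda self: 0   # mutating a class after its definition
p = Point(); p.x = p.y = 3
print(p.norm())               # 0, not 6

import math
math.sqrt = lambda x: x       # modifying another module's contents

exec("greet = lambda: 'hi'", globals())   # a single string does it too
print(caller())               # hi
```

A compiler that wants to stay correct must emit code that re-checks all of this at runtime, which is exactly the generic runtime call from the Nuitka paragraph above.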
I've been doing a lot of parsing work recently, and some of the key features are:
A programmatic API -- so it can be used from within a programming language, ideally by simply importing a library (see the sketch after this list). It can have a GUI or BNF-like interface as well, but the programmatic one is the key, because you can re-use your tooling, IDE, static analysis, testing, language abstraction facilities, programmer familiarity, documentation generator, build process, etc. Plus, you can interactively play with tiny parsers, which dramatically reduces the learning curve. These reasons place it at the top of my "important features" list.
Error reporting, as @guysherman mentioned. When an error is found, I want to know where the error was and what was going on when it happened. Unfortunately, I haven't been able to find good resources explaining how to generate decent errors when backtracking comes into play. (Although note @Sk-logic's comment below.)
Partial results. When parsing fails, I want to be able to see what was successfully parsed from the input that precedes the error location.
Abstraction. You can never build in enough functions, and the user will always need more, so trying to figure out all the possible functions up-front is doomed to failure. Is this what you mean by templates?
I agree with your #2 (look-ahead prediction). I think it helps to generate good error reports. Is it useful for anything else?
Support for building a parse tree as parsing occurs, perhaps:
- A concrete syntax tree, for which the structure of the tree corresponds directly to the grammar and includes layout information for error reporting in later phases. In this case, the client shouldn't have to do anything to get the right tree structure -- it should depend directly on the grammar.
- An abstract syntax tree. In this case, the user is able to fiddle with any and all parse trees.
Some kind of logging. I'm not sure about this one; maybe to show a trace of the rules the parser has tried, or to keep track of junk tokens such as whitespace or comments, in case (for instance) you want to use the tokens to generate HTML documentation.
Ability to deal with context-sensitive languages. I'm not sure how important this one is, either -- in practice, it seems cleaner to parse a superset of the language with a context-free grammar and then check the context-sensitive constraints in additional later passes.
Custom error messages, so that I can tune the error reports in specific situations and perhaps more quickly understand and fix problems.
On the other hand, I don't find error correction particularly important -- although I'm not up to date on current progress. The problems I've noticed are that the potential corrections the tools supply 1) are too numerous and 2) don't correspond to the actual mistakes made, so they aren't all that helpful. Hopefully this situation will improve (or perhaps already has).
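As a concrete (if minimal) example of the programmatic API, error reporting, and partial results asked for above, here is a toy combinator library in Python -- every name in it is invented for illustration:

```python
class ParseError(Exception):
    def __init__(self, pos, expected):
        super().__init__(f"at {pos}: expected {expected}")
        self.pos, self.expected = pos, expected
        self.partial = []    # filled in by combinators on the way out

def lit(s):
    """Parser that matches the literal string s."""
    def parse(text, pos):
        if text.startswith(s, pos):
            return s, pos + len(s)
        raise ParseError(pos, repr(s))
    return parse

def seq(*parsers):
    """Run parsers in order; on failure, attach the partial result."""
    def parse(text, pos):
        results = []
        for p in parsers:
            try:
                value, pos = p(text, pos)
            except ParseError as e:
                e.partial = results + e.partial   # expose what did parse
                raise
            results.append(value)
        return results, pos
    return parse

greeting = seq(lit("hello"), lit(" "), lit("world"))
print(greeting("hello world", 0))   # (['hello', ' ', 'world'], 11)
try:
    greeting("hello there", 0)
except ParseError as e:
    print(e, "| parsed so far:", e.partial)
    # at 6: expected 'world' | parsed so far: ['hello', ' ']
```

Because the parsers are plain functions and the grammar is plain code, all the usual tooling (tests, REPL experimentation, documentation generators) applies for free, which is the point of the first item in the list.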
Best Answer
Parser generators (such as Bison) use pretty much the same concepts from finite state machines as you have used in your hand-crafted parser.
The main differences between generated and hand-crafted parsers are