Your question (as your final paragraph hints) is not really about the lexer; it is about the correct design of the interface between the lexer and the parser. As you might imagine, there are many books about the design of lexers and parsers. I happen to like the parser book by Dick Grune, but it may not be a good introductory book. I happen to intensely dislike the C-based book by Appel, because the code is not usefully extensible into your own compiler (because of the memory-management issues inherent in the decision to pretend C is like ML). My own introduction was the book by PJ Brown, but it's not a good general introduction (though quite good for interpreters specifically). But back to your question.
The answer is, do as much as you can in the lexer without needing to use forward- or backward-looking constraints.
This means that (depending of course on the details of the language) you should recognise a string as a " character, followed by a sequence of non-" characters, followed by another " character, and return that to the parser as a single unit. There are several reasons for this (illustrated in the sketch after this list), but the important ones are:
- This reduces the amount of state the parser needs to maintain, limiting its memory consumption.
- This allows the lexer implementation to concentrate on recognising the fundamental building blocks and frees the parser up to describe how the individual syntactic elements are used to build a program.
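For concreteness, here is a minimal sketch of this in a hand-written lexer. The token names (STRING, IDENTIFIER, PUNCT) and the tokenize() interface are my own illustrations, not from any particular tool:

```python
# A minimal hand-written lexer sketch: strings are recognised whole,
# so the parser receives each one as a single STRING token.
def tokenize(src):
    tokens = []
    i = 0
    while i < len(src):
        c = src[i]
        if c.isspace():
            i += 1                      # consume whitespace here, in the lexer
        elif c == '"':
            j = i + 1
            while j < len(src) and src[j] != '"':
                j += 1
            tokens.append(("STRING", src[i:j + 1]))  # one unit for the parser
            i = j + 1
        elif c.isalpha():
            j = i
            while j < len(src) and src[j].isalnum():
                j += 1
            tokens.append(("IDENTIFIER", src[i:j]))
            i = j
        else:
            tokens.append(("PUNCT", c))
            i += 1
    return tokens
```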
Very often parsers can take immediate actions on receiving a token from the lexer. For example, as soon as IDENTIFIER is received, the parser can perform a symbol table lookup to find out if the symbol is already known. If your parser also parses string constants as QUOTE (IDENTIFIER SPACES)* QUOTE you will perform a lot of irrelevant symbol table lookups, or you will end up hoisting the symbol table lookups higher up the parser's tree of syntax elements, because you can only do it at the point you're now sure you are not looking at a string.
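For instance, continuing the tokenize() sketch above (the symbols dictionary is purely illustrative):

```python
# A sketch of an immediate parser action: each IDENTIFIER is looked up
# in the symbol table the moment the lexer hands it over. No lookup is
# ever wasted on the contents of a STRING token.
symbols = {}

for kind, text in tokenize('print "hello world" name'):
    if kind == "IDENTIFIER":
        entry = symbols.setdefault(text, {"uses": 0})  # already known?
        entry["uses"] += 1

print(symbols)   # -> {'print': {'uses': 1}, 'name': {'uses': 1}}
```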
To restate the point differently: the lexer should be concerned with the spelling of things, and the parser with the structure of things.
You might notice that my description of what a string looks like seems a lot like a regular expression. This is no coincidence. Lexical analysers are frequently implemented in little languages (in the sense of Jon Bentley's excellent Programming Pearls book) which use regular expressions. I'm just used to thinking in terms of regular expressions when recognising text.
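As a small illustration, here is the string shape from above written as a regular expression in Python's re module (a sketch, assuming the simple no-escape strings described earlier):

```python
import re

# A quote, any number of non-quote characters, then the closing quote.
STRING = re.compile(r'"[^"]*"')

print(STRING.match('"hello world" + x').group())   # -> "hello world"
```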
Regarding your question about whitespace: recognise it in the lexer. If your language is intended to be pretty free-format, don't return WHITESPACE tokens to the parser, because it will only have to throw them away; your parser's production rules would essentially be spammed with noise, things recognised just to be thrown away.
As for how you should handle whitespace when it is syntactically significant, I'm not sure I can make a judgment for you that will really work well without knowing more about your language. My snap judgment is to avoid cases where whitespace is sometimes important and sometimes not, and to use some kind of delimiter (like quotes). But if you are not free to design the language however you prefer, this option may not be available to you.
There are other ways to design language parsing systems. Certainly there are compiler construction systems that allow you to specify a combined lexer and parser system (I think the Java version of ANTLR does this), but I have never used one.
Lastly, a historical note. Decades ago, it was important for the lexer to do as much as possible before handing over to the parser, because the two programs would not fit in memory at the same time. Doing more in the lexer left more memory available to make the parser smart. I used the Whitesmiths C Compiler for a number of years, and if I understand correctly, it would operate in only 64KB of RAM (it was a small-model MS-DOS program) and even so it translated a variant of C that was very, very close to ANSI C.
Intro
A typical compiler performs the following steps:
- Parsing: the source text is converted to an abstract syntax tree (AST).
- Resolution of references to other modules (C postpones this step till linking).
- Semantic validation: weeding out syntactically correct statements that make no sense, e.g. unreachable code or duplicate declarations.
- Equivalent transformations and high-level optimization: the AST is transformed to represent a more efficient computation with the same semantics. This includes e.g. early calculation of common subexpressions and constant expressions, eliminating excessive local assignments (see also SSA), etc.
- Code generation: the AST is transformed into linear low-level code, with jumps, register allocation and the like. Some function calls can be inlined at this stage, some loops unrolled, etc.
- Peephole optimization: the low-level code is scanned for simple local inefficiencies which are eliminated.
Most modern compilers (for instance, gcc and clang) repeat the last two steps once more. They use an intermediate low-level but platform-independent language for initial code generation. Then that language is converted into platform-specific code (x86, ARM, etc) doing roughly the same thing in a platform-optimized way. This includes e.g. the use of vector instructions when possible, instruction reordering to increase branch prediction efficiency, and so on.
After that, object code is ready for linking. Most native-code compilers know how to call a linker to produce an executable, but it's not a compilation step per se. In languages like Java and C# linking may be totally dynamic, done by the VM at load time.
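To make the high-level optimization step concrete, here is a minimal sketch of constant folding on a toy AST; the tuple-based node shape ("+", left, right) is invented purely for illustration:

```python
# Constant folding: tuple nodes like ("+", left, right), with ints and
# variable-name strings as leaves.
def fold(node):
    if isinstance(node, tuple):
        op, left, right = node
        left, right = fold(left), fold(right)
        if op == "+" and isinstance(left, int) and isinstance(right, int):
            return left + right          # compute the constant subexpression now
        return (op, left, right)
    return node                          # leaf: a literal or a variable name

print(fold(("+", ("+", 1, 2), "x")))     # -> ('+', 3, 'x')
```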
Remember the basics
- Make it work
- Make it beautiful
- Make it efficient
This classic sequence applies to all software development, but bears repetition.
Concentrate on the first step of the sequence. Create the simplest thing that could possibly work.
Read the books!
Read the Dragon Book by Aho and Ullman. This is classic and is still quite applicable today.
Modern Compiler Design is also praised.
If this stuff is too hard for you right now, read some intros on parsing first; usually parsing libraries include intros and examples.
Make sure you're comfortable working with graphs, especially trees; at the logical level, they are the stuff programs are made of.
Define your language well
Use whatever notation you want, but make sure you have a complete and consistent description of your language. This includes both syntax and semantics.
It's high time to write snippets of code in your new language as test cases for the future compiler.
Use your favorite language
It's totally OK to write a compiler in Python or Ruby or whatever language is easy for you. Use simple algorithms you understand well. The first version does not have to be fast, or efficient, or feature-complete. It only needs to be correct enough and easy to modify.
It's also OK to write different stages of a compiler in different languages, if needed.
Prepare to write a lot of tests
Your entire language should be covered by test cases; effectively it will be defined by them. Get well-acquainted with your preferred testing framework. Write tests from day one. Concentrate on 'positive' tests that accept correct code, as opposed to detection of incorrect code.
Run all the tests regularly. Fix broken tests before proceeding. It would be a shame to end up with an ill-defined language that cannot accept valid code.
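As a sketch of what day-one tests might look like with Python's unittest (the mycompiler module and its parse() function are hypothetical names; substitute your own):

```python
# Day-one tests: parse() is assumed to return an AST and raise
# SyntaxError on bad input.
import unittest
from mycompiler import parse

class TestParser(unittest.TestCase):
    def test_accepts_assignment(self):            # a 'positive' test
        self.assertIsNotNone(parse("x = 1 + 2"))

    def test_rejects_unbalanced_paren(self):      # a 'negative' test
        with self.assertRaises(SyntaxError):
            parse("x = (1 + 2")

if __name__ == "__main__":
    unittest.main()
```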
Create a good parser
There are many parser generators; pick whichever you want. You may also write your own parser from scratch, but that is only worth it if the syntax of your language is dead simple.
The parser should detect and report syntax errors. Write a lot of test cases, both positive and negative; reuse the code you wrote while defining the language.
The output of your parser is an abstract syntax tree.
If your language has modules, the output of the parser may be the simplest representation of 'object code' you generate. There are plenty of simple ways to dump a tree to a file and to quickly load it back.
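For instance, with tuple- or list-shaped nodes, json is one quick way to dump a tree and load it back (a sketch; the node shape is illustrative):

```python
import json

# Dump an AST to a file as the simplest 'object code', and load it back.
# Note json turns tuples into lists, which is fine if you treat the two
# interchangeably.
def save_module(ast, path):
    with open(path, "w") as f:
        json.dump(ast, f)

def load_module(path):
    with open(path) as f:
        return json.load(f)

save_module(("+", 1, "x"), "main.mod")
print(load_module("main.mod"))   # -> ['+', 1, 'x']
```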
Create a semantic validator
Most probably, your language allows syntactically correct constructions that make no sense in certain contexts, e.g. a duplicate declaration of the same variable, or passing a parameter of the wrong type. The validator will detect such errors by looking at the tree.
The validator will also resolve references to other modules written in your language, load those modules, and use them in the validation process. For instance, this step will make sure that the number of parameters passed to a function from another module is correct.
Again, write and run a lot of test cases. Trivial cases are as indispensable in troubleshooting as smart, complex ones.
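Here is a sketch of one such check, detecting duplicate declarations by walking the tree; the node shapes ("declare", name) and ("block", ...) are invented for illustration:

```python
# Semantic validation sketch: walk the tree and report duplicate
# declarations, giving nested blocks their own scope.
def check_duplicates(nodes, declared=None):
    declared = set() if declared is None else declared
    errors = []
    for node in nodes:
        if node[0] == "declare":
            name = node[1]
            if name in declared:
                errors.append(f"duplicate declaration of '{name}'")
            declared.add(name)
        elif node[0] == "block":
            # a nested scope starts with a copy of the enclosing names
            errors.extend(check_duplicates(node[1:], set(declared)))
    return errors

print(check_duplicates([("declare", "x"), ("declare", "x")]))
# -> ["duplicate declaration of 'x'"]
```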
Generate code
Use the simplest techniques you know. Often it's OK to directly translate a language construct (like an if statement) to a lightly-parametrized code template, not unlike an HTML template.
Again, ignore efficiency and concentrate on correctness.
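Here is a sketch of the template idea for an if statement; the pseudo-assembly, the template, and the gen_if() helper are all illustrative:

```python
# Template-based code generation for an 'if' statement: fill in the
# condition code, the then-branch code, and a fresh label number.
IF_TEMPLATE = """\
{cond_code}
    jump_if_false L{label}
{then_code}
L{label}:
"""

def gen_if(cond_code, then_code, label):
    return IF_TEMPLATE.format(cond_code=cond_code,
                              then_code=then_code,
                              label=label)

print(gen_if("    load x", "    call f", 1))
```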
Target a platform-independent low-level VM
I suggest that you ignore low-level stuff unless you're keenly interested in hardware-specific details; those details are gory and complex.
Your options:
- LLVM: allows for efficient machine code generation, usually for x86 and ARM.
- CLR: targets .NET, multiplatform; has a good JIT.
- JVM: targets Java world, quite multiplatform, has a good JIT.
Ignore optimization
Optimization is hard. Almost always optimization is premature.
Generate inefficient but correct code. Implement the whole language before you try to optimize the resulting code.
Of course, trivial optimizations are OK to introduce. But avoid any cunning, hairy stuff before your compiler is stable.
So what?
If all this stuff is not too intimidating for you, please proceed! For a simple language, each of the steps may be simpler than you might think.
Seeing a 'Hello world' from a program that your compiler created might be worth the effort.
Best Answer
Keep in mind that every finite state machine corresponds to a regular expression, which corresponds to a structured program using if and while statements. So, for example, to recognize integers you could have a state machine (a start state that moves to an accepting state on the first digit, with the accepting state looping back to itself on each further digit),
or the regular expression [0-9]+,
or the structured code (here sketched in Python):
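```python
# The 'if' guards entry, the 'while' consumes digits. 'src' is the input
# text and 'pos' the current scan position; both names are illustrative.
def scan_integer(src, pos):
    if pos < len(src) and src[pos].isdigit():
        start = pos
        while pos < len(src) and src[pos].isdigit():
            pos += 1
        return src[start:pos], pos      # the integer's text and new position
    return None, pos

print(scan_integer("123+x", 0))   # -> ('123', 3)
```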
Personally, I always write lexers using the latter, because IMHO it is no less clear, and there is nothing faster.