Your question (as your final paragraph hints) is not really about the lexer, it is about the correct design of the interface between the lexer and the parser. As you might imagine there are many books about the design of lexers and parsers. I happen to like the parser book by Dick Grune, but it may not be a good introductory book. I happen to intensely dislike the C-based book by Appel, because the code is not usefully extensible into your own compiler (because of the memory management issues inherent in the decision to pretend C is like ML). My own introduction was the book by PJ Brown, but it's not a good general introduction (though quite good for interpreters specifically). But back to your question.
The answer is, do as much as you can in the lexer without needing to use forward- or backward-looking constraints.
This means that (depending of course on the details of the language) you should recognise a string as a `"` character, followed by a sequence of non-`"` characters, and then another `"` character. Return that to the parser as a single unit. There are several reasons for this, but the important ones are:
- This reduces the amount of state the parser needs to maintain, limiting its memory consumption.
- This allows the lexer implementation to concentrate on recognising the fundamental building blocks and frees the parser up to describe how the individual syntactic elements are used to build a program.
Very often parsers can take immediate actions on receiving a token from the lexer. For example, as soon as IDENTIFIER is received, the parser can perform a symbol table lookup to find out if the symbol is already known. If your parser also parses string constants as QUOTE (IDENTIFIER SPACES)* QUOTE, you will perform a lot of irrelevant symbol table lookups, or you will end up hoisting the symbol table lookups higher up the parser's tree of syntax elements, because you can only do them at the point where you're sure you are not looking at a string.
To put the same point differently: the lexer should be concerned with the spelling of things, and the parser with the structure of things.
You might notice that my description of what a string looks like seems a lot like a regular expression. This is no coincidence. Lexical analysers are frequently implemented in little languages (in the sense of Jon Bentley's excellent Programming Pearls book) which use regular expressions. I'm just used to thinking in terms of regular expressions when recognising text.
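To make that concrete, here is a minimal sketch in Python (the token names and rules are my own illustrative choices, not from any particular tool); the string rule described above comes out as a one-line regular expression:

```python
import re

# Each token type is one named alternative; the names are illustrative.
TOKEN_RE = re.compile(r"""
    (?P<STRING>     "[^"]*" )        # a '"', a run of non-'"', a closing '"'
  | (?P<IDENTIFIER> [A-Za-z_]\w* )
  | (?P<NUMBER>     \d+ )
  | (?P<WHITESPACE> \s+ )            # recognised here, never sent to the parser
""", re.VERBOSE)

def tokens(text):
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != "WHITESPACE":   # see the whitespace advice below
            yield m.lastgroup, m.group()

print(list(tokens('print "hello world" 42')))
# [('IDENTIFIER', 'print'), ('STRING', '"hello world"'), ('NUMBER', '42')]
```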
Regarding your question about whitespace: recognise it in the lexer. If your language is intended to be fairly free-format, don't return WHITESPACE tokens to the parser, because it would only have to throw them away; your parser's production rules would be spammed with noise – things recognised just to be thrown away.
As for how you should handle whitespace when it is syntactically significant, I'm not sure I can make a judgment for you that will really work well without knowing more about your language. My snap judgment is to avoid cases where whitespace is sometimes significant and sometimes not, and to use some kind of delimiter (like quotes) instead. But if you aren't free to design the language however you prefer, this option may not be available to you.
There are other ways to design language parsing systems. Certainly there are compiler construction systems that allow you to specify a combined lexer and parser system (I think the Java version of ANTLR does this), but I have never used one.
Lastly, a historical note. Decades ago, it was important for the lexer to do as much as possible before handing over to the parser, because the two programs would not fit in memory at the same time. Doing more in the lexer left more memory available to make the parser smart. I used the Whitesmiths C Compiler for a number of years, and if I understand correctly, it would operate in only 64KB of RAM (it was a small-model MS-DOS program) and even so it translated a variant of C that was very, very close to ANSI C.
The basics of most procedural languages are pretty much the same.
They offer:
- Scalar data types: usually boolean, integers, floats and characters
- Compound data types: arrays (strings are a special case) and structures
- Basic code constructs: arithmetic over scalars, array/structure access, assignments
- Simple control structures: if-then, if-then-else, while, for loops
- Packages of code blocks: functions, procedures with parameters
- Scopes: areas in which identifiers have specific meanings
If you understand this, you have a good grasp of 90% of the languages on the planet.
What makes these languages slightly more difficult to understand is the incredible variety of odd syntax that people use to say the same basic things. Some use terse notation involving odd punctuation (APL being an extreme). Some use lots of keywords (COBOL being an excellent representative). That doesn't matter much. What does matter is whether the language is complete enough by itself to do complex tasks without making you tear your hair out. (Try coding some serious string hacking in the Windows DOS shell script language: it is Turing capable but really bad at everything.)
More interesting procedural languages offer:
- Nested or lexical scopes, namespaces
- Pointers allowing one entity to refer to another, with dynamic storage allocation
- Packaging of related code: packages, objects with methods, traits
- More sophisticated control: recursion, continuations, closures
- Specialized operators: string and array operations, math functions
While not technically a property of the language, but of the ecosystem in which such languages live, the libraries that are easily accessible or provided with the language are part of what makes it useful. Having a wide range of library facilities simplifies and speeds up writing applications, simply because one doesn't have to reinvent what the libraries already do. While Java and C# are widely thought to be good languages in and of themselves, what makes them truly useful are the huge libraries that come with them, and the easily obtainable extension libraries.
The languages which are harder to understand are the non-procedural ones:
- Purely functional languages, with no assignments or side effects
- Logic languages, such as Prolog, in which symbolic computation and unification occur
- Pattern matching languages, in which you specify shapes that are matched to the problem, and often actions are triggered by a match
- Constraint languages, which let you specify relations and automatically solve equations
- Hardware description languages, in which everything executes in parallel
- Domain-specific languages, such as SQL, Colored Petri Nets, etc.
There are two major representational styles for languages:
- Text based, in which identifiers name entities, and information flows are encoded implicitly in formulas that use the identifiers to name the entities (Java, APL, ...)
- Graphical, in which entities are drawn as nodes, and relations between entities are drawn as explicit arcs between those nodes (UML, Simulink, LabView)
The graphical languages often allow textual sublanguages as annotations in nodes and on arcs. Odder graphical languages recursively allow graphs (with text :) in nodes and on arcs. Really odd graphical languages allow annotation graphs to point to graphs being annotated.
Most of these languages are based on a very small number of models of computation:
- The lambda calculus (basis for Lisp and all functional languages)
- Post systems (or string/tree/graph rewriting techniques)
- Turing machines (state modification and selection of new memory cells)
Given the focus by most of industry on procedural languages and complex control structures, you are well served if you learn one of the more interesting languages in this category well, especially if it includes some type of object-orientation.
I highly recommend learning Scheme, in particular from a really wonderful book:
Structure and Interpretation of Computer Programs. This describes all these basic concepts. If you know this stuff, other languages will seem pretty straightforward except for goofy syntax.
Best Answer
No, you can't. Here is a nonsense snippet of Lisp (an illustrative stand-in; any snippet with hyphenated identifiers and a quote makes the point):
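```lisp
(defun do-stuff* (magic-number)
  (frobnicate 'magic-number (* magic-number 2)))
```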
What are the tokens in this snippet? The answer is something like:
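```
LPAREN IDENTIFIER(defun) IDENTIFIER(do-stuff*) LPAREN IDENTIFIER(magic-number) RPAREN
LPAREN IDENTIFIER(frobnicate) QUOTE IDENTIFIER(magic-number)
LPAREN IDENTIFIER(*) IDENTIFIER(magic-number) NUMBER(2) RPAREN RPAREN RPAREN
```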
Lisps have very liberal rules about what can be inside an identifier: hyphens `-`, asterisks `*`, and many other characters are allowed. Expressions can be quoted, which prevents their evaluation; in many dialects this is done by prepending a `'` to that expression.

A lexer needs to be aware of the language that it's parsing. If we had used a C lexer on the above code, we would get an error, because in C a `'` introduces a character literal, which also requires a closing `'` – none is present here. Our C tokenizer might produce:
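```
LPAREN IDENTIFIER(defun) IDENTIFIER(do) MINUS IDENTIFIER(stuff) STAR
LPAREN IDENTIFIER(magic) MINUS IDENTIFIER(number) RPAREN
LPAREN IDENTIFIER(frobnicate)
error: missing terminating ' character
```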
For each language we want to lex, we have to write special rules. Every language is unique, and many languages have subtle differences. Here, the identifiers are so different that the two token streams have lost any resemblance to each other. However, some languages do inherit syntax from each other: for a very simple snippet like `foo.bar(baz)`, a lexer written for C++ and one for Python might produce comparable results.

But why do we actually use lexers?
Formally, lexers are not needed. We can describe a language without them just fine. Lexers are a performance optimization: they are basically very restricted preprocessors for a parser, and they can match the input very efficiently (e.g. implemented as a state machine). The lexer then emits tokens, which are larger building blocks (a token is a pairing of a type or ID with a string). It is then the job of the parser to combine these tokens into a hierarchical structure (generally, an abstract syntax tree). This frees the parser from doing character-level operations, which makes it more efficient and easier to maintain (I have written parsers without this separation – it works, but it is annoying).
And here we have another problem: a lexer is written for a specific parser. The two work in tandem and cannot be recombined willy-nilly with other lexers or parsers. In the above token streams I have called the `(` operator `LPAREN`, but a different grammar might require an `OP_LP` symbol instead. So, for a variety of reasons, it's impossible to write one lexer to rule them all.
Each language needs its own lexer. But we can take shortcuts and use programs that generate specialized lexers for us from some specification. Often, tools like `lex` are used for this. If performance is not an issue, I use the regular expression library of the host language to cobble together a simple lexer, e.g. in Perl:
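A minimal sketch (the input, token names, and regex are illustrative assumptions, not any canonical definition):

```perl
use strict;
use warnings;

my $input = q{(frobnicate 'magic-number 42)};

# One named alternative per token type. After each match,
# %+ contains exactly the named group that matched.
my $token_re = qr/
    (?<LPAREN>     [(] )
  | (?<RPAREN>     [)] )
  | (?<QUOTE>      ' )
  | (?<NUMBER>     \d+ )
  | (?<IDENTIFIER> [^\s()']+ )
  | (?<WHITESPACE> \s+ )
/x;

while ($input =~ /\G$token_re/gc) {
    my ($type) = keys %+;
    print "$type($+{$type}) " unless $type eq 'WHITESPACE';
}
print "\n";
```

Output:

```
LPAREN(() IDENTIFIER(frobnicate) QUOTE(') IDENTIFIER(magic-number) NUMBER(42) RPAREN())
```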
But we can easily change the token definitions – say, to the `OP_LP` naming mentioned above:
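```perl
my $token_re = qr/
    (?<OP_LP>      [(] )
  | (?<OP_RP>      [)] )
  | (?<QUOTE>      ' )
  | (?<NUMBER>     \d+ )
  | (?<IDENTIFIER> [^\s()']+ )
  | (?<WHITESPACE> \s+ )
/x;
```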
which changes the output to:
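```
OP_LP(() IDENTIFIER(frobnicate) QUOTE(') IDENTIFIER(magic-number) NUMBER(42) OP_RP())
```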
We didn't write a new lexer; we just updated the token definitions.
What to do now:
As a preparation, understand and implement the Shunting-Yard Algorithm. This will teach you about a simple stack machine.
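To make the idea concrete, here is a minimal sketch in Python (the precedence table, token handling, and function name are my own illustrative choices; only left-associative binary operators are handled):

```python
# Shunting-yard: convert infix tokens to reverse Polish notation (RPN)
# using an operator stack. The precedences here are illustrative.
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def to_rpn(tokens):
    output, ops = [], []
    for tok in tokens:
        if tok in PREC:
            # Pop operators of higher or equal precedence (left-associative).
            while ops and ops[-1] != "(" and PREC[ops[-1]] >= PREC[tok]:
                output.append(ops.pop())
            ops.append(tok)
        elif tok == "(":
            ops.append(tok)
        elif tok == ")":
            while ops[-1] != "(":           # unwind until the matching "("
                output.append(ops.pop())
            ops.pop()                       # discard the "(" itself
        else:                               # a number
            output.append(tok)
    while ops:                              # flush remaining operators
        output.append(ops.pop())
    return output

print(to_rpn("1 + 2 * 3".split()))          # ['1', '2', '3', '*', '+']
```

Evaluating the resulting RPN with a value stack takes only a few more lines, and that value stack is exactly the simple stack machine this step is meant to teach you about.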
Read up on grammars and languages in computer science. Learn what a rule is.
Read up on Extended Backus-Naur Form notation, even if you don't immediately understand it. This is a way to write down grammars, and grammars are a way to describe languages (e.g. programming languages).
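For instance, simple arithmetic expressions could be described like this in one common EBNF dialect (the rule names are arbitrary):

```
expr   = term   { ("+" | "-") term } ;
term   = factor { ("*" | "/") factor } ;
factor = number | "(" expr ")" ;
```

Here `{ ... }` means "zero or more repetitions", `|` separates alternatives, and `number` is left to the lexer (or to a further rule).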
Understand the implications of the Chomsky Hierarchy. Here are the three important levels:
- Regular languages: what regular expressions and finite state machines can recognise; this is the level at which lexers operate.
- Context-free languages: what grammars such as EBNF can describe, and what ordinary parsers can handle.
- Context-sensitive languages and beyond: strictly more powerful, but generally too expensive to parse directly.
Note that real programming languages are not context-free – semantic whitespace as in Python, here-docs as in Bash, and optional end tags in HTML all mess up the most elegant descriptions (but tricks exist to fake a context-free grammar even then).
Write a simple recursive-descent parser yourself that can evaluate simple arithmetic like `1 + 2 * 3` and manages operator precedence correctly. Using a lexer is not necessary for this. Add parens to your language, then some functions like `abs` or `sin`.
Post your solution on codereview to get tips on how to do it better.
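If you want something to compare your attempt against, here is a minimal sketch in Python (the class and method names are mine; it follows the arithmetic EBNF sketched earlier, working directly on characters, without a lexer):

```python
# Recursive descent: one method per grammar rule, evaluating as it parses.
#   expr   = term   { ("+" | "-") term }
#   term   = factor { ("*" | "/") factor }
#   factor = number | "(" expr ")"
class Parser:
    def __init__(self, text):
        self.text = text.replace(" ", "")   # no lexer: strip spaces up front
        self.pos = 0

    def peek(self):
        # "\0" marks end of input so membership tests below stay safe.
        return self.text[self.pos] if self.pos < len(self.text) else "\0"

    def eat(self, ch):
        assert self.peek() == ch, f"expected {ch!r} at position {self.pos}"
        self.pos += 1

    def expr(self):
        value = self.term()
        while self.peek() in "+-":
            op = self.peek(); self.eat(op)
            value = value + self.term() if op == "+" else value - self.term()
        return value

    def term(self):
        value = self.factor()
        while self.peek() in "*/":
            op = self.peek(); self.eat(op)
            value = value * self.factor() if op == "*" else value / self.factor()
        return value

    def factor(self):
        if self.peek() == "(":
            self.eat("(")
            value = self.expr()
            self.eat(")")
            return value
        start = self.pos
        while self.peek().isdigit():
            self.pos += 1
        return int(self.text[start:self.pos])

print(Parser("1 + 2 * 3").expr())           # 7 – precedence falls out of the grammar
```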
Now that you've learned how to parse stuff yourself, you may enjoy helpful tools like parser generators. As an exercise, pick a simple language like JSON and write a parser for it with your tool of choice. Return to codereview for criticism.

If you have any problems on the way, ask here on Programmers or over on Stack Overflow, depending on whether the question is conceptual or implementation-related.