Your question (as your final paragraph hints) is not really about the lexer, it is about the correct design of the interface between the lexer and the parser. As you might imagine there are many books about the design of lexers and parsers. I happen to like the parser book by Dick Grune, but it may not be a good introductory book. I happen to intensely dislike the C-based book by Appel, because the code is not usefully extensible into your own compiler (because of the memory management issues inherent in the decision to pretend C is like ML). My own introduction was the book by PJ Brown, but it's not a good general introduction (though quite good for interpreters specifically). But back to your question.
The answer is, do as much as you can in the lexer without needing to use forward- or backward-looking constraints.
This means that (depending of course on the details of the language) you should recognise a string as a `"` character followed by a sequence of non-`"` characters and then another `"` character. Return that to the parser as a single unit. There are several reasons for this, but the important ones are
- This reduces the amount of state the parser needs to maintain, limiting its memory consumption.
- This allows the lexer implementation to concentrate on recognising the fundamental building blocks and frees the parser up to describe how the individual syntactic elements are used to build a program.
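To make the idea concrete, here is a minimal sketch in Python rather than a lexer-generator language; the token name `STRING` and the helper `lex_string` are made-up illustrations, not part of any particular tool:

```python
import re

# A string literal is a '"', a run of non-'"' characters, and a closing '"'.
# The lexer hands the whole thing to the parser as a single STRING token.
STRING = re.compile(r'"[^"]*"')

def lex_string(source, pos):
    """Return ((kind, text), new_pos) if a string literal starts at pos, else None."""
    m = STRING.match(source, pos)
    if m:
        return ("STRING", m.group(0)), m.end()
    return None

token, end = lex_string('"hello world" rest', 0)
# token is ("STRING", '"hello world"'); the parser never sees the insides.
```

The parser receives one opaque unit and never has to track "am I inside a string?" state itself.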
Very often parsers can take immediate actions on receiving a token from the lexer. For example, as soon as IDENTIFIER is received, the parser can perform a symbol table lookup to find out if the symbol is already known. If your parser also parses string constants as QUOTE (IDENTIFIER SPACES)* QUOTE you will perform a lot of irrelevant symbol table lookups, or you will end up hoisting the symbol table lookups higher up the parser's tree of syntax elements, because you can only do it at the point you're now sure you are not looking at a string.
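A sketch of such an immediate action, with a plain dictionary standing in for a real symbol table (the names `on_token` and `symbol_table` are hypothetical):

```python
# Hypothetical parser action: look up a symbol as soon as the lexer
# reports an IDENTIFIER token. String contents never reach this code,
# because the lexer delivered them as one STRING token.
symbol_table = {}

def on_token(kind, text):
    if kind == "IDENTIFIER":
        # One lookup per identifier token, performed immediately.
        entry = symbol_table.setdefault(text, {"name": text, "uses": 0})
        entry["uses"] += 1
        return entry
    return None

on_token("IDENTIFIER", "count")
on_token("STRING", '"count is not an identifier here"')
on_token("IDENTIFIER", "count")
# symbol_table now holds one entry, "count", with two uses recorded.
```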
To restate what I'm trying to say, but differently, the lexer should be concerned with the spelling of things, and the parser with the structure of things.
You might notice that my description of what a string looks like seems a lot like a regular expression. This is no coincidence. Lexical analysers are frequently implemented in little languages (in the sense of Jon Bentley's excellent Programming Pearls book) which use regular expressions. I'm just used to thinking in terms of regular expressions when recognising text.
Regarding your question about whitespace: recognise it in the lexer. If your language is intended to be fairly free-format, don't return WHITESPACE tokens to the parser, because the parser will only have to throw them away, and its production rules will essentially be spammed with noise: things recognised just to be thrown away.
As for what that means about how you should handle whitespace when it is syntactically significant, I'm not sure I can make a judgment for you that will really work well without knowing more about your language. My snap judgment is to avoid cases where whitespace is sometimes important and sometimes not, and use some kind of delimiter (like quotes). But, if you can't design the language any which way you prefer, this option may not be available to you.
There are other ways to design language parsing systems. Certainly there are compiler construction systems that allow you to specify a combined lexer and parser system (I think the Java version of ANTLR does this) but I have never used one.
Last a historical note. Decades ago, it was important for the lexer to do as much as possible before handing over to the parser, because the two programs would not fit in memory at the same time. Doing more in the lexer left more memory available to make the parser smart. I used to use the Whitesmiths C Compiler for a number of years, and if I understand correctly, it would operate in only 64KB of RAM (it was a small-model MS-DOS program) and even so it translated a variant of C that was very very close to ANSI C.
No, you can't. Here is a nonsense snippet of Lisp:
(*foo-bar* 'baz 14)
What are the tokens in this snippet? The answer is something like
LPAREN
SYMBOL "*foo-bar*"
QUOTE
SYMBOL "baz"
INTEGER "14"
RPAREN
Lisps have very liberal rules about what can be inside an identifier: hyphens `-`, asterisks `*`, and many other characters are allowed. Expressions can be quoted, which prevents their evaluation; in many dialects this is done by prepending a `'` to the expression.
A lexer needs to be aware of the language it is tokenizing. If we had used a C lexer on the above code, we would get an error, because `'` introduces a character literal, which also requires a closing `'` – none is present here. Our C tokenizer might produce:
LPAREN
OP_STAR
IDENTIFIER "foo"
OP_MINUS
IDENTIFIER "bar"
OP_STAR
ERROR character literal isn't terminated: expected »'« after »'b«
For each language we want to lex, we have to write special rules. Every language is unique, and many languages have subtle differences. Here, the identifiers are so different that the token streams have lost any resemblance to each other. However, some languages inherit syntax from each other. For a very simple snippet like `foo.bar(baz)`, a lexer written for C++ and one for Python might produce comparable results.
But why do we actually use Lexers?
Formally, lexers are not needed. We can describe a language without them just fine. Lexers are a performance optimization: they are basically very restricted preprocessors for a parser that can match the input very efficiently (e.g. implemented as a state machine). The lexer emits tokens, which are larger building blocks (a token is a pairing of a type or ID with a string). It is then the job of the parser to combine these tokens into a hierarchical structure (generally, an abstract syntax tree). This frees the parser from doing character-level operations, which makes it more efficient and easier to maintain (I have written parsers without this separation; it works, but it is annoying).
And here we have another problem: a lexer is written for a specific parser. The two work in tandem and cannot be recombined willy-nilly with other lexers or parsers. In the above token streams I have called the `(` operator `LPAREN`, but a different grammar might require an `OP_LP` symbol instead.
So for a variety of reasons, it's impossible to write one lexer to rule them all.
Each language needs its own lexer. But we can take shortcuts and use programs that generate specialized lexers for us from some specification. Often, tools like `lex` are used for this. If performance is not an issue, I use the regular-expressions library from the host language to cobble together a simple lexer, e.g. in Perl:
# a primitive Lisp lexer
my %tokens = (
    LPAREN  => qr/[(]/,
    RPAREN  => qr/[)]/,
    SYMBOL  => qr/[^\s()']+/,
    QUOTE   => qr/[']/,
    INTEGER => qr/[1-9][0-9]*/,
);
my $space = qr/\s+/;

my $input  = "(*foo-bar* 'baz 14)";
my $pos    = 0;
my $length = length $input;

POSITION:
while ($pos < $length) {
    pos($input) = $pos;
    $pos += length $1 and next if $input =~ /\G($space)/gc;
    for my $name (keys %tokens) {
        if ($input =~ /\G($tokens{$name})/gc) {
            print "$name -- $1\n";
            $pos += length $1;
            next POSITION;
        }
    }
    die "No token matched, just before:\n", (substr $input, $pos, 20), "\n";
}
Output:
LPAREN -- (
SYMBOL -- *foo-bar*
QUOTE -- '
SYMBOL -- baz
SYMBOL -- 14
RPAREN -- )
But we can easily change the token definitions:
# an incomplete C lexer
my %tokens = (
    LPAREN     => qr/[(]/,
    RPAREN     => qr/[)]/,
    IDENTIFIER => qr/[a-zA-Z_][a-zA-Z0-9_]*/,
    CHARACTER  => qr/['](?:[^'\\]|\\.)[']/,
    INTEGER    => qr/[1-9][0-9]*/,
    OP_STAR    => qr/[*]/,
    OP_MINUS   => qr/[-]/,
);
my $space = qr/\s+/;
which changes the output to:
LPAREN -- (
OP_STAR -- *
IDENTIFIER -- foo
OP_MINUS -- -
IDENTIFIER -- bar
OP_STAR -- *
No token matched, just before:
'baz 14)
We didn't write a new lexer, we just updated the token definitions.
What to do now:
As a preparation, understand and implement the Shunting-Yard Algorithm. This will teach you about a simple stack machine.
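One possible sketch of the algorithm in Python (left-associative binary operators only, with the token list assumed to be pre-split; function and variable names are my own):

```python
# Shunting-yard: convert infix tokens to postfix (RPN) using an operator stack.
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def shunting_yard(tokens):
    output, ops = [], []
    for tok in tokens:
        if tok in PREC:
            # Pop operators of higher or equal precedence first
            # (this makes the operators left-associative).
            while ops and ops[-1] in PREC and PREC[ops[-1]] >= PREC[tok]:
                output.append(ops.pop())
            ops.append(tok)
        elif tok == "(":
            ops.append(tok)
        elif tok == ")":
            while ops[-1] != "(":
                output.append(ops.pop())
            ops.pop()  # discard the "("
        else:  # a number
            output.append(tok)
    while ops:
        output.append(ops.pop())
    return output

shunting_yard("1 + 2 * 3".split())
# → ['1', '2', '3', '*', '+']
```

The operator stack is exactly the "simple stack machine" the exercise is meant to teach.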
Read up on grammars and languages in computer science. Learn what a rule is.
Read up on Extended Backus-Naur Form notation, even if you don't immediately understand it. This is a way to write down grammars, and grammars are a way to describe languages (e.g. programming languages).
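For a taste of the notation, here is one possible EBNF formulation of a tiny arithmetic language (conventions vary between texts):

```ebnf
expression = term , { ( "+" | "-" ) , term } ;
term       = factor , { ( "*" | "/" ) , factor } ;
factor     = number | "(" , expression , ")" ;
number     = digit , { digit } ;
digit      = "0" | "1" | "2" | "3" | "4"
           | "5" | "6" | "7" | "8" | "9" ;
```

Note how operator precedence falls out of the layering: `term` binds tighter than `expression` simply because it sits one level lower in the grammar.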
Understand the implications of the Chomsky Hierarchy. Here are the three important levels:
- Regular Languages can be matched very efficiently. Lexers usually operate on this level.
- Context-Free Languages are the basis for the description of many programming languages. Various parsing strategies can match various subsets. You will want to read up on LL, LR, and PEG at least.
- Context-Sensitive Languages are even more general, but they are vastly more difficult to parse. Note that many regex libraries like PCRE have context-sensitive features.
Note that real programming languages are not context-free – semantic whitespace as in Python, here-docs as in Bash, and optional end tags in HTML are there to mess up most elegant descriptions (but tricks exist to fake a context-free grammar even then).
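The classic illustration of the regular/context-free boundary is balanced parentheses: a fixed regular expression can check the alphabet but not arbitrarily deep nesting, while a counter (the simplest possible stack) can. A sketch:

```python
import re

# Regular: checks that only parens appear, but cannot verify nesting.
FLAT = re.compile(r'^[()]*$')

def balanced(s):
    """Context-free check: a depth counter stands in for a one-symbol stack."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:       # closing paren with no matching opener
                return False
    return depth == 0

balanced("(()())")              # True
balanced("(()")                 # False
FLAT.match("(()") is not None   # True: the regular check happily accepts it
```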
Write a simple recursive-descent parser yourself that can evaluate simple arithmetic like `1 + 2 * 3` and manages operator precedence correctly. Using a lexer is not necessary for this. Add parentheses to your language, then some functions like `abs` or `sin`. Post your solution on codereview to get tips on how to do it better.
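One possible shape of such a parser in Python, working directly on characters with no separate lexer (integer arithmetic only; all names are my own):

```python
# A minimal recursive-descent evaluator for + - * / with precedence.
def evaluate(src):
    pos = 0

    def skip_ws():
        nonlocal pos
        while pos < len(src) and src[pos].isspace():
            pos += 1

    def number():               # integers and parenthesised subexpressions
        nonlocal pos
        skip_ws()
        if pos < len(src) and src[pos] == "(":
            pos += 1            # consume "("
            value = expr()
            skip_ws()
            pos += 1            # consume ")"
            return value
        start = pos
        while pos < len(src) and src[pos].isdigit():
            pos += 1
        return int(src[start:pos])

    def term():                 # * and / bind tighter than + and -
        nonlocal pos
        value = number()
        while True:
            skip_ws()
            if pos < len(src) and src[pos] in "*/":
                op = src[pos]; pos += 1
                rhs = number()
                value = value * rhs if op == "*" else value // rhs
            else:
                return value

    def expr():                 # + and - at the lowest precedence level
        nonlocal pos
        value = term()
        while True:
            skip_ws()
            if pos < len(src) and src[pos] in "+-":
                op = src[pos]; pos += 1
                rhs = term()
                value = value + rhs if op == "+" else value - rhs
            else:
                return value

    return expr()

evaluate("1 + 2 * 3")  # → 7
```

Precedence comes from the call structure alone: `expr` calls `term`, which calls `number`, so `*` and `/` are grouped before `+` and `-` ever see them.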
Now that you've learned how to parse things yourself, you may want to try helper tools like parser generators. As an exercise, pick a simple language like JSON and write a parser for it with your tool of choice. Return to codereview for criticism.
If you have any problems on the way, ask here on programmers or over on stackoverflow, depending on whether the question is conceptual or implementation-related.
Best Answer
If you notice, in Ruby you cannot call a method named like that directly; e.g. you cannot do
You can do
Because there you can have a grammar like:
(Unrelated rules to the example left out for brevity)
to recognize it. It only requires separating the rule Identifier from IdentifierName:
If you have a starter like begin, then you have already activated a rule like

And Ruby doesn't try to figure out what you mean; it will just be a block.
But for method names where a receiver appears, or with some other prefix, it is easy to allow keywords in the grammar, and e.g. JavaScript does it too.
Grammar examples taken from ECMA-262.