Your question (as your final paragraph hints) is not really about the lexer, it is about the correct design of the interface between the lexer and the parser. As you might imagine there are many books about the design of lexers and parsers. I happen to like the parser book by Dick Grune, but it may not be a good introductory book. I happen to intensely dislike the C-based book by Appel, because the code is not usefully extensible into your own compiler (because of the memory management issues inherent in the decision to pretend C is like ML). My own introduction was the book by PJ Brown, but it's not a good general introduction (though quite good for interpreters specifically). But back to your question.
The answer is, do as much as you can in the lexer without needing to use forward- or backward-looking constraints.
This means that (depending of course on the details of the language) you should recognise a string as a `"` character, followed by a sequence of characters that are not `"`, followed by another `"` character. Return that to the parser as a single unit. There are several reasons for this, but the important ones are:
- This reduces the amount of state the parser needs to maintain, limiting its memory consumption.
- This allows the lexer implementation to concentrate on recognising the fundamental building blocks and frees the parser up to describe how the individual syntactic elements are used to build a program.
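To make those points concrete, here is a minimal sketch of the idea in Python (the token name and error handling are my own, not from any particular lexer):

```python
def lex_string(source, pos):
    """Recognise a double-quoted string starting at source[pos].

    Returns a single ("STRING", text) token and the position just
    past the closing quote. The parser never sees the individual
    characters between the quotes.
    """
    assert source[pos] == '"'
    end = pos + 1
    while end < len(source) and source[end] != '"':
        end += 1
    if end == len(source):
        raise SyntaxError("unterminated string literal")
    return ("STRING", source[pos:end + 1]), end + 1


# The whole string arrives at the parser as one unit:
token, next_pos = lex_string('print "hello world"', 6)
# token is ("STRING", '"hello world"')
```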
Very often parsers can take immediate actions on receiving a token from the lexer. For example, as soon as IDENTIFIER is received, the parser can perform a symbol table lookup to find out if the symbol is already known. If your parser also parses string constants as QUOTE (IDENTIFIER SPACES)* QUOTE you will perform a lot of irrelevant symbol table lookups, or you will end up hoisting the symbol table lookups higher up the parser's tree of syntax elements, because you can only do it at the point you're now sure you are not looking at a string.
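To sketch the kind of immediate action meant here (the token name and table layout are my own):

```python
symbol_table = {}

def on_identifier(name):
    """Called by the parser the moment an IDENTIFIER token arrives.

    Reuses the existing entry if the symbol is already known,
    otherwise creates one. STRING tokens never reach this code,
    so no irrelevant lookups happen for string contents.
    """
    return symbol_table.setdefault(name, {"name": name, "declared": False})
```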
To restate what I'm trying to say, but differently, the lexer should be concerned with the spelling of things, and the parser with the structure of things.
You might notice that my description of what a string looks like seems a lot like a regular expression. This is no coincidence. Lexical analysers are frequently implemented in little languages (in the sense of Jon Bentley's excellent Programming Pearls book) which use regular expressions. I'm just used to thinking in terms of regular expressions when recognising text.
Regarding your question about whitespace: recognise it in the lexer. If your language is intended to be fairly free-format, don't return WHITESPACE tokens to the parser, because it will only have to throw them away, and your parser's production rules will essentially be spammed with noise: things to recognise just to throw away.
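Putting the two previous points together, here is a minimal regex-based lexer sketch (the token set is hypothetical) that recognises strings as single units and silently discards whitespace so the parser never sees it:

```python
import re

# Hypothetical token set; a real lexer has many more rules.
TOKEN_RE = re.compile(r'''
      (?P<STRING>     "[^"]*"      )
    | (?P<IDENTIFIER> [A-Za-z_]\w* )
    | (?P<WHITESPACE> \s+          )
''', re.VERBOSE)

def tokenize(source):
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
        if m.lastgroup != "WHITESPACE":  # skipped; never reaches the parser
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```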
As for what that means about how you should handle whitespace when it is syntactically significant, I'm not sure I can make a judgment for you that will really work well without knowing more about your language. My snap judgment is to avoid cases where whitespace is sometimes important and sometimes not, and use some kind of delimiter (like quotes). But, if you can't design the language any which way you prefer, this option may not be available to you.
There are other ways to design language parsing systems. Certainly there are compiler construction systems that allow you to specify a combined lexer and parser (I think the Java version of ANTLR does this), but I have never used one.
Last a historical note. Decades ago, it was important for the lexer to do as much as possible before handing over to the parser, because the two programs would not fit in memory at the same time. Doing more in the lexer left more memory available to make the parser smart. I used to use the Whitesmiths C Compiler for a number of years, and if I understand correctly, it would operate in only 64KB of RAM (it was a small-model MS-DOS program) and even so it translated a variant of C that was very very close to ANSI C.
Here is some code of mine that runs without raising `TypeError: __call__() missing 1 required positional argument: 'self'`. How do you create this error?
    >>> class X(Exception):
            def __call__(self):
                raise self

    >>> x = X()
    >>> x()
    Traceback (most recent call last):
      File "<pyshell#5>", line 1, in <module>
        x()
      File "<pyshell#3>", line 3, in __call__
        raise self
    X
    >>> class Other(X):
            pass

    >>> o = Other()
    >>> o()
    Traceback (most recent call last):
      File "<pyshell#12>", line 1, in <module>
        o()
      File "<pyshell#3>", line 3, in __call__
        raise self
    Other
Best Answer
Including the type name inside the function is vastly preferable, because it allows the programmer to specify precisely what they intend. It also makes your code a lot easier to test. If you do want implicit dispatch depending on the type, you can always offer an optional wrapper. So I would restructure your code along those lines: one explicitly named function per type, plus an optional dispatching wrapper.
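The answer's original code block is not preserved in this copy, so the following is only my reconstruction of the shape being suggested (the function names and behaviour are hypothetical): one explicitly named function per type, plus an optional wrapper that dispatches on `type()`:

```python
# Hypothetical names; the original code from the answer is not preserved.
def flip_string(value: str) -> str:
    return value[::-1]

def flip_list(value: list) -> list:
    return list(reversed(value))

# Optional convenience wrapper for callers who want implicit dispatch.
_FLIP_IMPLS = {str: flip_string, list: flip_list}

def flip(value):
    try:
        impl = _FLIP_IMPLS[type(value)]
    except KeyError:
        raise TypeError(f"cannot flip {type(value).__name__}") from None
    return impl(value)
```

The explicit functions remain trivially testable on their own; only callers who want the implicit behaviour pay for the type-based lookup.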
Incidentally, you have effectively re-invented method dispatch. This run-time dispatch over function arguments is slightly different from “traditional” method dispatch in that the dispatch is handled by the function that is called, not by the value that the function is called on. In its generalization for more than one argument, this mechanism is known as multi-methods. The rest of this answer discusses argument-based method dispatch in the context of Java and Python.
Your Java example exhibits function overloading, also known as ad-hoc polymorphism. In most languages, ad-hoc polymorphism is part of static dispatch, i.e. the overload is resolved at compile time. When you declare two functions that share a name but differ in their parameter types, the compiler recognizes them as two distinct functions that happen to share a name, and stores them internally under distinct, type-mangled names, something like `flip_String`.
When the compiler then parses a function call `flip(x)`, it will first resolve the static type of `x`. Because `flip` is overloaded, it will try to see which overloads actually match. How that works depends on the exact rules of the language. E.g. some languages allow implicit conversions, so the types might not match exactly. In type systems with subtyping, a type will also match all its base types. This can cause multiple overloads to be legal call targets, so many languages have some kind of ranking method: e.g. calls that don't require conversions are preferred, and with subtyping, the most specific type is preferred. So finally, if the term `x` has type `String`, the overload `flip_String` would be chosen.

Of course, the static nature of this dispatch restricts its expressiveness. E.g. assume that class `A` inherits from `B`. We now have two functions `void f(A)` and `void f(B)`. Given `B b = new A(); f(b)`, the overload chosen will always be `f(B)`, since that's the static type of the variable `b`, even if its runtime value is of the more specific type `A`.

Everything[1] that can be done statically in ahead-of-time compilation can also be done dynamically at runtime. While there might not be a static type in dynamic languages, the dispatch mechanism now has access to the precise dynamic type of a value. This means that runtime overload resolution can be more expressive than static resolution: we can dispatch on the dynamic type of values. The mechanism is otherwise identical to the static version: the dispatch mechanism needs a set of available dispatch targets, and when the overloaded function is called, the correct overload is selected.
[1]: Some languages allow return-type polymorphism, which is admittedly a bit difficult to do in dynamic languages unless the language is also lazy and supports symbolic execution.
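For contrast with the static Java behaviour above, here is the dynamic equivalent using Python's standard `functools.singledispatch`, which dispatches on the dynamic type of the first argument:

```python
from functools import singledispatch

class B:
    pass

class A(B):  # A is a subtype of B, as in the text's example
    pass

@singledispatch
def f(value):
    raise TypeError("no overload for " + type(value).__name__)

@f.register(B)
def _(value):
    return "f(B)"

@f.register(A)
def _(value):
    return "f(A)"

# There is no static type here: the variable could hold an A or a B.
b = A()
print(f(b))  # prints f(A): the dynamic type wins, unlike the Java example
```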
This dynamic overload resolution is related to method dispatch. When we have a class `A` that is a subtype of class `B` and `x` is an instance of either of those classes, then `x.foo()` can do something entirely different depending on what its dynamic type is. The visitor pattern can be used to generalize this single dispatch to double dispatch, where a call `x.visit(y)` is resolved based on the dynamic type of both `x` and `y`. This can be generalized further to multimethods, where a function is dispatched on the dynamic types of all its arguments. Unsurprisingly, Python already has a package for multimethods.
Some languages use multimethods as their primary dispatch mechanism, e.g. the Common Lisp Object System, or the Julia language. However, most cases of polymorphism can be either resolved statically, or simplified to single dispatch. Therefore, built-in language support for multimethods tends to have negligible performance impact when compared with multimethods that are added to an existing language via metaprogramming.
Implementing multi-methods can be non-trivial, because two separate overloads might look like equally good candidates. Again, assuming that `A` is a subtype of `B`, consider two overloads `foo(A, B)` and `foo(B, A)`, called with two arguments that both have dynamic type `A`: both candidates would match equally well. There are various approaches to resolve this. One solution might be to find the overall “best” match using some kind of subtyping distance, but that tends to be hard for programmers to understand. It is more reasonable to look at all arguments from left to right, and order the overloads as soon as one candidate is a better match for a single argument. E.g. in this example, the first argument is of type `A`, so `foo(A, B)` would be preferred over `foo(B, A)`. The second argument matches both of these cases and does not change the existing order.

The mentioned Python module ignores these difficulties, and instead dispatches to the first registered match, no matter how well it actually matches. This requires the user to put the more specific cases before the more general ones. The module's main dispatch logic boils down to trying each registered overload in order and calling the first whose parameter types match.
You can see the full source yourself under r0k3/multimethods on GitHub, or install the module via `pip install multimethods`.