Python – Creating Tokens for a Lexer

lexer, parsing, python

I'm writing a parser for a markup language that I have created (I'm writing it in Python, but that's not really relevant to this question; in fact, if this seems like a bad idea, I'd love a suggestion for a better path).

I'm reading about parsers here: http://www.ferg.org/parsing/index.html, and I'm working on writing the lexer, which should, if I understand correctly, split the content into tokens. What I'm having trouble understanding is which token types I should use and how to create them. For example, the token types in the example I linked to are:

  • STRING
  • IDENTIFIER
  • NUMBER
  • WHITESPACE
  • COMMENT
  • EOF
  • Many symbols such as { and ( count as their own token type
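
To check my understanding, here is how I imagine representing those tokens in Python (this is purely my own sketch, not something from the article, and the names are guesses):

```python
from collections import namedtuple

# My own guess at a minimal token representation: just a type tag plus the
# text that was matched. The article may structure this differently.
Token = namedtuple("Token", ["type", "text"])

# The general token types listed above...
STRING, IDENTIFIER, NUMBER, WHITESPACE, COMMENT, EOF = (
    "STRING", "IDENTIFIER", "NUMBER", "WHITESPACE", "COMMENT", "EOF")

# ...plus one type per significant symbol, e.g.
LBRACE, RBRACE = "LBRACE", "RBRACE"

# So an input like  {x 42}  might become:
#   [Token(LBRACE, "{"), Token(IDENTIFIER, "x"),
#    Token(NUMBER, "42"), Token(RBRACE, "}")]
```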

The problem I'm having is that the more general token types seem a bit arbitrary to me. For example, why is STRING its own separate token type rather than IDENTIFIER? A string could be represented as STRING_START + (IDENTIFIER | WHITESPACE) + STRING_START.

This may also have to do with the quirks of my language. For example, variable declarations are written as {var-name var value} and referenced with {var-name}. It seems like '{' and '}' should be their own tokens, but are VAR_NAME and VAR_VALUE eligible token types, or would both of these fall under IDENTIFIER? What's more, the VAR_VALUE can actually contain whitespace: the whitespace after var-name signifies the start of the value in the declaration, and any other whitespace is part of the value. Does this whitespace become its own token? It only has that meaning in this context. Moreover, { may not be the start of a variable declaration at all; it depends on the context (there's that word again!). {: starts a name declaration, and { can even appear as part of a value.
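
To make this concrete, here is a rough sketch, with entirely made-up token names, of the two ways I can imagine a declaration being tokenized:

```python
# Purely hypothetical token names, just to illustrate the two readings I can
# imagine for the declaration {greeting hello world}:

# Reading 1: the lexer understands declarations
#   LBRACE, VAR_NAME("greeting"), VAR_VALUE("hello world"), RBRACE

# Reading 2: the lexer is dumber and the parser sorts it out
#   LBRACE, IDENTIFIER("greeting"), WHITESPACE(" "),
#   IDENTIFIER("hello"), WHITESPACE(" "), IDENTIFIER("world"), RBRACE

# A use of the variable, {greeting}, would then just be:
#   LBRACE, IDENTIFIER("greeting"), RBRACE
```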

My language is similar to Python in that blocks are created with indentation. I was reading about how Python's lexer creates INDENT and DEDENT tokens (which serve more or less the same purpose that { and } do in a lot of other languages). Python claims to be context-free, which to me means that at least the lexer shouldn't care about where it is in the stream while creating tokens. How does Python's lexer know it's building an INDENT token of a specific length without knowing about previous characters (e.g. that the previous character was a newline, so it should start collecting spaces for the INDENT)? I ask because I need to know this too.

My final question is the stupidest one: why is a lexer even necessary? It seems to me that the parser could go character-by-character and figure out where it is and what it expects. Does the lexer add the benefit of simplicity?

Best Answer

Your question (as your final paragraph hints) is not really about the lexer, it is about the correct design of the interface between the lexer and the parser. As you might imagine there are many books about the design of lexers and parsers. I happen to like the parser book by Dick Grune, but it may not be a good introductory book. I happen to intensely dislike the C-based book by Appel, because the code is not usefully extensible into your own compiler (because of the memory management issues inherent in the decision to pretend C is like ML). My own introduction was the book by PJ Brown, but it's not a good general introduction (though quite good for interpreters specifically). But back to your question.

The answer is, do as much as you can in the lexer without needing to use forward- or backward-looking constraints.

This means that (depending of course on the details of the language) you should recognise a string as a '"' character, followed by a sequence of non-'"' characters, followed by another '"' character, and return that to the parser as a single unit. There are several reasons for this, but the important ones are:

  1. This reduces the amount of state the parser needs to maintain, limiting its memory consumption.
  2. This allows the lexer implementation to concentrate on recognising the fundamental building blocks and frees the parser up to describe how the individual syntactic elements are used to build a program.

Very often parsers can take immediate action on receiving a token from the lexer. For example, as soon as an IDENTIFIER is received, the parser can perform a symbol table lookup to find out whether the symbol is already known. If your parser also parses string constants as QUOTE (IDENTIFIER SPACES)* QUOTE, you will perform a lot of irrelevant symbol table lookups, or you will end up hoisting the symbol table lookups higher up the parser's tree of syntax elements, because you can only do them at the point where you're sure you are not looking at a string.
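
As a rough sketch of what I mean (the token types, the next_token method, and the symbol_table are all names I've invented for illustration, not any particular library's API):

```python
# Sketch only: a hand-written parser reacting to each token as it arrives.
symbol_table = {}

def parse_primary(lexer):
    token = lexer.next_token()
    if token.type == "IDENTIFIER":
        # The parser can act the moment the token arrives: one symbol table
        # lookup, done exactly once per identifier occurrence.
        entry = symbol_table.setdefault(token.text, {"name": token.text})
        return ("variable", entry)
    elif token.type == "STRING":
        # A string arrives from the lexer as a single unit, so no lookups
        # are ever attempted on the words inside it.
        return ("literal", token.text)
    else:
        raise SyntaxError("unexpected token: %r" % (token,))
```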

To restate what I'm trying to say, but differently, the lexer should be concerned with the spelling of things, and the parser with the structure of things.

You might notice that my description of what a string looks like seems a lot like a regular expression. This is no coincidence. Lexical analysers are frequently implemented in little languages (in the sense of Jon Bentley's excellent Programming Pearls book) which use regular expressions. I'm just used to thinking in terms of regular expressions when recognising text.
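
For example, a tiny regular-expression-driven lexer in Python might look something like the sketch below. The token names and patterns are only illustrative (you would adapt the table to your own language), and it uses nothing beyond the standard re module. Note that it also drops whitespace rather than passing it to the parser, which relates to the next point.

```python
import re

# Illustrative token table: one named regular expression per token type.
TOKEN_SPEC = [
    ("STRING",     r'"[^"]*"'),          # a whole string literal, as one unit
    ("NUMBER",     r'\d+'),
    ("IDENTIFIER", r'[A-Za-z_][A-Za-z0-9_-]*'),
    ("LBRACE",     r'\{'),
    ("RBRACE",     r'\}'),
    ("WHITESPACE", r'\s+'),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(text):
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if not match:
            raise SyntaxError("unexpected character %r at %d" % (text[pos], pos))
        if match.lastgroup != "WHITESPACE":   # drop whitespace in the lexer
            yield (match.lastgroup, match.group())
        pos = match.end()
    yield ("EOF", "")

# list(tokenize('{greeting "hello world"}')) gives:
# [('LBRACE', '{'), ('IDENTIFIER', 'greeting'),
#  ('STRING', '"hello world"'), ('RBRACE', '}'), ('EOF', '')]
```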

Regarding your question about whitespace: recognise it in the lexer. If your language is intended to be fairly free-format, don't return WHITESPACE tokens to the parser, because it will only have to throw them away, and your parser's production rules will be cluttered with noise: things to recognise just so they can be thrown away.

As for how you should handle whitespace when it is syntactically significant, I'm not sure I can make a judgment for you that will really work well without knowing more about your language. My snap judgment is to avoid cases where whitespace is sometimes significant and sometimes not, and to use some kind of delimiter (like quotes). But if you can't design the language however you prefer, this option may not be available to you.

There are other ways to design language parsing systems. Certainly there are compiler construction systems that allow you to specify a combined lexer and parser (I think the Java version of ANTLR does this), but I have never used one.

Lastly, a historical note. Decades ago, it was important for the lexer to do as much as possible before handing over to the parser, because the two programs would not fit in memory at the same time. Doing more in the lexer left more memory available to make the parser smart. I used the Whitesmiths C compiler for a number of years; if I understand correctly, it operated in only 64 KB of RAM (it was a small-model MS-DOS program), and even so it translated a variant of C that was very, very close to ANSI C.