Writing a Lexer Based on Grammar – Procedure

design, design-patterns, development-process, grammar, lexer

While reading through an answer to the question Clarification about Grammars, Lexers and Parsers, I noticed that it stated:

[…] a BNF grammar contains all the rules you need for lexical analysis and parsing.

This came across as somewhat odd to me because, up until now, I'd always thought that a lexer was not based upon a grammar at all, while a parser was heavily based upon one. I had come to this conclusion after reading numerous blog posts about writing lexers, not one of which ever used EBNF/BNF¹ as a basis for design.

If lexers, as well as parsers, are based upon an EBNF/BNF grammar, then how would one go about creating a lexer using that method? That is, how would I construct a lexer using a given EBNF/BNF grammar?

I've seen many, many posts that deal with writing a parser using EBNF/BNF as a guide or a blueprint, but so far I've come across none that show the equivalent for lexer design.

For example, take the following grammar:

input = digit | string ;
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
string = '"', { all characters - '"' }, '"' ;
all characters = ? all visible characters ? ;

How would one create a lexer that is based on the grammar? I could imagine how a parser could be written from such a grammar, but I fail to grasp the concept of doing the same with a lexer.

Are there certain rules or logic used to accomplish a task such as this, as with writing a parser? Frankly, I'm beginning to wonder whether lexer designs use an EBNF/BNF grammar at all, let alone are based upon one.


¹ Extended Backus–Naur form and Backus–Naur form

Best Answer

Lexers are just simple parsers that are used as a performance optimisation for the main parser. If we have a lexer, the lexer and the parser work together to describe the complete language. Parsers that don't have a separate lexing stage are sometimes called “scannerless”.

Without lexers, the parser would have to operate on a character-by-character basis. Since the parser has to store metadata about every input item, and may have to pre-calculate tables for every input item's state, this would result in unacceptable memory consumption for large inputs. With a lexer in front, the parser also doesn't need a separate node per character in the abstract syntax tree.

Since text is fairly ambiguous on a character-by-character basis, this would also result in much more ambiguity that is annoying to handle. Imagine a rule R → identifier | "for " identifier, where identifier is made up of ASCII letters. If I want to avoid ambiguity, I now need a 4-character lookahead to determine which alternative should be chosen. With a lexer, the parser just has to check whether it has an IDENTIFIER or a FOR token – a 1-token lookahead.
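
As a rough sketch of that difference, here is what the 1-token decision looks like once a lexer is in front of the parser. This is minimal Python of my own, and the token kinds FOR and IDENTIFIER are hypothetical names, not something defined by the question's grammar:

# Tokens are (kind, text) pairs as they might come out of a lexer.
def parse_r(tokens, pos=0):
    # R -> identifier | "for " identifier: one token of lookahead decides.
    kind, text = tokens[pos]
    if kind == "FOR":
        return ("for", tokens[pos + 1][1])
    if kind == "IDENTIFIER":
        return ("identifier", text)
    raise SyntaxError(f"unexpected token {kind}")

print(parse_r([("FOR", "for"), ("IDENTIFIER", "x")]))  # ('for', 'x')
print(parse_r([("IDENTIFIER", "fortune")]))            # ('identifier', 'fortune')

The second call shows why the character-level version needs the 4-character lookahead: the input starts with the letters f, o, r, and only the fourth character reveals that "fortune" is an identifier rather than a for construct.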

Two-level grammars.

Lexers work by translating the input alphabet to a more convenient alphabet.

A scannerless parser describes a grammar (N, Σ, P, S) where the non-terminals N are the left-hand sides of the rules in the grammar, the alphabet Σ is e.g. ASCII characters, the productions P are the rules in the grammar, and the start symbol S is the parser's top-level rule.

The lexer now defines an alphabet of tokens a, b, c, …. This allows the main parser to use these tokens as its alphabet: Σ = {a, b, c, …}. For the lexer, these tokens are non-terminals, and the start rule S_L is S_L → ε | a S_L | b S_L | c S_L | …, that is: any sequence of tokens. The rules in the lexer grammar are all the rules necessary to produce these tokens.

The performance advantage comes from expressing the lexer's rules as a regular language. These can be parsed much more efficiently than context-free languages. In particular, a regular language can be recognized in O(n) time using only constant working memory. In practice, a code generator can turn such a lexer into highly efficient jump tables.
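
To make the jump-table remark concrete, here is a minimal sketch of a table-driven recognizer for the string token from the question's grammar. The state names and the transition function are illustrative choices of mine; a real lexer generator would encode the same automaton as a flat lookup table indexed by state and input character:

START, IN_STRING, DONE = 0, 1, 2

def next_state(state, ch):
    if state == START:
        return IN_STRING if ch == '"' else None
    if state == IN_STRING:
        return DONE if ch == '"' else IN_STRING
    return None  # DONE is accepting; there are no further transitions

def match_string(text, pos=0):
    # Returns the end position of a string token starting at pos, or None.
    state, i = START, pos
    while i < len(text):
        nxt = next_state(state, text[i])
        if nxt is None:
            break
        state, i = nxt, i + 1
        if state == DONE:
            return i
    return None

print(match_string('"hello" rest'))  # 7

Because the automaton only ever inspects the current state and the current character, it runs in a single left-to-right pass, which is where the O(n) bound comes from.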

Extracting tokens from your grammar.

To touch on your example: the digit and string rules are expressed on a character-by-character level. We could use those as tokens. The rest of the grammar stays intact. Here's the lexer grammar, written as a right-linear grammar to make it clear that it's regular:

digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
string = '"' , string-rest ;
string-rest = '"' | STRING-CHAR, string-rest ;
STRING-CHAR = ? all visible characters ? - '"' ;
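
If we coded a lexer directly from that right-linear grammar, it might look like the following sketch. The Token type, the token names DIGIT and STRING, and the whitespace skipping are additions of mine, not part of the question's grammar:

from typing import List, NamedTuple

class Token(NamedTuple):
    kind: str   # "DIGIT" or "STRING"
    text: str

def lex(source: str) -> List[Token]:
    tokens, i = [], 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():                 # not in the grammar: skip whitespace between tokens
            i += 1
        elif ch.isdigit():               # digit = "0" | ... | "9"
            tokens.append(Token("DIGIT", ch))
            i += 1
        elif ch == '"':                  # string = '"' , string-rest
            j = i + 1
            while j < len(source) and source[j] != '"':
                j += 1                   # string-rest = STRING-CHAR , string-rest
            if j == len(source):
                raise SyntaxError("unterminated string")
            tokens.append(Token("STRING", source[i:j + 1]))   # string-rest = '"'
            i = j + 1
        else:
            raise SyntaxError(f"unexpected character {ch!r}")
    return tokens

print(lex('7 "hello"'))
# [Token(kind='DIGIT', text='7'), Token(kind='STRING', text='"hello"')]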

But since it's regular, we would usually use regular expressions to express the token syntax. Here are the above token definitions as regexes, written using .NET character class subtraction syntax and POSIX character classes:

digit ~ [0-9]
string ~ "[[:print:]-["]]*"
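
The same token definitions can be wired together with any regex library. Here is a small sketch in Python; the SKIP rule for whitespace is my addition, and [^"] stands in for the class-subtraction form above, since Python's standard re module supports neither POSIX classes nor class subtraction:

import re

TOKEN_RE = re.compile(r'''
    (?P<DIGIT>[0-9])
  | (?P<STRING>"[^"]*")
  | (?P<SKIP>\s+)
''', re.VERBOSE)

def tokenize(source):
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if not m:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()
        pos = m.end()

print(list(tokenize('3 "abc"')))  # [('DIGIT', '3'), ('STRING', '"abc"')]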

The grammar for the main parser then contains the remaining rules not handled by the lexer. In your case, that's just:

input = digit | string ;
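
On top of either lexer sketch, the remaining parser has almost nothing left to do. A minimal version, assuming the (kind, text) token pairs produced above:

def parse_input(tokens):
    # input = digit | string ; the parser only ever looks at token kinds.
    if len(tokens) != 1:
        raise SyntaxError("expected exactly one digit or string")
    kind, text = tokens[0]
    if kind in ("DIGIT", "STRING"):
        return (kind.lower(), text)
    raise SyntaxError(f"unexpected token {kind}")

print(parse_input([("STRING", '"hello"')]))  # ('string', '"hello"')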

When lexers can't be used easily.

When designing a language, we would usually take care that the grammar can be separated cleanly into a lexer level and a parser level, and that the lexer level describes a regular language. This is not always possible.

  • When embedding languages. Some languages allow you to interpolate code into strings: "name={expression}". The expression syntax is part of the context-free grammar and therefore can't be tokenized by a regular expression. To solve this, we either recombine the parser with the lexer, or we introduce additional tokens like STRING-CONTENT, INTERPOLATE-START, INTERPOLATE-END. The grammar rule for a string might then look like: String → STRING-START STRING-CONTENT { INTERPOLATE-START Expression INTERPOLATE-END STRING-CONTENT } STRING-END. Of course the Expression may contain other strings, which leads us to the next problem.

  • When tokens could contain each other. In C-like languages, keywords are indistinguishable from identifiers. This is solved in the lexer by prioritizing keywords over identifiers. Such a strategy isn't always possible. Imagine a config file where Line → IDENTIFIER " = " REST, where the rest is any sequence of characters up to the end of the line, even if it looks like an identifier. An example line would be a = b c. The lexer is really dumb and does not know in which order the tokens may occur. So if we prioritize IDENTIFIER over REST, the lexer would give us IDENT(a), " = ", IDENT(b), REST( c). If we prioritize REST over IDENTIFIER, the lexer would just give us REST(a = b c).

    To solve this, we have to recombine the lexer with the parser. The separation can be maintained somewhat by making the lexer lazy: each time the parser needs the next token, it requests it from the lexer and tells the lexer the set of acceptable tokens. Effectively, we are creating a new top-level rule for the lexer grammar for each position. Here, this would result in the calls nextToken(IDENT), nextToken(" = "), nextToken(REST), and everything works fine. This requires a parser that knows the complete set of acceptable tokens at each location, which implies a bottom-up parser like LR.

  • When the lexer has to maintain state. For example, the Python language delimits code blocks not by curly braces, but by indentation. There are ways to handle layout-sensitive syntax within a grammar, but those techniques are overkill for Python. Instead, the lexer checks the indentation of each line, and emits INDENT tokens if a new indented block is found, and DEDENT tokens if the block has ended. This simplifies the main grammar because it can now pretend those tokens are like curly braces. However, the lexer now needs to maintain state: the current indentation. This means the lexer technically no longer describes a regular language, but actually a context-sensitive language. Luckily this difference is not relevant in practice, and Python's lexer can still work in O(n) time.
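
To make that last point concrete, here is a heavily simplified sketch of an indentation-tracking lexer in Python. The token names are mine, and a real tokenizer (such as CPython's) additionally has to deal with tabs, continuation lines, comments and blank-line rules:

def indent_tokens(lines):
    # The indentation stack is the extra state that pushes this
    # beyond a plain regular language.
    stack = [0]
    for line in lines:
        if not line.strip():
            continue                      # ignore blank lines
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            yield ("INDENT", width)
        while width < stack[-1]:
            stack.pop()
            yield ("DEDENT", width)
        yield ("LINE", line.strip())
    while len(stack) > 1:                 # close any blocks still open at EOF
        stack.pop()
        yield ("DEDENT", 0)

src = ["if x:", "    y = 1", "z = 2"]
print(list(indent_tokens(src)))
# [('LINE', 'if x:'), ('INDENT', 4), ('LINE', 'y = 1'),
#  ('DEDENT', 0), ('LINE', 'z = 2')]

The main grammar can then treat INDENT and DEDENT exactly like an opening and a closing curly brace.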