Your grammar has ambiguities that make it impossible to know what to do with, say, the letter `a` without context. In your case, the string `abc` can have two interpretations: it can be an identifier (I'm assuming that's what your first `m//` defines), or it can be part of a string literal quoted in your `{ ... }` notation (I'll call that a "quoted list"). Lexical analyzers (tokenizers) aren't smart enough to handle that kind of ambiguity, because their concept of context is very simplistic. Parsers, on the other hand, can understand context at very deep levels.*
Language designers sometimes add sigils to their identifiers (e.g., `$abc`) to make them easier to tokenize. This is why you can have a Perl variable named `$for` even though bare-naked `for` has special meaning. For similar reasons, C lexers tokenize `/"[^"]*"/` into a string literal because it has a context-independent syntax that doesn't appear anywhere else in the language.
Back to your problem: prematurely tokenizing a string of alphanumerics into an `IDENTIFIER` would mean the quoted list `{ abc[1]xyz }` would be fed to the parser as `{ IDENTIFIER [ NUMBER ] IDENTIFIER }`. That's useful if those were the chunks you needed, but otherwise you'd have to teach the grammar for your quoted list to accept every possible combination of those tokens, and then reassemble them back into string literals. If you haven't guessed by now, that would get complex and ugly very quickly. But because parsers understand context, putting that wisdom there makes it clean and easy.
For what you're doing, there shouldn't be much of a tokenizer at all, because so much of it is context-sensitive, and that's all parser territory. Whitespace doesn't seem to matter except in the context of a quoted list, so you could tokenize that, as well as things that aren't ambiguous like `LETTER` and `DIGIT`.
```
// NOTE: This code doesn't handle the case where whitespace is
// interspersed with the tokens. See the comments.

quoted-list ::= '{' quoted-list-item-set '}'

quoted-list-item-set ::=
      <nothing>
    | string-of-non-whitespace-characters
    | string-of-non-whitespace-characters WHITESPACE quoted-list-item-set

// This ends up being things you have to put together and return,
// so that eventually you end up with a single string.
string-of-non-whitespace-characters ::=
      non-whitespace-character
    | non-whitespace-character string-of-non-whitespace-characters

non-whitespace-character ::= <anything in the set '!'..'~'>

identifier ::= LETTER alphanumeric-string

alphanumeric-string ::=
      <nothing>
    | alphanumeric alphanumeric-string

alphanumeric ::= LETTER | DIGIT

// ...etc...

// This prevents the parser from barfing on whitespace in any other context.
things-that-get-ignored ::= WHITESPACE
```
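The grammar above could be implemented as a hand-rolled recursive-descent parser. Here is a minimal sketch in Python (my language choice, not specified in the question); to keep the close brace unambiguous, it treats `}` as ending the list rather than as part of an item, and joining the items into final string literals is left out:

```python
def parse_quoted_list(text):
    """Parse '{ item1 item2 ... }' into a list of string literals."""
    pos = 0

    def skip_ws():
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    skip_ws()
    if pos >= len(text) or text[pos] != '{':
        raise SyntaxError("expected '{'")
    pos += 1

    items = []
    while True:
        skip_ws()
        if pos >= len(text):
            raise SyntaxError("expected '}'")
        if text[pos] == '}':
            return items
        # string-of-non-whitespace-characters: anything in '!'..'~',
        # except '}' so we know where the list ends
        start = pos
        while pos < len(text) and '!' <= text[pos] <= '~' and text[pos] != '}':
            pos += 1
        items.append(text[start:pos])
```

Note how `abc[1]xyz` comes back as one string literal, because the parser knows it is inside a quoted list.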
*This is why you should use a parser to interpret something complex like XML and not fall into the trap of trying to understand it with regular expressions.
Lexers are just simple parsers that are used as a performance optimisation for the main parser. If we have a lexer, the lexer and the parser work together to describe the complete language. Parsers that don't have a separate lexing stage are sometimes called “scannerless”.
Without lexers, the parser would have to operate on a character-by-character basis. Since the parser has to store metadata about every input item, and may have to pre-calculate tables for every input item state, this would result in unacceptable memory consumption for large input sizes. With tokens, by contrast, we don't need a separate node per character in the abstract syntax tree.
Since text on a character-by-character basis is fairly ambiguous, this would also result in much more ambiguity that is annoying to handle. Imagine a rule `R → identifier | "for " identifier`, where identifier is made up from ASCII letters. If I want to avoid ambiguity, I now need a 4-character lookahead to determine which alternative should be chosen. With a lexer, the parser just has to check whether it has an IDENTIFIER or FOR token – a 1-token lookahead.
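A hedged sketch of that token-level disambiguation (Python is my choice here; the FOR/IDENTIFIER names follow the text). Trying the keyword alternative first, with a word boundary so `forx` stays an identifier, is what resolves the ambiguity before the parser ever sees it:

```python
import re

# Try the keyword first; \b prevents "forx" from matching FOR.
TOKEN_RE = re.compile(r"\s*(?:(?P<FOR>for\b)|(?P<IDENTIFIER>[A-Za-z]+))")

def tokenize(src):
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if not m:
            raise SyntaxError(f"unexpected input at {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens
```

The parser now distinguishes the two alternatives of R by peeking at a single token kind.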
Two-level grammars.
Lexers work by translating the input alphabet to a more convenient alphabet.
A scannerless parser describes a grammar (N, Σ, P, S) where the non-terminals N are the left hand sides of the rules in the grammar, the alphabet Σ is e.g. ASCII characters, the productions P are the rules in the grammar, and the start symbol S is the parser's top level rule.
The lexer now defines an alphabet of tokens a, b, c, …. This allows the main parser to use these tokens as alphabet: Σ = {a, b, c, …}. For the lexer, these tokens are non-terminals, and its start rule is SL → ε | a SL | b SL | c SL | …, that is: any sequence of tokens. The rules in the lexer grammar are all rules necessary to produce these tokens.
The performance advantage comes from expressing the lexer's rules as a regular language. These can be parsed much more efficiently than context-free languages. In particular, regular languages can be recognized in O(n) time using O(1) extra space. In practice, a code generator can turn such a lexer into highly efficient jump tables.
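As a hedged illustration of the jump-table idea (sketched in Python; a lexer generator would emit the transition table automatically), here is a hand-built DFA that recognizes string literals of the shape `"[^"]*"` in a single left-to-right pass:

```python
# DFA states for the token '"' [^"]* '"'
START, IN_STRING, DONE, ERROR = range(4)

def step(state, ch):
    """One transition of the DFA; a generator would encode this as a table."""
    if state == START:
        return IN_STRING if ch == '"' else ERROR
    if state == IN_STRING:
        return DONE if ch == '"' else IN_STRING
    return ERROR  # no transitions out of DONE: trailing input is an error

def is_string_token(text):
    state = START
    for ch in text:          # O(n) time, O(1) extra space
        state = step(state, ch)
    return state == DONE
```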
Extracting tokens from your grammar.
To touch on your example: the `digit` and `string` rules are expressed on a character-by-character level. We could use those as tokens. The rest of the grammar stays intact. Here's the lexer grammar, written as a right-linear grammar to make it clear that it's regular:
```
digit       = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
string      = '"' , string-rest ;
string-rest = '"' | STRING-CHAR , string-rest ;
STRING-CHAR = ? all visible characters ? - '"' ;
```
But since it's regular, we would usually use regular expressions to express the token syntax. Here are the above token definitions as regexes, written using .NET character class exclusion syntax and POSIX charclasses:
```
digit  ~ [0-9]
string ~ "[[:print:]-["]]*"
```
The grammar for the main parser then contains the remaining rules not handled by the lexer. In your case, that's just:
```
input = digit | string ;
```
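To show how little work is left for each side, here is a sketch in Python (my choice of language; note Python's `re` has no .NET-style class subtraction, so the string token uses a negated character class instead):

```python
import re

# Lexer level: the two regular token definitions.
TOKENS = [
    ("digit",  re.compile(r"[0-9]")),
    ("string", re.compile(r'"[^"]*"')),
]

def lex(src):
    for name, pattern in TOKENS:
        m = pattern.fullmatch(src)
        if m:
            return (name, m.group())
    raise SyntaxError("no token matches")

# Parser level: all that remains of  input = digit | string ;
def parse_input(src):
    kind, value = lex(src)
    return {"kind": kind, "value": value}
```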
When lexers can't be used easily.
When designing a language, we would usually take care that the grammar can be separated cleanly into a lexer level and a parser level, and that the lexer level describes a regular language. This is not always possible.
When embedding languages. Some languages allow you to interpolate code into strings: `"name={expression}"`. The expression syntax is part of the context-free grammar and therefore can't be tokenized by a regular expression. To solve this, we either recombine the parser with the lexer, or we introduce additional tokens like `STRING-CONTENT, INTERPOLATE-START, INTERPOLATE-END`. The grammar rule for a string might then look like: `String → STRING-START STRING-CONTENTS { INTERPOLATE-START Expression INTERPOLATE-END STRING-CONTENTS } STRING-END`. Of course the Expression may contain other strings, which leads us to the next problem.
When tokens could contain each other. In C-like languages, keywords are indistinguishable from identifiers. This is solved in the lexer by prioritizing keywords over identifiers. Such a strategy isn't always possible. Imagine a config file where `Line → IDENTIFIER " = " REST`, where the rest is any character until the end of the line, even if the rest looks like an identifier. An example line would be `a = b c`. The lexer is really dumb and does not know in which order the tokens may occur. So if we prioritize IDENTIFIER over REST, the lexer would give us `IDENT(a), " = ", IDENT(b), REST( c)`. If we prioritize REST over IDENTIFIER, the lexer would just give us `REST(a = b c)`.
To solve this, we have to recombine the lexer with the parser. The separation can be maintained somewhat by making the lexer lazy: each time the parser needs the next token, it requests it from the lexer and tells the lexer the set of acceptable tokens. Effectively, we are creating a new top-level rule for the lexer grammar for each position. Here, this would result in the calls `nextToken(IDENT), nextToken(" = "), nextToken(REST)`, and everything works fine. This requires a parser that knows the complete set of acceptable tokens at each location, which implies a bottom-up parser like LR.
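A hedged sketch of such a lazy lexer for the config-file example (Python; `EQUALS` is my name for the `" = "` token, and the class/method names are illustrative, not from any real library):

```python
import re

PATTERNS = {
    "IDENTIFIER": re.compile(r"[A-Za-z][A-Za-z0-9]*"),
    "EQUALS":     re.compile(r" = "),
    "REST":       re.compile(r"[^\n]*"),
}

class LazyLexer:
    def __init__(self, src):
        self.src, self.pos = src, 0

    def next_token(self, acceptable):
        """The parser passes in the set of token kinds valid here."""
        for kind in acceptable:
            m = PATTERNS[kind].match(self.src, self.pos)
            if m:
                self.pos = m.end()
                return (kind, m.group())
        raise SyntaxError(f"expected one of {acceptable} at {self.pos}")

def parse_line(src):
    lexer = LazyLexer(src)
    ident = lexer.next_token(["IDENTIFIER"])
    lexer.next_token(["EQUALS"])
    rest = lexer.next_token(["REST"])
    return (ident[1], rest[1])
```

Because the parser never asks for REST at the start of the line, `a = b c` is split correctly.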
When the lexer has to maintain state. E.g. the Python language delimits code blocks not by curly braces, but by indentation. There are ways to handle layout-sensitive syntax within a grammar, but those techniques are overkill for Python. Instead, the lexer checks the indentation of each line, and emits INDENT tokens if a new indented block is found, and DEDENT tokens if the block has ended. This simplifies the main grammar because it can now pretend those tokens are like curly braces. The lexer however now needs to maintain state: the current indentation. This means the lexer technically no longer describes a regular language, but actually a context-sensitive language. Luckily this difference is not relevant in practice, and Python's lexer can still work in O(n) time.
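A hedged sketch of that stateful lexer (Python; tabs, blank lines, and inconsistent-dedent errors are ignored for brevity, unlike the real Python tokenizer):

```python
def indent_tokens(lines):
    """Emit (kind, value) tokens; the indentation stack is the lexer's state."""
    stack = [0]
    tokens = []
    for line in lines:
        indent = len(line) - len(line.lstrip(" "))
        if indent > stack[-1]:
            stack.append(indent)
            tokens.append(("INDENT", indent))
        while indent < stack[-1]:
            stack.pop()
            tokens.append(("DEDENT", stack[-1]))
        tokens.append(("LINE", line.strip()))
    while len(stack) > 1:            # close any blocks still open at EOF
        stack.pop()
        tokens.append(("DEDENT", stack[-1]))
    return tokens
```

To the main grammar, INDENT/DEDENT now behave exactly like `{`/`}`.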
Best Answer
Yes, a BNF grammar contains all the rules you need for lexical analysis and parsing. The difference between the two is a little fuzzy. A good example of a lexical rule in EBNF would be:
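The example itself appears to be missing from this answer; going by the description that follows (an optional "-" followed by one or more digits), a plausible reconstruction of the lexical rule in EBNF would be:

```
number = [ "-" ], digit, { digit } ;
digit  = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
```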
Usually lexers can be implemented using relatively simple code. You can search a string for the next space, then see if your result starts with an optional "-", contains at least one digit after that, and contains only digits after that. Lexers used to almost always be a separate step, but are usually lumped together with the parser nowadays. Hence the fuzziness.
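The scan described above can be sketched in a few lines (Python is my choice here; the function name is illustrative):

```python
def next_number(src):
    """Return the leading number token and the remaining input."""
    end = src.find(" ")              # search for the next space
    if end == -1:
        end = len(src)
    word = src[:end]
    # optional "-", then at least one digit, and only digits after that
    digits = word[1:] if word.startswith("-") else word
    if digits and digits.isdigit():
        return word, src[end:].lstrip(" ")
    raise SyntaxError(f"not a number: {word!r}")
```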
A parser rule would use the `number` non-terminal to make something larger, like the following addition expression. Even though they are mixed up in the same file, your professor is still going to want to see a clear distinction between "lexer" rules and "parser" rules. For example, don't do this:
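Both examples appear to be missing from this answer; a plausible reconstruction in the same EBNF style would be a parser rule built from the `number` token, followed by the kind of mixed rule the paragraph warns against:

```
(* parser rule: builds on the number token *)
addition = number, "+", number ;

(* don't do this: lexical detail spliced into a parser rule *)
addition = [ "-" ], digit, { digit }, "+", [ "-" ], digit, { digit } ;
```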
Not only is that error prone, it's difficult to read and difficult to implement.