Parsing – Proper Separation Between Lexing and Parsing

parsing

I am currently writing a parser which, given a source file, turns it into an AST of some language, following the idiomatic process of lexing and then parsing with well-known parser generators (think lex and yacc). However, I am unsure how to decide which steps should happen during lexing and which during parsing.

Consider the statement int x = 0xFFFFFFFF;. I am unsure how to lex the value correctly. Should I convert it to an integer at lex time, thereby lexing INT[-1], or let the parser deal with it and lex something akin to HEX_INT[0xF..]?

Alternatively, consider the statement char c = '\u0048';. Should this be converted to the corresponding UTF-8 byte sequence during lexing, or should the escape sequence be passed along for the parser to handle?

Any guidelines or recommendations on correctly drawing the line between parsing and lexing would be very helpful.

Best Answer

The distinction between lexing and parsing is just a matter of convenience. As such, do whatever is convenient for your problem.

In particular, it is never necessary to use a lexer. In principle, a parser could just as well work on a per-character basis instead of a per-token basis. There are languages that can't be tokenized by an ordinary lexer (e.g. anything with contextual keywords), but if you can use one, the parser's memory usage tends to be lower (since there are fewer tokens than characters), and the grammar tends to be simpler (since the lexer handles the token grammar and usually eliminates comments, insignificant whitespace, and so on).
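To make that last point concrete, here is a minimal sketch of a hand-written lexer loop in C. The token set (TOK_INT, TOK_IDENT, ...) and the function name next_token are made up for illustration; the point is that whitespace and line comments are consumed inside the lexer and never become tokens the parser has to know about.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical token set for this sketch. */
typedef enum { TOK_EOF, TOK_INT, TOK_IDENT, TOK_PUNCT } TokenType;

typedef struct {
    TokenType type;
    const char *start;   /* points into the source buffer */
    int length;
} Token;

/* Return the next token, advancing *src past it.  Whitespace and line
 * comments are consumed here and never reach the parser. */
Token next_token(const char **src) {
    const char *p = *src;

    for (;;) {
        while (isspace((unsigned char)*p)) p++;      /* skip whitespace */
        if (p[0] == '/' && p[1] == '/') {            /* skip // comment */
            while (*p && *p != '\n') p++;
            continue;
        }
        break;
    }

    Token t = { TOK_EOF, p, 0 };
    if (*p == '\0') { *src = p; return t; }

    if (isdigit((unsigned char)*p)) {                /* 123, 0xFF, ... */
        t.type = TOK_INT;
        while (isalnum((unsigned char)*p)) p++;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        t.type = TOK_IDENT;
        while (isalnum((unsigned char)*p) || *p == '_') p++;
    } else {
        t.type = TOK_PUNCT;                          /* single-char operator */
        p++;
    }

    t.length = (int)(p - t.start);
    *src = p;
    return t;
}

int main(void) {
    const char *src = "int x = 0xFFFFFFFF; // a comment the parser never sees";
    for (;;) {
        Token t = next_token(&src);
        if (t.type == TOK_EOF) break;
        printf("type=%d text='%.*s'\n", (int)t.type, t.length, t.start);
    }
    return 0;
}
```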

Regarding conversion of literals, this depends on the language. There are languages where literals don't have an inherent type (e.g. in Haskell 2 can be any numeric type, depending on context). There are languages with user-defined literal parsing (e.g. C++). There are languages where strings can contain interpolated code or locale-sensitive casing operators. All of these are problems that can't be solved by the lexer, and will have to be handled at another stage of the parser.

For a simple C-like language, none of those issues arise, and the lexer could convert the values directly. But should it?
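As a sketch of what lex-time conversion looks like (the helper name lex_hex_literal is hypothetical), the standard library's strtoul does the actual work once the lexer has matched a hexadecimal literal:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical lexer helper: convert a matched hex literal to a number.
 * strtoul with base 16 accepts an optional "0x"/"0X" prefix. */
unsigned long lex_hex_literal(const char *lexeme) {
    return strtoul(lexeme, NULL, 16);
}

int main(void) {
    unsigned long v = lex_hex_literal("0xFFFFFFFF");
    /* Reinterpreted as a 32-bit signed int, this bit pattern reads back as -1
     * on a typical two's-complement machine: the INT[-1] option from the
     * question.  The parser would then only ever see the numeric value. */
    printf("%lu (as 32-bit signed: %d)\n", v, (int)v);
    return 0;
}
```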

A token is usually a tuple or record (type, location, value), where the value is the corresponding lexed string. The value may be empty where the string is useless, e.g. for keywords or operators. If the value is a string for some token types and an integer for others, juggling these different types can become awkward and error-prone, especially if your parser is implemented in a language like C (you'd have to use unions!). It would then be better to convert the values in the parser, immediately before AST construction. This is of course not an issue when the host language uses dynamic typing, or when the token types are represented as distinct types in the host language.
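For illustration, here is roughly what that union juggling looks like in a C host language (the type and field names are invented for this sketch):

```c
/* Invented names for this sketch. */
typedef enum { TOK_IDENT, TOK_INT_LITERAL, TOK_KEYWORD } TokenType;

typedef struct {
    TokenType type;
    int line, column;          /* location, for error messages */
    union {
        const char *text;      /* TOK_IDENT: the identifier's spelling */
        long        integer;   /* TOK_INT_LITERAL: the already-converted value */
        /* TOK_KEYWORD carries no value at all */
    } value;
} Token;

/* Every consumer has to check the tag before touching the union. */
long token_as_integer(const Token *t) {
    return t->type == TOK_INT_LITERAL ? t->value.integer : 0;
}
```

If the lexer instead always stores the raw string, the value field stays a single type and the conversion happens once, in the parser, right before the AST node is built.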

Another consideration is the quality of error messages. Imagine that, for some reason, a user writes 0.5E3. During parsing you encounter a problem, such as floating-point numbers not being allowed in that syntactic context. Would your error message report the offending token as 0.5E3 or as 500.0? The same goes for numbers in hex, octal, or binary, and it matters even more when strings or characters contain non-printable characters, soft hyphens, or Unicode lookalikes (homographs). By converting the literals early, you throw away information about their exact formatting, but that information may be very helpful to a confused user. This is not a huge issue (especially when your error message quotes the line containing the error and points out the exact error location), but it's something to consider. Ideally you can show both: the original formatting and a normalized form.
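One way to get "show both" is to keep the original spelling on the token alongside the converted value. A small sketch, with made-up names like FloatToken and report_float_not_allowed:

```c
#include <stdio.h>

/* Invented names for this sketch. */
typedef struct {
    const char *lexeme;   /* "0.5E3", exactly as written in the source */
    double      value;    /* 500.0, the converted value */
    int line, column;
} FloatToken;

void report_float_not_allowed(const FloatToken *t) {
    fprintf(stderr,
            "%d:%d: error: floating-point literal '%s' (= %g) is not allowed here\n",
            t->line, t->column, t->lexeme, t->value);
}

int main(void) {
    FloatToken t = { "0.5E3", 500.0, 3, 9 };
    report_float_not_allowed(&t);  /* quotes the original and the normalized form */
    return 0;
}
```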