Compiler – Should a Lexer Allow Obvious Syntax Errors?

Tags: compiler, lexer, parsing

This is kinda like a concrete version of the question "Coming up with tokens for a lexer".

I'm writing a lexer for a small subset of HTML. I'm wondering what I should do when the input stream ends and I'm in a state where I've successfully recognized a token, but I know the result will be a syntax error.

I emphasise "I know" because it's the human me knowing: I'm aware of the grammar rules, which are "parser rules" (as opposed to "lexer rules"). I know that <b>hello</b is malformed, but there's nothing stopping the lexer from emitting the following.

Token: BEGIN-OPEN-TAG
Token: TAG-NAME           Value: b
Token: END-TAG
Token: DATA               Value: hello
Token: BEGIN-CLOSE-TAG    
Token: TAG-NAME           Value: b

The parser would then catch this as an error and report it. The only reason I know I can throw an error earlier is that I'm aware of the parser and the rules defined there. Do I get any benefit from marking this as an invalid sequence of tokens, or should I keep such logic out of the lexer? When should a lexer emit an error at all?
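For concreteness, here's a minimal sketch of the kind of lexer I mean (Python; the names and structure are purely illustrative, not my actual code). It tracks only one bit of state, "am I inside a tag?", and never checks that tags match, so it cheerfully produces the token stream above for the truncated input:

```python
import re
from typing import Iterator, NamedTuple

class Token(NamedTuple):
    kind: str
    value: str = ""

def lex(text: str) -> Iterator[Token]:
    i, in_tag = 0, False
    while i < len(text):
        if text.startswith("</", i):            # must be checked before "<"
            yield Token("BEGIN-CLOSE-TAG")
            i, in_tag = i + 2, True
        elif text[i] == "<":
            yield Token("BEGIN-OPEN-TAG")
            i, in_tag = i + 1, True
        elif text[i] == ">":
            yield Token("END-TAG")
            i, in_tag = i + 1, False
        else:
            # Grab a run of ordinary characters; whether it is a tag name
            # or data depends only on the "inside a tag" flag.
            chunk = re.match(r"[^<>]+", text[i:]).group(0)
            yield Token("TAG-NAME" if in_tag else "DATA", chunk)
            i += len(chunk)
    # End of input: the lexer stops here even if a tag was left open.

for tok in lex("<b>hello</b"):
    print(tok.kind, tok.value)
```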

Should it allow <b<hello</b> then? How should a lexer handle a random < in the middle of the text, as in: The < sign is <b><</b>? Backtracking? Or should I record it as [data] [<] [tagname] [>] [<] [</] [tagname] [>] and then let the parser decide that [<] is valid in the middle of data?

The above aren't questions I expect answers to; they're more of a "if I decide on the question above, it opens an abyss of further blurred lines", which is why I have all these doubts. I'm having a hard time deciding what the lexer should care about. If I make it care too much, I'm writing a parser at the same time. If I don't make it care enough, I'm pretty much writing a "split at whitespace" procedure.

Best Answer

Your lexer is never going to be able to diagnose all syntax errors unless you make it as powerful as the parser itself. This would be a large and totally unnecessary amount of work, and the only benefit would be that illegal documents are recognized as illegal very slightly faster. That's not enough value for the high price.

Therefore you should keep your lexer as simple as possible, emitting primitives and not worrying its little head about syntax rules. Your code base will be much easier to understand if each component does one thing only.
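To illustrate the division of labour, here's a rough sketch (Python, illustrative names only, assuming tokens shaped like the ones in the question: objects with a kind and a value). The tag-balancing rule lives entirely in the parser; the lexer never needs to know it exists:

```python
def parse(tokens):
    toks = list(tokens)
    stack = []                        # names of currently open tags
    i = 0

    def expect(kind):
        nonlocal i
        if i >= len(toks) or toks[i].kind != kind:
            raise SyntaxError(f"expected {kind} at token {i}")
        i += 1
        return toks[i - 1]

    while i < len(toks):
        kind = toks[i].kind
        if kind == "BEGIN-OPEN-TAG":
            i += 1
            stack.append(expect("TAG-NAME").value)
            expect("END-TAG")
        elif kind == "BEGIN-CLOSE-TAG":
            i += 1
            name = expect("TAG-NAME").value
            expect("END-TAG")         # fails here for "<b>hello</b"
            if not stack or stack.pop() != name:
                raise SyntaxError(f"mismatched closing tag: {name}")
        elif kind == "DATA":
            i += 1
        else:
            raise SyntaxError(f"unexpected token {kind}")

    if stack:
        raise SyntaxError(f"unclosed tags at end of input: {stack}")
```

With this split, parse(lex("<b>hello</b")) raises a SyntaxError at the missing >, and a stray < in the middle of data is simply another token for the parser to judge; the lexer stays a dumb tokenizer throughout.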