Lexer Testing – How to Test a Lexer Effectively

Tags: lexer, parsing, testing, tokens, unit-testing

I'm wondering how to effectively test a lexer (tokenizer).

The number of possible token combinations in a source file is huge, and the only approach I've found is to write a batch of representative source files and expect a specific token sequence for each of them.

Best Answer

Your grammar probably has a rule for each token describing how it can be produced (for example, that a { signifies a BLOCK_START token, or that a string-literal token is delimited by '"' characters). Start by writing tests for those rules and verify that your lexer produces the correct token in each case, as in the sketch below.
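As a minimal sketch, suppose the lexer exposes a `tokenize(source)` function (a hypothetical name; adapt it to your API) that returns a list of `(token_type, lexeme)` pairs. A table-driven test then covers each single-token rule with one entry:

```python
import pytest

from mylexer import tokenize  # hypothetical module and function

# One case per token rule in the grammar: input -> expected (type, lexeme).
SINGLE_TOKEN_CASES = [
    ("{", [("BLOCK_START", "{")]),
    ("}", [("BLOCK_END", "}")]),
    ('"hello"', [("STRING_LITERAL", '"hello"')]),
    ("12", [("INTEGER_LITERAL", "12")]),
]

@pytest.mark.parametrize("source,expected", SINGLE_TOKEN_CASES)
def test_single_token(source, expected):
    assert tokenize(source) == expected
```

Keeping the cases in a plain data table makes it cheap to add one test per grammar rule, so coverage of the token rules stays easy to audit.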

Once you have a test for each single token, you can add a few tests for interesting combinations of tokens. Focus here on token combinations that would reveal an error in your lexer. The token combinations don't have to make sense to a parser for your language, so it is entirely valid to use +++++12 as input and expect the tokens INCREMENT, INCREMENT, PLUS, INTEGER_LITERAL(12) as output.
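Continuing the same hypothetical sketch, the +++++12 example becomes a direct assertion. It also pins down the lexer's maximal-munch behaviour (that ++ is preferred over two + tokens):

```python
from mylexer import tokenize  # hypothetical, as above

def test_adjacent_increments_lex_greedily():
    # +++++12 should tokenize as ++, ++, +, 12 (maximal munch).
    assert tokenize("+++++12") == [
        ("INCREMENT", "++"),
        ("INCREMENT", "++"),
        ("PLUS", "+"),
        ("INTEGER_LITERAL", "12"),
    ]
```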

Finally, make sure you have some tests for faulty inputs, where the lexer cannot recognize a token at all. Although I mention them last, they certainly don't have to be the last tests you write; you could just as well start with these.
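In the same sketch, error tests assert that the lexer fails cleanly rather than looping or silently skipping input. Both `LexError` and the claim that @ is invalid are assumptions here; substitute your lexer's actual error-reporting mechanism and an actually-illegal input for your language:

```python
import pytest

from mylexer import LexError, tokenize  # hypothetical names

def test_unterminated_string_is_rejected():
    # A missing closing quote should raise, not hang or return garbage.
    with pytest.raises(LexError):
        tokenize('"no closing quote')

def test_unknown_character_is_rejected():
    with pytest.raises(LexError):
        tokenize("@")  # assuming '@' starts no valid token in this language
```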
