Lexer – How Does It Handle Template Strings?


So lexers are supposed to emit tokens for key structures, like INDENT and DEDENT for indentation-sensitive languages, or tokens such as these:

NUMBER ::= [0-9]+
ID     ::= [a-zA-Z]+, except for keywords
IF     ::= 'if'
LPAREN ::= '('
RPAREN ::= ')'
COMMA  ::= ','
LBRACE ::= '{'
RBRACE ::= '}'
SEMICOLON ::= ';'

What does it do for template strings like in JavaScript:

`I am a ${simple} template string!`
`I am a ${complex ? `nested ${simple}` : 'basic'} template string!`

What are the lexer tokens for such strings? How does it handle the literal part of the string vs. the interpolated part, and how, in general, does the lexer process this code to emit its tokens?

The template string has both literal string components and arbitrarily deep nesting of code/template/code/template/etc. How does the lexer know what to do?

Best Answer

Two typical solutions:

  • give up on using a separate lexer. This is easy and efficient with top-down parsing approaches such as recursive descent, PEG, or parser combinators. Such an approach makes it natural to embed languages within each other, e.g. JavaScript code containing strings containing JavaScript expressions.

  • yield multiple tokens per template string. For example, the input `I am a ${simple} template string!` might be tokenized as:

    • template string start "I am a "
    • identifier simple
    • template string end " template string!"

    where the tokens for template strings would roughly be defined as follows using regex notation:

    <template string start> = /`.*?\$\{/
    <template string middle> = /}.*?\$\{/
    <template string end> = /}.*?`/
    <template string simple> = /`.*?`/
    

    and the grammar for template strings might be:

    <template string> = <template string simple>
                      | <template string start> <expression> <template string rest>
    <template string rest> = <template string middle> <expression> <template string rest>
                           | <template string end>
    

    Nevertheless, this tokenization approach is not particularly good. For example, the input {};`//foo` could be tokenized either as an opening brace, a closing brace, a semicolon, and the simple template string "//foo", or as an opening brace, the template string end ";", followed by a comment. Thus, the correct tokenization depends on the parser state. While this state can be recovered for some parser types, a top-down parsing approach without a separate tokenization phase is a much more natural way to deal with nested languages.
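
To make the second approach concrete, here is a minimal sketch of a mode-based lexer (hypothetical and heavily simplified: no escape sequences, no quoted strings, no comments). On a backtick the lexer switches into a string-scanning mode, and it keeps a stack of brace depths so it can tell whether a given } closes an interpolation or an ordinary block:

```javascript
// Sketch only: ignores escapes like \` and \${, quoted strings, and comments.
function lex(input) {
  const tokens = [];
  const braceDepth = [];      // one counter per currently open ${ ... }
  let i = 0;

  // Scan string characters until `${` (start/middle token) or a
  // closing backtick (simple/end token). `i` points just past ` or }.
  function scanTemplate(kind) {
    const start = i;
    while (i < input.length) {
      if (input[i] === '`') {
        tokens.push({ type: kind === 'start' ? 'tmpl-simple' : 'tmpl-end',
                      text: input.slice(start, i) });
        i++;                  // consume closing backtick
        return;
      }
      if (input[i] === '$' && input[i + 1] === '{') {
        tokens.push({ type: kind === 'start' ? 'tmpl-start' : 'tmpl-middle',
                      text: input.slice(start, i) });
        i += 2;               // consume ${
        braceDepth.push(0);   // enter a new interpolation
        return;
      }
      i++;
    }
    throw new Error('unterminated template string');
  }

  while (i < input.length) {
    const c = input[i];
    if (c === '`') { i++; scanTemplate('start'); continue; }
    // A } at depth 0 inside an interpolation resumes string scanning.
    if (c === '}' && braceDepth.length && braceDepth[braceDepth.length - 1] === 0) {
      braceDepth.pop();
      i++; scanTemplate('middle'); continue;
    }
    if (c === '{') {
      if (braceDepth.length) braceDepth[braceDepth.length - 1]++;
      tokens.push({ type: 'lbrace', text: c }); i++; continue;
    }
    if (c === '}') {
      if (braceDepth.length) braceDepth[braceDepth.length - 1]--;
      tokens.push({ type: 'rbrace', text: c }); i++; continue;
    }
    if (/[A-Za-z]/.test(c)) {
      let j = i;
      while (j < input.length && /[A-Za-z]/.test(input[j])) j++;
      tokens.push({ type: 'id', text: input.slice(i, j) }); i = j; continue;
    }
    if (/\s/.test(c)) { i++; continue; }
    tokens.push({ type: 'punct', text: c }); i++;
  }
  return tokens;
}
```

Running this on the first example yields exactly the three tokens listed above (tmpl-start "I am a ", id simple, tmpl-end " template string!"), and the brace-depth stack is what lets it handle the nested second example; note that this stack is parser-like state living inside the lexer, which is precisely the blurring of phases the answer describes.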