Lexer – Best Datatype for Tokens Returned by a Lexer

data-types, flex, lexer

As the title says: which data type should a lexer return/give to the parser? The Wikipedia article on lexical analysis states:

In computer science, lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an identified "meaning").

However, in complete contradiction to the above statement, when another question I asked on a different site (Code Review, if you are curious) was answered, the person answering stated:

The lexer usually reads the string and converts this into a stream … of lexemes. The lexemes only need to be a stream of numbers.

and he gave this visual:

nl_output => 256
output    => 257
<string>  => 258

Later on in his answer, he mentioned Flex, an existing lexer generator, and said that writing 'rules' with it would be simpler than writing a lexer by hand. He proceeded to give me this example:

Space              [ \r\n\t]
QuotedString       "[^"]*"
%%
nl_output          {return 256;}
output             {return 257;}
{QuotedString}     {return 258;}
{Space}            {/* Ignore */}
.                  {error("Unmatched character");}
%%

To further my insight and get more information, I read the Wikipedia article about Flex. The Flex article showed that you can define a set of syntax rules, with tokens, in the following way:

digit         [0-9]
letter        [a-zA-Z]

%%
"+"                  { return PLUS;       }
"-"                  { return MINUS;      }
"*"                  { return TIMES;      }
"/"                  { return SLASH;      }
"("                  { return LPAREN;     }
")"                  { return RPAREN;     }
";"                  { return SEMICOLON;  }
","                  { return COMMA;      }
"."                  { return PERIOD;     }
":="                 { return BECOMES;    }
"="                  { return EQL;        }
"<>"                 { return NEQ;        }
"<"                  { return LSS;        }
">"                  { return GTR;        }
"<="                 { return LEQ;        }
">="                 { return GEQ;        }
"begin"              { return BEGINSYM;   }
"call"               { return CALLSYM;    }
"const"              { return CONSTSYM;   }
"do"                 { return DOSYM;      }
"end"                { return ENDSYM;     }
"if"                 { return IFSYM;      }
"odd"                { return ODDSYM;     }
"procedure"          { return PROCSYM;    }
"then"               { return THENSYM;    }
"var"                { return VARSYM;     }
"while"              { return WHILESYM;   }

It seems to me that the Flex lexer is returning strings of keywords/tokens, but it could be returning constants that are equal to certain numbers.

If the lexer were going to return numbers, how would it read string literals? Returning a number is fine for a single keyword, but how would you deal with a string? Wouldn't the lexer have to convert the string to binary numbers, and then the parser would have to convert the numbers back into a string? It seems much more logical (and easier) for the lexer to return strings, and then let the parser convert any numeric string literals into actual numbers.

Or could the lexer possibly return both? I've been trying to write a simple lexer in C++, where a function can have only one return type, which led me to ask this question.

To condense my question into a paragraph: when writing a lexer that can only return one data type (strings or numbers), which would be the more logical choice?

Best Answer

Generally, if you're processing a language through lexing and parsing, you've got a definition of your lexical tokens, e.g.:

NUMBER ::= [0-9]+
ID     ::= [a-zA-Z]+, except for keywords
IF     ::= 'if'
LPAREN ::= '('
RPAREN ::= ')'
COMMA  ::= ','
LBRACE ::= '{'
RBRACE ::= '}'
SEMICOLON ::= ';'
...

and you have a grammar for the parser:

STATEMENT ::= IF LPAREN EXPR RPAREN STATEMENT
            | LBRACE STATEMENT RBRACE
            | EXPR SEMICOLON
EXPR      ::= ID
            | NUMBER
            | ID LPAREN EXPRS RPAREN
...

Your lexer takes the input stream and produces a stream of tokens. The stream of tokens is consumed by the parser to produce a parse tree. In some cases, just knowing the type of the token is enough (e.g., LPAREN, RBRACE, FOR), but in some cases, you'll need the actual value that is associated with the token. For instance, when you encounter an ID token, you'll want the actual characters that make up the ID later on when you're trying to figure out what identifier you're trying to reference.

So, you typically have something more or less like this:

enum TokenType {
  NUMBER, ID, IF, LPAREN, RPAREN, ...;
}

class Token {
  TokenType type;
  String value;
}

So when the lexer returns a token, you know what type it is (which you need for the parsing), and the sequence of characters that it was generated from (which you'll need later on to interpret string and numeric literals, identifiers, etc.). It might feel like you're returning two values, since you're returning a very simple aggregate type, but you really do need both parts. After all, you'd want to treat the following programs differently:

if (2 > 0) {
  print("2 > 0");
}
if (0 > 2) {
  print("0 > 2");
}

These produce the same sequence of token types: IF, LPAREN, NUMBER, GREATER_THAN, NUMBER, RPAREN, LBRACE, ID, LPAREN, STRING, RPAREN, SEMICOLON, RBRACE. That means they parse the same, too. But when you're actually doing something with the parse tree, you'll care that the value of the first number is '2' (or '0') and that the value of the second number is '0' (or '2'), and that the value of the string is '2 > 0' (or '0 > 2').
