Lexer – Best Datatype for Tokens Returned by a Lexer

data-types, flex, lexer

As the title says: which data type should a lexer return/give to the parser? The Wikipedia article on lexical analysis states:

In computer science, lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an identified "meaning").

However, in complete contradiction to the above statement, when another question I asked on a different site (Code Review, if you are curious) was answered, the person answering stated:

The lexer usually reads the string and converts this into a stream … of lexemes. The lexemes only need to be a stream of numbers.

and he gave this visual:

nl_output => 256
output    => 257
<string>  => 258

Later on in his answer, he mentioned Flex, an existing lexer generator, and said that writing 'rules' with it would be simpler than writing a lexer by hand. He proceeded to give me this example:

Space              [ \r\n\t]
QuotedString       "[^"]*"
%%
nl_output          {return 256;}
output             {return 257;}
{QuotedString}     {return 258;}
{Space}            {/* Ignore */}
.                  {error("Unmatched character");}
%%

To further my insight and get more information, I read the Wikipedia article about Flex. The Flex article showed that you can define a set of syntax rules, with tokens, in the following way:

digit         [0-9]
letter        [a-zA-Z]

%%
"+"                  { return PLUS;       }
"-"                  { return MINUS;      }
"*"                  { return TIMES;      }
"/"                  { return SLASH;      }
"("                  { return LPAREN;     }
")"                  { return RPAREN;     }
";"                  { return SEMICOLON;  }
","                  { return COMMA;      }
"."                  { return PERIOD;     }
":="                 { return BECOMES;    }
"="                  { return EQL;        }
"<>"                 { return NEQ;        }
"<"                  { return LSS;        }
">"                  { return GTR;        }
"<="                 { return LEQ;        }
">="                 { return GEQ;        }
"begin"              { return BEGINSYM;   }
"call"               { return CALLSYM;    }
"const"              { return CONSTSYM;   }
"do"                 { return DOSYM;      }
"end"                { return ENDSYM;     }
"if"                 { return IFSYM;      }
"odd"                { return ODDSYM;     }
"procedure"          { return PROCSYM;    }
"then"               { return THENSYM;    }
"var"                { return VARSYM;     }
"while"              { return WHILESYM;   }

It seems to me that the Flex lexer is returning strings of keywords/tokens, but it could be returning constants that are equal to certain numbers.

If the lexer were going to return numbers, how would it read string literals? Returning a number is fine for a single keyword, but how would you deal with a string? Wouldn't the lexer have to convert the string to binary numbers, and then the parser would have to convert the numbers back into a string? It seems much more logical (and easier) for the lexer to return strings, and then let the parser convert any numeric string literals into actual numbers.

Or could the lexer possibly return both? I've been trying to write a simple lexer in C++, where a function can have only one return type, which led me to ask this question.

To condense my question into a paragraph: when writing a lexer that can only return one data type (strings or numbers), which would be the more logical choice?

Best Answer

Generally, if you're processing a language through lexing and parsing, you've got a definition of your lexical tokens, e.g.:

NUMBER ::= [0-9]+
ID     ::= [a-zA-Z]+, except for keywords
IF     ::= 'if'
LPAREN ::= '('
RPAREN ::= ')'
COMMA  ::= ','
LBRACE ::= '{'
RBRACE ::= '}'
SEMICOLON ::= ';'
...

and you have a grammar for the parser:

STATEMENT ::= IF LPAREN EXPR RPAREN STATEMENT
            | LBRACE STATEMENT RBRACE
            | EXPR SEMICOLON
EXPR      ::= ID
            | NUMBER
            | ID LPAREN EXPRS RPAREN
...

Your lexer takes the input stream and produces a stream of tokens. The stream of tokens is consumed by the parser to produce a parse tree. In some cases, just knowing the type of the token is enough (e.g., LPAREN, RBRACE, FOR), but in some cases, you'll need the actual value that is associated with the token. For instance, when you encounter an ID token, you'll want the actual characters that make up the ID later on when you're trying to figure out what identifier you're trying to reference.

So, you typically have something more or less like this:

enum TokenType {
  NUMBER, ID, IF, LPAREN, RPAREN, ...;
}

class Token {
  TokenType type;
  String value;
}

So when the lexer returns a token, you know what type it is (which you need for the parsing), and the sequence of characters that it was generated from (which you'll need later on to interpret string and numeric literals, identifiers, etc.). It might feel like you're returning two values, since you're returning a very simple aggregate type, but you really do need both parts. After all, you'd want to treat the following programs differently:

if (2 > 0) {
  print("2 > 0");
}
if (0 > 2) {
  print("0 > 2");
}

These produce the same sequence of token types: IF, LPAREN, NUMBER, GREATER_THAN, NUMBER, RPAREN, LBRACE, ID, LPAREN, STRING, RPAREN, SEMICOLON, RBRACE. That means they parse the same, too. But when you're actually doing something with the parse tree, you'll care that the value of the first number is '2' (or '0') and that the value of the second number is '0' (or '2'), and that the value of the string is '2 > 0' (or '0 > 2').
