Programming Languages – Applying Lexer Rules Universally

Tags: lexer, programming-languages

I'm trying to understand the theory behind a lexer with the purpose of building one (just for my own fun and experience and to compensate for not taking proper CS courses :)).

What I have yet to understand is whether lexer theory is the same no matter what language you are analyzing. Is it just about splitting on whitespace and then trying to figure out what each piece of text represents, or is it more than that?

In other words, does the generation of tokens depend on the language being analyzed, or is it just about whitespace? Can you apply the same lexer rules to all programming languages? For example, if I have a lexer that tokenizes Java code, can I use it for Python, given that whitespace in Python has a different meaning?

For example, I found a nice Python project (Pygments) which seems to provide a working core that lets you later plug in rules for each of your favorite languages (what the keywords are, the comments, etc.).

Best Answer

No, you can't. Here is a nonsense snippet of Lisp:

(*foo-bar* 'baz 14)

What are the tokens in this snippet? The answer is something like

LPAREN
SYMBOL  "*foo-bar*"
QUOTE
SYMBOL  "baz"
INTEGER "14"
RPAREN

Lisps have very liberal rules about what can appear inside an identifier: hyphens -, asterisks *, and many other characters are allowed. Expressions can be quoted, which prevents their evaluation; in many dialects this is done by prepending a ' to the expression.

A lexer needs to be aware of the language it is tokenizing. If we had used a C lexer on the above code, we would get an error, because ' introduces a character literal, which also requires a closing ', and none is present here. Our C tokenizer might produce:

LPAREN
OP_STAR
IDENTIFIER "foo"
OP_MINUS
IDENTIFIER "bar"
OP_STAR
ERROR character literal isn't terminated: expected »'« after »'b«

For each language we want to lex, we have to write special rules. Every language is unique, and many differ only subtly. Here, the identifier rules are so different that the two token streams bear no resemblance to each other. However, some languages inherit syntax from others: for a very simple snippet like foo.bar(baz), a lexer written for C++ and one for Python might produce comparable results.
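Both might emit, give or take the exact token names, a stream like this:

IDENTIFIER "foo"
DOT
IDENTIFIER "bar"
LPAREN
IDENTIFIER "baz"
RPAREN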

But why do we actually use lexers?

Formally, lexers are not needed; we can describe a language without them just fine. Lexers are a performance optimization: they are essentially very restricted preprocessors for a parser, and they can match the input very efficiently (e.g. when implemented as a state machine). The lexer emits tokens, which are larger building blocks (a token is a pairing of a type or ID with a string). It is then the job of the parser to combine these tokens into a hierarchical structure (generally, an abstract syntax tree). This frees the parser from doing character-level operations, which makes it more efficient and easier to maintain (I have written parsers without this separation; it works, but it is annoying).
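To make that division of labor concrete, here is a sketch of the interface between the two, with made-up names: the lexer produces a flat stream of type/text pairs, and the parser only ever asks for the next one.

# hypothetical lexer/parser interface, for illustration only
my @token_stream = (
  { type => 'LPAREN', text => '('         },
  { type => 'SYMBOL', text => '*foo-bar*' },
  # ... and so on
);

# the parser never touches raw characters, only tokens
sub next_token { shift @token_stream }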

And here we have another problem: a lexer is written for a specific parser. The two work in tandem and cannot be recombined willy-nilly with other lexers or parsers. In the token streams above, I called the "(" token LPAREN, but a different grammar might require an OP_LP symbol instead.

So for a variety of reasons, it's impossible to write one lexer to rule them all.

Each language needs its own lexer. But we can take shortcuts and use programs that generate specialized lexers for us from some specification. Tools like lex are often used for this. If performance is not an issue, I use the regular-expression library of the host language to cobble together a simple lexer, e.g. in Perl:

# a primitive Lisp lexer: an ordered list of token names and patterns.
# The order matters: SYMBOL also matches "14", so whichever rule is
# listed first wins (a hash would be tried in unpredictable order).
my @tokens = (
  [ LPAREN  => qr/[(]/         ],
  [ RPAREN  => qr/[)]/         ],
  [ SYMBOL  => qr/[^\s()']+/   ],
  [ QUOTE   => qr/[']/         ],
  [ INTEGER => qr/[1-9][0-9]*/ ],
);
my $space = qr/\s+/;

my $input = "(*foo-bar* 'baz 14)";
my $pos = 0;
my $length = length $input;

POSITION:
while ($pos < $length) {
  pos($input) = $pos;   # anchor \G for the following matches
  # skip whitespace between tokens
  $pos += length $1 and next if $input =~ /\G($space)/gc;

  # try each token rule in order at the current position
  for my $pair (@tokens) {
    my ($name, $pattern) = @$pair;
    if ($input =~ /\G($pattern)/gc) {
      print "$name -- $1\n";
      $pos += length $1;
      next POSITION;
    }
  }

  die "No token matched, just before:\n", (substr $input, $pos, 20), "\n";
}

Output:

LPAREN -- (
SYMBOL -- *foo-bar*
QUOTE -- '
SYMBOL -- baz
SYMBOL -- 14
RPAREN -- )

But we can easily change the token definitions:

# an incomplete C lexer, again as an ordered list of rules
my @tokens = (
  [ LPAREN     => qr/[(]/                    ],
  [ RPAREN     => qr/[)]/                    ],
  [ IDENTIFIER => qr/[a-zA-Z_][a-zA-Z0-9_]*/ ],
  [ CHARACTER  => qr/['](?:[^'\\]|\\.)[']/   ],
  [ INTEGER    => qr/[1-9][0-9]*/            ],
  [ OP_STAR    => qr/[*]/                    ],
  [ OP_MINUS   => qr/[-]/                    ],
);
my $space = qr/\s+/;

which changes the output to:

LPAREN -- (
OP_STAR -- *
IDENTIFIER -- foo
OP_MINUS -- -
IDENTIFIER -- bar
OP_STAR -- *
No token matched, just before:
'baz 14)

We didn't write a new lexer; we just updated the token definitions.

What to do now:

  • As a preparation, understand and implement the shunting-yard algorithm. This will teach you about a simple stack machine (a sketch in Perl follows after this list).

  • Read up on grammars and languages in computer science. Learn what a production rule is.

  • Read up on the Extended Backus-Naur Form (EBNF) notation, even if you don't immediately understand it. It is a way to write down grammars, and grammars are a way to describe languages (e.g. programming languages). A small example follows after this list.

  • Understand the implications of the Chomsky Hierarchy. Here are the three important levels:

    • Regular Languages can be matched very efficiently. Lexers usually operate on this level.
    • Context-Free Languages are the basis for the description of many programming languages. Various parsing strategies can match various subsets. You will want to read up on LL, LR, and PEG at least.
    • Context-Sensitive Languages are even more general, but they are vastly more difficult to parse. Note that many regex libraries like PCRE have context-sensitive features.

    Note that most real programming languages are not fully context-free: semantic whitespace as in Python, here-docs as in Bash, and optional end tags in HTML mess up the most elegant descriptions (but tricks exist to fake a context-free grammar even then).

  • Write a simple recursive-descent parser yourself that evaluates simple arithmetic like 1 + 2 * 3 and handles operator precedence correctly. Using a lexer is not necessary for this. Add parentheses to your language, then some functions like abs or sin. Post your solution on codereview to get tips on how to do it better. (A minimal sketch follows after this list.)

  • Now that you've learned how to parse things yourself, you may enjoy using tools like parser generators. As an exercise, pick a simple language like JSON and write a parser for it with your tool of choice. Return to codereview for criticism.
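As promised after the first bullet, here is a minimal shunting-yard sketch in Perl. It converts an infix token list to postfix (reverse Polish) notation; it only knows the four binary operators and parentheses, and all names in it are made up for this example:

# shunting-yard: infix tokens in, postfix (RPN) tokens out
use strict;
use warnings;

my %prec = ('+' => 1, '-' => 1, '*' => 2, '/' => 2);

sub to_rpn {
    my @tokens = @_;
    my (@output, @ops);
    for my $tok (@tokens) {
        if ($tok =~ /^[0-9]+$/) {
            push @output, $tok;        # numbers go straight to the output
        }
        elsif (exists $prec{$tok}) {
            # pop operators of higher or equal precedence (left-associative)
            push @output, pop @ops
                while @ops && exists $prec{$ops[-1]}
                           && $prec{$ops[-1]} >= $prec{$tok};
            push @ops, $tok;
        }
        elsif ($tok eq '(') { push @ops, $tok }
        elsif ($tok eq ')') {
            push @output, pop @ops while @ops && $ops[-1] ne '(';
            pop @ops;                  # discard the "(" itself
        }
    }
    push @output, pop @ops while @ops; # flush remaining operators
    return @output;
}

print join(' ', to_rpn(qw(1 + 2 * 3))), "\n";   # prints: 1 2 3 * +

Evaluating the resulting RPN with a value stack takes only a few more lines, and that value stack is exactly the "simple stack machine" the bullet mentions.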
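To go with the EBNF bullet: the little arithmetic language from the parser exercise can be written down in a few lines. EBNF comes in several dialects; this is one common style, and the rule names are arbitrary:

expr   = term   { ("+" | "-") term } ;
term   = factor { ("*" | "/") factor } ;
factor = INTEGER | "(" expr ")" ;

Read { ... } as "zero or more repetitions" and | as a choice. Note how the nesting of the rules (an expr contains terms, a term contains factors) is what gives * higher precedence than +.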
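And for the recursive-descent bullet, a minimal evaluator in Perl that implements exactly the grammar above, one sub per rule (the sub names are arbitrary, and error handling is omitted):

# recursive-descent evaluator for expr/term/factor as defined above
use strict;
use warnings;

my @toks;                      # token list, consumed from the left

sub parse_expr {               # expr = term { ("+" | "-") term }
    my $v = parse_term();
    while (@toks && ($toks[0] eq '+' || $toks[0] eq '-')) {
        my $op  = shift @toks;
        my $rhs = parse_term();
        $v = $op eq '+' ? $v + $rhs : $v - $rhs;
    }
    return $v;
}

sub parse_term {               # term = factor { ("*" | "/") factor }
    my $v = parse_factor();
    while (@toks && ($toks[0] eq '*' || $toks[0] eq '/')) {
        my $op  = shift @toks;
        my $rhs = parse_factor();
        $v = $op eq '*' ? $v * $rhs : $v / $rhs;
    }
    return $v;
}

sub parse_factor {             # factor = INTEGER | "(" expr ")"
    my $t = shift @toks;
    if ($t eq '(') {
        my $v = parse_expr();
        shift @toks;           # consume the ")"
        return $v;
    }
    return $t;                 # an integer token
}

@toks = split ' ', '1 + 2 * 3';
print parse_expr(), "\n";      # prints 7

Note how each sub mirrors one grammar rule, and how precedence falls out of the call structure rather than any explicit table.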

If you have any problems on the way, ask here on Programmers or over on Stack Overflow, depending on whether the question is conceptual or implementation-related.