Tokenizer Design – Is Using Regex for Token Gathering Appropriate?

lexer, regular expressions

I have recently caught the 'Toy Language' bug and have been experimenting with various simple tokenizer configurations. The most recent one makes use of the boost.regex library to identify and extract the value of tokens. It seemed to me that regex would be the preferred way to go when creating a tokenizer. My research, however, has shown that assumption to be false, or at least not completely true.
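For context, this is roughly the shape of what I have been experimenting with. It is a stripped-down sketch: I actually use boost.regex, but std::regex exposes essentially the same interface, and the token kinds and patterns below are just placeholders.

    #include <iostream>
    #include <regex>
    #include <string>
    #include <vector>

    // Token categories and patterns are placeholders for illustration only.
    enum class TokenKind { Number, Identifier, Operator, Whitespace };

    struct Rule {
        TokenKind  kind;
        std::regex pattern;
    };

    int main() {
        const std::vector<Rule> rules = {
            { TokenKind::Number,     std::regex(R"(\d+(\.\d+)?)") },
            { TokenKind::Identifier, std::regex(R"([A-Za-z_]\w*)") },
            { TokenKind::Operator,   std::regex(R"([-+*/=])")     },
            { TokenKind::Whitespace, std::regex(R"(\s+)")         },
        };

        const std::string input = "rate = rate + 42 * 0.5";
        auto pos = input.cbegin();

        while (pos != input.cend()) {
            std::smatch m;
            bool matched = false;
            for (const auto& rule : rules) {
                // match_continuous anchors the match at the current position.
                if (std::regex_search(pos, input.cend(), m, rule.pattern,
                                      std::regex_constants::match_continuous)) {
                    if (rule.kind != TokenKind::Whitespace)
                        std::cout << "token: '" << m.str() << "'\n";
                    pos += m.length();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                std::cerr << "unexpected character: " << *pos << "\n";
                ++pos;
            }
        }
    }

Each pattern is tried anchored at the current position, and the first rule that matches wins.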

This is what a user on Stack Overflow had to say in answer to the question "Is it bad idea using regex to tokenize string for lexer?":

Using regular expressions is THE traditional way to generate your tokens.

After doing more research into this topic, I realized that even the most popular lexer generators use regex. For example, take the lexer generator Lex:

Lex helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine. – http://dinosaur.compilertools.net/lex/.


That led me to conclude that regex is the usual, preferred way to create a tokenizer. After researching the topic once more, however, and trying to find a second opinion, I came across this statement in response to another question on Stack Overflow:

The reason why people tell you that regular expressions aren't the best idea for a case like this is because regular expressions take more time to evaluate, and the regular expression language has many limitations and quirks that make it unsuitable for many applications. – user:670358

He went on to say:

Many compilers use a basic single-pass tokenization algorithm. The tokenizer has a very basic idea of what can be used as a delimiter, how constants and identifiers should be treated, etc. The tokenizer will then just iterate through the input quickly and emit a string of tokens that can then be easily parsed. – user:670358
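As best I can tell, the hand-rolled, single-pass style he describes looks something like this (my own sketch of my understanding, not code from any particular compiler):

    #include <cctype>
    #include <iostream>
    #include <string>
    #include <vector>

    // Walk the input once, switch on the current character class, and slice
    // out each token by hand -- no regex engine involved.
    std::vector<std::string> tokenize(const std::string& src) {
        std::vector<std::string> tokens;
        std::size_t i = 0;
        while (i < src.size()) {
            if (std::isspace(static_cast<unsigned char>(src[i]))) {
                ++i;                                      // skip delimiters
            } else if (std::isdigit(static_cast<unsigned char>(src[i]))) {
                std::size_t start = i;                    // numeric constant
                while (i < src.size() &&
                       std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
                tokens.push_back(src.substr(start, i - start));
            } else if (std::isalpha(static_cast<unsigned char>(src[i])) || src[i] == '_') {
                std::size_t start = i;                    // identifier / keyword
                while (i < src.size() &&
                       (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) ++i;
                tokens.push_back(src.substr(start, i - start));
            } else {
                tokens.push_back(std::string(1, src[i])); // single-char operator
                ++i;
            }
        }
        return tokens;
    }

    int main() {
        for (const auto& t : tokenize("count = count + 1"))
            std::cout << "'" << t << "' ";
        std::cout << "\n";
    }

If I understand him correctly, no regex engine is involved at all; the character classes themselves serve as the "very basic idea of what can be used as a delimiter."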


This left me both confused and contemplating several things:

  • Is regex necessary for a hand-built tokenizer, or is it overkill?
  • If regex is not often used, then how exactly are the tokens gathered?

This may or may not be opinion based, but I believe there is a preferred, accepted method among programmers.

Best Answer

Regexes work great for lexing/tokenization.

TL;DR

Using regular expressions to tokenize is entirely appropriate. The default approach, really. As for efficiency, regexes traditionally map directly to finite state machines. They're about as simple and efficient as you can get for syntax definitions of any generality whatsoever.

[Figure: finite state machine diagram]

Modern regex engines aren't pure mathematical FSM implementations, having been extended with features like look-ahead, look-behind, and backtracking. But they have a strong theoretical foundation, and in practice are highly optimized and extremely well vetted.

Much of the last fifty-plus years of computer language parsing boils down to finding techniques that untangle the process and make it practical. Divide and conquer / layering is common. Hence the idea of splitting the language-understanding problem into a "lexing" lower level and a "parsing" upper level.

The same goes for strength-reducing approaches, like using only subsets of context-free, unambiguous grammars. Pascal was limited to what could be parsed by recursive descent, and Python is famously restricted to LL(1). There is a whole alphabet soup of LL, LR, SLR, LALR, etc. grammar and parser families. Almost all implemented language designs are carefully constrained by the parsing techniques they use. Perl is the only major language I can think of that isn't so constrained. This dance is described in the "Dragon book(s)," which were the most common "how to language" textbooks for generations.
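To make that layering concrete, here is a minimal sketch of my own (not taken from any particular compiler): a character-level lexer emits a flat token stream, and a tiny recursive-descent parser with one token of look-ahead evaluates additive expressions over it.

    #include <cctype>
    #include <iostream>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Lexing layer: produce a flat stream of token strings.
    std::vector<std::string> lex(const std::string& src) {
        std::vector<std::string> tokens;
        std::size_t i = 0;
        while (i < src.size()) {
            if (std::isspace(static_cast<unsigned char>(src[i]))) { ++i; continue; }
            if (std::isdigit(static_cast<unsigned char>(src[i]))) {
                std::size_t start = i;
                while (i < src.size() &&
                       std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
                tokens.push_back(src.substr(start, i - start));
            } else {
                tokens.push_back(std::string(1, src[i++]));  // '+' or '-'
            }
        }
        return tokens;
    }

    // Parsing layer. Grammar: expr -> number (('+' | '-') number)*
    long parse_expr(const std::vector<std::string>& toks) {
        std::size_t pos = 0;
        auto number = [&]() {
            if (pos >= toks.size()) throw std::runtime_error("expected number");
            return std::stol(toks[pos++]);
        };
        long value = number();
        while (pos < toks.size()) {
            std::string op = toks[pos++];
            if (op == "+")      value += number();
            else if (op == "-") value -= number();
            else throw std::runtime_error("unexpected token: " + op);
        }
        return value;
    }

    int main() {
        std::cout << parse_expr(lex("12 + 30 - 2")) << "\n";  // prints 40
    }

The lexer stays deliberately dumb; everything the grammar cares about lives in the parser, and one token of look-ahead is all it ever needs.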

The strict lexing/parsing split and the "use only subsets of unambiguous, context-free grammars" rules are softening. Lexical analysis is now sometimes not split off as an entirely separate layer, and most systems have enough CPU power and memory to make that feasible. Another answer mentioned PEG parsers. That starts to break the orthodoxy of the language families. Even wider afield, you can see renewed interest in more general parsers/grammars, like the Earley parser, which go beyond the limited look-aheads of the LL/LR aristocracies. Recent implementations and refinements (e.g. Marpa) show that, on modern hardware, there really is no barrier to generalized parsing.

All that said, however, infinite freedom (or even much greater freedom) is not necessarily a good thing. The mechanical, practical, and technical restrictions of any art form (writing, painting, sculpting, film-making, coding, etc.) often impose a discipline of approach that is useful beyond merely matching the available implementation techniques. Would Python, for instance, be greatly improved by generalizing beyond LL(1) parsing? It's not clear that it would. Sure, there are a few unfortunate inconsistencies and limitations, and it needs that significant whitespace. But it also stays clean and consistent, across a vast number of developers and uses, partly as a result of those restrictions. Don't do the language equivalent of what happened when different typefaces, sizes, colors, background colors, variations, and decorations became widely available in word processors and email. Don't use all the options profusely and indiscriminately. That's not good design.

So while greater generality and even ambiguity are now open to you as you implement your toy language, and while you can certainly use one of the newly fashionable PEG or Earley approaches, unless you're writing something that mimics natural human language, you probably don't need to. Standard lexing and parsing approaches will suffice. Or, long story short: regexes are fine.
