Tokenizer Design – Is Using Regex for Token Gathering Appropriate?

lexer, regular expressions

I have recently caught the 'Toy Language' bug and have been experimenting with various simple tokenizer configurations. The most recent one makes use of the boost.regex library to identify and extract the value of tokens. It seemed to me that regex would be the preferred way to go when creating a tokenizer. My research, however, has shown that assumption to be false, or at least not completely true.
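For context, this is roughly the shape of what I have been experimenting with. It is a stripped-down sketch: I actually use boost.regex, but std::regex exposes essentially the same interface, and the token kinds and patterns below are just placeholders.

    #include <iostream>
    #include <regex>
    #include <string>
    #include <vector>

    // Token categories and patterns are placeholders for illustration only.
    enum class TokenKind { Number, Identifier, Operator, Whitespace };

    struct Rule {
        TokenKind  kind;
        std::regex pattern;
    };

    int main() {
        const std::vector<Rule> rules = {
            { TokenKind::Number,     std::regex(R"(\d+(\.\d+)?)") },
            { TokenKind::Identifier, std::regex(R"([A-Za-z_]\w*)") },
            { TokenKind::Operator,   std::regex(R"([-+*/=])")     },
            { TokenKind::Whitespace, std::regex(R"(\s+)")         },
        };

        const std::string input = "rate = rate + 42 * 0.5";
        auto pos = input.cbegin();

        while (pos != input.cend()) {
            std::smatch m;
            bool matched = false;
            for (const auto& rule : rules) {
                // match_continuous anchors the match at the current position.
                if (std::regex_search(pos, input.cend(), m, rule.pattern,
                                      std::regex_constants::match_continuous)) {
                    if (rule.kind != TokenKind::Whitespace)
                        std::cout << "token: '" << m.str() << "'\n";
                    pos += m.length();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                std::cerr << "unexpected character: " << *pos << "\n";
                ++pos;
            }
        }
    }

Each pattern is tried anchored at the current position, and the first rule that matches wins.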

This is what a user on Stack Overflow had to say in answer to the question "Is it bad idea using regex to tokenize string for lexer?":

Using regular expressions is THE traditional way to generate your tokens.

After doing more research into this topic, I realized that even the most popular lexer generators use regex. For example, take the lexer generator Lex:

Lex helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine. – http://dinosaur.compilertools.net/lex/.


That led me to conclude that regex is the usual, preferred way to create a tokenizer. After researching the topic once more, however, and trying to find a second opinion, I came across this statement in response to another question on Stack Overflow:

The reason why people tell you that regular expressions aren't the best idea for a case like this is because regular expressions take more time to evaluate, and the regular expression language has many limitations and quirks that make it unsuitable for many applications. – user:670358

He went on to say:

Many compilers use a basic single-pass tokenization algorithm. The tokenizer has a very basic idea of what can be used as a delimiter, how constants and identifiers should be treated, etc. The tokenizer will then just iterate through the input quickly and emit a string of tokens that can then be easily parsed. – user:670358
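As best I can tell, the hand-rolled, single-pass style he describes looks something like this (my own sketch of my understanding, not code from any particular compiler):

    #include <cctype>
    #include <iostream>
    #include <string>
    #include <vector>

    // Walk the input once, switch on the current character class, and slice
    // out each token by hand -- no regex engine involved.
    std::vector<std::string> tokenize(const std::string& src) {
        std::vector<std::string> tokens;
        std::size_t i = 0;
        while (i < src.size()) {
            if (std::isspace(static_cast<unsigned char>(src[i]))) {
                ++i;                                      // skip delimiters
            } else if (std::isdigit(static_cast<unsigned char>(src[i]))) {
                std::size_t start = i;                    // numeric constant
                while (i < src.size() &&
                       std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
                tokens.push_back(src.substr(start, i - start));
            } else if (std::isalpha(static_cast<unsigned char>(src[i])) || src[i] == '_') {
                std::size_t start = i;                    // identifier / keyword
                while (i < src.size() &&
                       (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) ++i;
                tokens.push_back(src.substr(start, i - start));
            } else {
                tokens.push_back(std::string(1, src[i])); // single-char operator
                ++i;
            }
        }
        return tokens;
    }

    int main() {
        for (const auto& t : tokenize("count = count + 1"))
            std::cout << "'" << t << "' ";
        std::cout << "\n";
    }

If I understand him correctly, no regex engine is involved at all; the character classes themselves serve as the "very basic idea of what can be used as a delimiter."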


This left me both confused and contemplating several things:

  • Is regex necessary for a hand-built tokenizer, or is it overkill?
  • If regex is not often used, then how exactly are the tokens gathered?

This may or may not be opinion based, but I believe there is a preferred, accepted method among programmers.

Best Answer

Regexes work great for lexing/tokenization.

TL;DR

Using regular expressions to tokenize is entirely appropriate. The default approach, really. As for efficiency, regexes traditionally map directly to finite state machines. They're about as simple and efficient as you can get for syntax definitions of any generality whatsoever.

[Figure: finite state machine diagram]

Modern regex engines aren't pure mathematical FSM implementations, having been extended with features like look-ahead, look-behind, and backtracking. But they have a strong theoretical foundation, and in practice are highly optimized and extremely well vetted.

Much of the last fifty-plus years of computer language parsing boils down to finding techniques that untangle the process and make it practical. Divide and conquer / layering is common. Hence the idea of splitting the language-understanding problem into a "lexing" lower level and a "parsing" upper level.

The same goes for strength-reducing approaches, like using only subsets of context-free, unambiguous grammars. Pascal was limited to what could be parsed by recursive descent, and Python is famously restricted to LL(1). There is a whole alphabet soup of LL, LR, SLR, LALR, etc. grammar and parser families. Almost all implemented language designs are carefully constrained by the parsing techniques they use. Perl is the only major language I can think of that isn't so constrained. This dance is described in the "Dragon book(s)," which were the most common "how to language" textbooks for generations.
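To make that layering concrete, here is a minimal sketch of my own (not taken from any particular compiler): a character-level lexer emits a flat token stream, and a tiny recursive-descent parser with one token of look-ahead evaluates additive expressions over it.

    #include <cctype>
    #include <iostream>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Lexing layer: produce a flat stream of token strings.
    std::vector<std::string> lex(const std::string& src) {
        std::vector<std::string> tokens;
        std::size_t i = 0;
        while (i < src.size()) {
            if (std::isspace(static_cast<unsigned char>(src[i]))) { ++i; continue; }
            if (std::isdigit(static_cast<unsigned char>(src[i]))) {
                std::size_t start = i;
                while (i < src.size() &&
                       std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
                tokens.push_back(src.substr(start, i - start));
            } else {
                tokens.push_back(std::string(1, src[i++]));  // '+' or '-'
            }
        }
        return tokens;
    }

    // Parsing layer. Grammar: expr -> number (('+' | '-') number)*
    long parse_expr(const std::vector<std::string>& toks) {
        std::size_t pos = 0;
        auto number = [&]() {
            if (pos >= toks.size()) throw std::runtime_error("expected number");
            return std::stol(toks[pos++]);
        };
        long value = number();
        while (pos < toks.size()) {
            std::string op = toks[pos++];
            if (op == "+")      value += number();
            else if (op == "-") value -= number();
            else throw std::runtime_error("unexpected token: " + op);
        }
        return value;
    }

    int main() {
        std::cout << parse_expr(lex("12 + 30 - 2")) << "\n";  // prints 40
    }

The lexer stays deliberately dumb; everything the grammar cares about lives in the parser, and one token of look-ahead is all it ever needs.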

The strict lexing/parsing split and the "use only subsets of unambiguous, context-free grammars" rules are softening. Lexical analysis is now sometimes not split off as an entirely separate layer, and most systems have enough CPU power and memory to make that feasible. Another answer mentioned PEG parsers. That starts to break the orthodoxy of the language families. Even wider afield, you can see renewed interest in more general parsers/grammars, like the Earley parser, which go beyond the limited look-aheads of the LL/LR aristocracies. Recent implementations and refinements (e.g. Marpa) show that, on modern hardware, there really is no barrier to generalized parsing.

All that said, however, infinite freedom (or even much greater freedom) is not necessarily a good thing. The mechanical, practical, and technical restrictions of any art form (writing, painting, sculpting, film-making, coding, etc.) often impose a discipline of approach that is useful beyond merely matching the available implementation techniques. Would Python, for instance, be greatly improved by generalizing beyond LL(1) parsing? It's not clear that it would. Sure, there are a few unfortunate inconsistencies and limitations, and it needs that significant whitespace. But it also stays clean and consistent, across a vast number of developers and uses, partly as a result of those restrictions. Don't do the language equivalent of what happened when different typefaces, sizes, colors, background colors, variations, and decorations became widely available in word processors and email. Don't use all the options profusely and indiscriminately. That's not good design.

So while greater generality and even ambiguity are now open to you as you implement your toy language, and while you can certainly use one of the newly fashionable PEG or Earley approaches, unless you're writing something that mimics natural human language, you probably don't need to. Standard lexing and parsing approaches will suffice. Or, long story short: regexes are fine.
