I still find lookahead and lookbehind to be terribly confusing and often unreadable.
You're aware that regular expressions can be exploded and commented, right?
$foo =~ m/^
(?=.*a) # must contain an a somewhere
(?=.*c) # must contain a c somewhere
(?=.*1) # must contain a 1 somewhere
(?=.*2) # must contain a 2 somewhere
\S+ # all non-space characters
$/x
Is it good practice to use lookahead/lookbehind in regular expressions, or are they simply a hack that has found its way into modern production code?
They are quite indispensable for avoiding catastrophic backtracking and the regex-related security issues (ReDoS) it causes. Ideally, use atomic groups as well.
Compare how the above expression backtracks with its naive equivalent:
$foo =~ m/^
\S*a\S*c\S*1\S*2\S* # a, then c, then 1, then 2
|
\S*a\S*c\S*2\S*1\S* # a, c, 2, 1
|
\S*a\S*1\S*c\S*2\S* # a, 1, c, 2
|
\S*a\S*1\S*2\S*c\S* # a, 1, 2, c
|
# ... etc
$/x
The difference is especially stark with a long input consisting of a random sequence of a, c and 2 (but no 1): the naive pattern must try every alternative, backtracking through each, before it can fail.
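The same technique carries over directly to other backtracking engines. A minimal sketch in Python's re module (the names here are mine, not from the answer above), showing that the lookahead form rejects the pathological input quickly:

```python
import re

# Each (?=...) scans once from the anchor, so a failing input is
# rejected without combinatorial backtracking over orderings.
pattern = re.compile(r'^(?=.*a)(?=.*c)(?=.*1)(?=.*2)\S+$')

assert pattern.match('xa2c1y')           # contains a, c, 1 and 2
assert not pattern.match('ac2' * 1000)   # long input, no 1: fails fast
```

The `(?=.*1)` lookahead fails after a single left-to-right scan, so there is no explosion of alternatives to retry.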
How does it work?
Take a look at automata theory
In short, each true regular expression has an equivalent finite automaton, and the expression can be compiled and optimized into one. The algorithms involved can be found in many compiler books, and they are used by Unix programs like awk and grep.
However, most modern programming languages (Perl, Python, Ruby, Java and other JVM-based languages, C#) do not use this approach. Instead, they use recursive backtracking, which compiles a regular expression into a tree or a sequence of constructs representing its sub-chunks. Most modern "regular expression" syntaxes also offer backreferences, which fall outside the class of regular languages (they have no finite-automaton representation) but are trivial to implement in a recursive backtracking engine.
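As a concrete illustration (my own minimal example, not from the answer above), backreferences recognize languages no finite automaton can, such as "the same word repeated":

```python
import re

# \1 must repeat exactly what group 1 captured -- a non-regular
# constraint, impossible to encode in a finite automaton.
doubled = re.compile(r'^(\w+)-\1$')

assert doubled.match('abc-abc')
assert not doubled.match('abc-abd')
```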
This optimization usually yields a more efficient state machine. For example, consider aaaab|aaaac|aaaad: a typical programmer can get the simple but less efficient implementation (comparing against the three strings separately) right in ten minutes. But once you realize the pattern is equivalent to aaaa[bcd], a better search can be done by matching the four a's first and then testing the fifth character against [bcd]. This optimization was one of my compiler homework assignments many years ago, so I assume it is also done in most modern regular expression engines.
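A rough sketch of that factoring: both forms accept exactly the same strings, but the second tests the common prefix only once (the variable names here are mine):

```python
import re

naive = re.compile(r'aaaab|aaaac|aaaad')   # three separate comparisons
factored = re.compile(r'aaaa[bcd]')        # shared prefix, then one class

# The two patterns are equivalent: they accept and reject the same inputs.
for s in ('aaaab', 'aaaac', 'aaaad', 'aaaae', 'aaab'):
    assert bool(naive.fullmatch(s)) == bool(factored.fullmatch(s))
```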
On the other hand, state machines do have an advantage when accepting strings, precisely because they use more space than a "trivial implementation": the extra states remember input history, so each input character is processed exactly once. Consider a program to un-escape quotations in SQL strings, where a string 1) starts and ends with a single quotation mark, and 2) embedded single quotation marks are escaped by two consecutive single quotations. So input ['a'''] should yield output [a']. With a state machine, consecutive quotation marks are handled by two states, as the following illustrates:
...
S1 -'-> S2
S1 -*-> S1, output * (* is any other character)
S2 -'-> S1, output '
S2 -*-> END, end the current string
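That transition table can be turned into code mechanically. A minimal sketch (the names sql_unescape, S1 and S2 are mine) that walks the states exactly as drawn, touching each character once:

```python
def sql_unescape(s):
    """Un-escape a SQL string literal, e.g. "'a'''" -> "a'".

    s includes the surrounding single quotes; doubled quotes
    inside the literal are escapes for a single quote.
    """
    out = []
    state = 'S1'
    for c in s[1:]:             # skip the opening quote
        if state == 'S1':
            if c == "'":
                state = 'S2'    # S1 -'-> S2: maybe escape, maybe end
            else:
                out.append(c)   # S1 -*-> S1, output *
        else:                   # state S2
            if c == "'":
                out.append("'") # S2 -'-> S1, output '
                state = 'S1'
            else:
                break           # S2 -*-> END: the literal is over
    return ''.join(out)

assert sql_unescape("'a'''") == "a'"
```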
So, in my opinion, regular expressions may be slower in some trivial cases, but they are usually faster than a hand-crafted search algorithm, given that the optimization cannot be done reliably by a human.
(Even in trivial cases like searching a string, a smart engine can recognize the single path in the state map and reduce that part to a simple string comparison and avoid managing states.)
A particular engine from a framework or library may be slow because it does a bunch of other things a programmer usually doesn't need. For example, the Regex class in .NET creates a number of objects, including Match, Groups and Captures.
Best Answer
First of all, there have been regular expression libraries for C since before your "higher-level" languages were invented. Just saying, C programs aren't as podunk as some people seem to think.
For most grammars, lexing is a matter of searching for whitespace and a few other characters like ()[]{}; to split the words, and then matching against a list of keywords to see if any match.
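A toy version of that scheme (the keyword list and token names here are illustrative, not tied to any particular grammar):

```python
import re

KEYWORDS = {'if', 'else', 'while', 'return'}  # illustrative list

def lex(source):
    # Split on whitespace, but keep ()[]{}; as standalone tokens,
    # then check each remaining word against the keyword list.
    words = re.findall(r'[()\[\]{};]|[^\s()\[\]{};]+', source)
    return [('KEYWORD' if w in KEYWORDS else 'TOKEN', w) for w in words]

assert lex('if (x) { return; }') == [
    ('KEYWORD', 'if'), ('TOKEN', '('), ('TOKEN', 'x'), ('TOKEN', ')'),
    ('TOKEN', '{'), ('KEYWORD', 'return'), ('TOKEN', ';'), ('TOKEN', '}'),
]
```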