Regular Expressions – How Do Regular Expressions Actually Work?


Say you have a document containing an essay. You want to parse this essay to select only certain words. Cool.

Is using a regular expression faster than parsing the file line by line and word by word looking for a match? If so, how does it work? How can you go faster than looking at each word?

Best Answer

How does it work?

Take a look at automata theory.

In short, each regular expression is equivalent to a finite automaton, and a regex can be compiled and optimized into one. The algorithms involved can be found in many compiler books, and they are used by Unix programs like awk and grep.
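As a minimal sketch of the idea (hand-built for illustration; real engines generate these transition tables automatically), here is a deterministic finite automaton equivalent to the regular expression `a*b`. Matching is a single pass over the input with one table lookup per character, with no backtracking:

```python
def matches_a_star_b(s):
    """Hand-built DFA for the regular expression a*b.

    State 0: start (zero or more 'a's seen so far).
    State 1: accepting (the final 'b' has been seen).
    """
    transitions = {
        (0, 'a'): 0,  # stay in the start state on 'a'
        (0, 'b'): 1,  # 'b' moves to the accepting state
        # state 1 has no outgoing transitions: any further input rejects
    }
    state = 0
    for ch in s:
        if (state, ch) not in transitions:
            return False          # no transition defined: reject
        state = transitions[(state, ch)]
    return state == 1             # accept only if we ended in the accepting state
```

Each input character is examined exactly once, which is why this approach runs in time linear in the input length regardless of the pattern.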

However, most modern programming languages (Perl, Python, Ruby, Java (and JVM-based languages), C#) do not use this approach. Instead they use recursive backtracking, which compiles a regular expression into a tree or a sequence of constructs representing its sub-chunks. Most modern "regular expression" syntaxes offer backreferences, which fall outside the class of regular languages (they have no finite-automaton representation) but are trivial to implement with recursive backtracking.
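Backreferences are easy to see in practice. Using Python's `re` module (a backtracking engine), the pattern below requires `\1` to match whatever the first group actually captured, something no fixed set of automaton states can express:

```python
import re

# \1 must repeat the exact text captured by group 1,
# so this matches a doubled word such as "is is".
doubled_word = re.compile(r"\b(\w+) \1\b")

m = doubled_word.search("this is is a test")
print(m.group(1))  # the repeated word: "is"
```

A pure finite-automaton engine cannot support this construct, which is one reason mainstream engines chose backtracking.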

The optimization usually yields a more efficient state machine. For example, consider aaaab|aaaac|aaaad. A normal programmer can get the simple but less efficient implementation (comparing three strings separately) right in ten minutes; but after realizing the pattern is equivalent to aaaa[bcd], a better search first matches four 'a's and then tests the fifth character against [bcd]. This kind of optimization was one of my compiler homework assignments many years ago, so I assume it is also present in most modern regular expression engines.
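The two search strategies above can be sketched side by side (function names are mine, for illustration). The factored version checks the shared prefix once instead of re-reading it for each alternative:

```python
def match_naive(s):
    # The "simple but less efficient" version: compare against
    # the three alternatives of aaaab|aaaac|aaaad separately.
    return s in ("aaaab", "aaaac", "aaaad")

def match_factored(s):
    # Equivalent to the optimized pattern aaaa[bcd]: match the
    # common prefix once, then test the fifth character.
    return len(s) == 5 and s[:4] == "aaaa" and s[4] in "bcd"
```

Both accept exactly the same strings; the factored form just mirrors the smaller state machine the optimizer would build.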

On the other hand, state machines have an advantage when accepting strings: they trade space for time. Compared to a "trivial implementation", the extra states remember the input history, so each input character is processed exactly once. Consider a program to un-escape quotation marks in SQL strings, where: 1) the string starts and ends with a single quotation mark; 2) a single quotation mark is escaped by two consecutive single quotation marks. So input ['a'''] should yield output [a']. With a state machine, consecutive single quotation marks are handled by two states, as the following illustrates:

S1 -> ' -> S2
S1 -> * -> S1, output * (* is any other character)
S2 -> ' -> S1, output '
S2 -> * -> END, end the current string
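The transitions above translate directly into code. This is a sketch of that two-state machine (the function name and input convention are mine): S1 means "inside the string", S2 means "just saw a quote, which is either an escape or the terminator":

```python
def unescape_sql(s):
    """Un-escape a SQL-style single-quoted string, e.g. 'a''' -> a'."""
    assert s.startswith("'"), "input must begin with the opening quote"
    out = []
    state = "S1"
    for ch in s[1:]:               # skip the opening quote
        if state == "S1":
            if ch == "'":
                state = "S2"       # S1 -> ' -> S2: escape or terminator?
            else:
                out.append(ch)     # S1 -> * -> S1, output *
        else:  # state == "S2"
            if ch == "'":
                out.append("'")    # S2 -> ' -> S1, output '
                state = "S1"
            else:
                break              # S2 -> * -> END, end the current string
    return "".join(out)
```

Note that each character is read exactly once; the two states stand in for the "look back at the previous character" a trivial implementation would need.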

So, in my opinion, regular expressions may be slower in some trivial cases, but are usually faster than a manually crafted search algorithm, given that this kind of optimization cannot be done reliably by a human.

(Even in trivial cases like searching for a literal string, a smart engine can recognize the single path through the state map, reduce that part to a simple string comparison, and avoid managing states.)

A particular engine from a framework/library may be slow because it does a bunch of other things a programmer usually doesn't need. Example: the Regex class in .NET creates a bunch of objects, including Match, Groups and Captures.