Readable Regular Expressions – Maintaining Power and Clarity


Many programmers know the joy of whipping up a quick regular expression, these days often with the help of some web service, more traditionally at an interactive prompt, or perhaps by writing a small script that holds the regular expression under development and a collection of test cases. In any case, the process is iterative and fairly quick: keep hacking at the cryptic-looking string until it matches and captures what you want and rejects what you don't.

For a simple case, the result might be something like this, as a Java regexp:

Pattern re = Pattern.compile(
  "^\\s*(?:(?:([\\d]+)\\s*:\\s*)?(?:([\\d]+)\\s*:\\s*))?([\\d]+)(?:\\s*[.,]\\s*([0-9]+))?\\s*$"
);

Many programmers also know the pain of needing to edit a regular expression, or just to code around one in a legacy code base. With a bit of editing to split it up, the above regexp is still very easy to comprehend for anyone reasonably familiar with regexps, and a regexp veteran should see right away what it does (the answer is at the end of the post, in case someone wants the exercise of figuring it out themselves).

However, things don't need to get much more complex for a regexp to become a truly write-only thing, and even with diligent documentation (which everybody of course writes for all their complex regexps…), modifying it becomes a daunting task. It can be a very dangerous task too, if the regexp is not carefully unit tested (but everybody of course has comprehensive unit tests for all their complex regexps, both positive and negative…).

So, long story short, is there a write-read solution or alternative to regular expressions that doesn't lose their power? What would the above regexp look like with an alternative approach? Any language is fine, though a multi-language solution would be best, to the degree that regexps are multi-language.


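And then, what the earlier regexp does is this: it parses a string of numbers in the format 1:2:3.4, capturing each number, where whitespace is allowed around the numbers and separators, the decimal separator may be either a dot or a comma, and only the 3 is required.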
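
For anyone who would rather check that description than take it on faith, here is a minimal sketch that runs the regexp from the question against a few inputs. The class name NumberFieldsDemo, the parse helper and the sample strings are made up for illustration; only the pattern itself comes from the question.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberFieldsDemo {
    // The regexp from the question, unchanged.
    static final Pattern RE = Pattern.compile(
        "^\\s*(?:(?:([\\d]+)\\s*:\\s*)?(?:([\\d]+)\\s*:\\s*))?([\\d]+)(?:\\s*[.,]\\s*([0-9]+))?\\s*$"
    );

    // Returns the four capture groups (null for a group that did not
    // take part in the match), or null if the input does not match at all.
    static String[] parse(String input) {
        Matcher m = RE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i + 1);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs: three that should match, two that should not.
        String[] samples = {"1:2:3.4", " 2 : 3 , 4 ", "3", "1:2:", "a:3"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + Arrays.toString(parse(s)));
        }
    }
}
```

Note that when only one "n:" field is present it lands in the second group, not the first, because the first group is the optional one inside the prefix.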

Best Answer

A number of people have mentioned composing from smaller parts, but no one's provided an example yet, so here's mine:

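// One captured run of digits.
String number = "(\\d+)";
// A number followed by a colon, with optional whitespace around the colon.
String unit = "(?:" + number + "\\s*:\\s*)";
// An optional '.' or ',' followed by another number.
String optionalDecimal = "(?:\\s*[.,]\\s*" + number + ")?";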

Pattern re = Pattern.compile(
  "^\\s*(?:" + unit + "?" + unit + ")?" + number + optionalDecimal + "\\s*$"
);

Not the most readable, but I feel like it's clearer than the original.
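
A different route, not the one taken above, is Java's Pattern.COMMENTS flag, which makes the engine ignore whitespace in the pattern and treat everything from # to the end of the line as a comment. Below is a rough sketch of the question's regexp laid out that way; the class name CommentedRegexDemo and the parse helper are invented for the example, and the comments inside the pattern are my reading of what each part does.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentedRegexDemo {
    // Same structure as the regexp in the question, spread over several
    // lines. With Pattern.COMMENTS, pattern whitespace is ignored and
    // '#' starts a comment that runs to the end of the line.
    static final Pattern RE = Pattern.compile(
        "^ \\s*                        # leading whitespace\n" +
        "(?:                           # optional prefix of one or two 'n:' fields\n" +
        "  (?: (\\d+) \\s* : \\s* )?   #   group 1: first field (optional)\n" +
        "  (?: (\\d+) \\s* : \\s* )    #   group 2: second field\n" +
        ")?\n" +
        "(\\d+)                        # group 3: the only required number\n" +
        "(?: \\s* [.,] \\s* (\\d+) )?  # group 4: optional part after '.' or ','\n" +
        "\\s* $                        # trailing whitespace\n",
        Pattern.COMMENTS
    );

    // Returns the four capture groups, or null if the input does not match.
    static String[] parse(String input) {
        Matcher m = RE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i + 1);
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("1:2:3.4")));
        System.out.println(Arrays.toString(parse("3")));
    }
}
```

The trade-off is that a literal space or # in the pattern now has to be escaped or written as a character class, and the double backslashes that Java string literals require are still there. The two approaches also combine: the named parts from the composed version can each carry their own comment.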

Also, C# has the @ prefix, which can be prepended to a string literal to indicate that it is to be taken verbatim (backslashes are not treated as escape characters), so number would be @"([\d]+)";