Java – Best practices for regex performance VS sheer iteration

big ojavaperformanceregex

I was wondering if there are any general guidelines for when to use regex VS "string".contains("anotherString") and/or other String API calls?

While above given decision for .contains() is trivial (why bother with regex if you can do this in a single call), real life brings more complex choices to make. For example, is it better to do two .contains() calls or a single regex?

My rule of thumb was to always use regex, unless this can be replaced with a single API call. This prevents code against bloating, but is probably not so good from code readability point of view, especially if regex tends to get big.

Another, often overlooked, argument is performance. How do I know how many iterations (as in "Big O") does this regex require? Would it be faster than sheer iteration? Somehow everybody assumes that once regex looks shorter than 5 if statements, it must be quicker. But is this always the case? This is especially relevant if regex cannot be pre-compiled in advance.

Best Answer

The answer (as usual) is that it depends.

In your particular case, I guess the alternative would be to do the regex "this|that" and then do a find. This particular construct really pokes at regex's weaknesses. The "OR" in this case doesn't really know what the sub-patterns are trying to do and so can't easily optimize. It ends up doing the equivalent of (in pseudo code):

for( i = 0; i < stringLength; i++ ) {
    if( stringAt pos i starts with "this" )
       found!
    if( stringAt pos i starts with "that" )
       found!
}

There almost isn't a slower way to do it. In this case, two contains() calls will be much faster.

On the other hand, a full match on: ".*this.*|.*that.*" may optimize better.

To me, regex should be used when the code to do otherwise is complicated or unwieldy. So if you want to find one of two or three strings in a target string then just use contains. But if you wanted to find words starting with 'A' or 'B' and ending in 'g'-'m'... then use regex.

And then you won't be so worried about a few cycles here and there.