Java Regex Capturing Groups

javaregex

I am trying to understand this code block. In the first one, what is it we are looking for in the expression?

My understanding is that it is any character (0 or more times *) followed by any number between 0 and 9 (one or more times +) followed by any character (0 or more times *).

When this is executed the result is:

Found value: This order was placed for QT3000! OK?
Found value: This order was placed for QT300
Found value: 0

Could someone please go through this with me?

What is the advantage of using Capturing groups?

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTut3 {

    public static void main(String args[]) {
        String line = "This order was placed for QT3000! OK?"; 
        String pattern = "(.*)(\\d+)(.*)";

        // Create a Pattern object
        Pattern r = Pattern.compile(pattern);

        // Now create matcher object.
        Matcher m = r.matcher(line);

        if (m.find()) {
            System.out.println("Found value: " + m.group(0));
            System.out.println("Found value: " + m.group(1));
            System.out.println("Found value: " + m.group(2));
        } else {
            System.out.println("NO MATCH");
        }
    }

}

Best Answer

The issue you're having is with the type of quantifier. You're using a greedy quantifier in your first group (index 1 - index 0 represents the whole Pattern), which means it'll match as much as it can (and since it's any character, it'll match as many characters as there are in order to fulfill the condition for the next groups).

In short, your 1st group .* matches anything as long as the next group \\d+ can match something (in this case, the last digit).

As per the 3rd group, it will match anything after the last digit.

If you change it to a reluctant quantifier in your 1st group, you'll get the result I suppose you are expecting, that is, the 3000 part.

Note the question mark in the 1st group.

String line = "This order was placed for QT3000! OK?";
Pattern pattern = Pattern.compile("(.*?)(\\d+)(.*)");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
    System.out.println("group 1: " + matcher.group(1));
    System.out.println("group 2: " + matcher.group(2));
    System.out.println("group 3: " + matcher.group(3));
}

Output:

group 1: This order was placed for QT
group 2: 3000
group 3: ! OK?

More info on Java Pattern here.

Finally, the capturing groups are delimited by round brackets, and provide a very useful way to use back-references (amongst other things), once your Pattern is matched to the input.

In Java 6 groups can only be referenced by their order (beware of nested groups and the subtlety of ordering).

In Java 7 it's much easier, as you can use named groups.