Java regex to extract text between tags

javaregex

I have a file with some custom tags and I'd like to write a regular expression to extract the string between the tags. For example if my tag is:

[customtag]String I want to extract[/customtag]

How would I write a regular expression to extract only the string between the tags. This code seems like a step in the right direction:

Pattern p = Pattern.compile("[customtag](.+?)[/customtag]");
Matcher m = p.matcher("[customtag]String I want to extract[/customtag]");

Not sure what to do next. Any ideas? Thanks.

Best Answer

You're on the right track. Now you just need to extract the desired group, as follows:

final Pattern pattern = Pattern.compile("<tag>(.+?)</tag>", Pattern.DOTALL);
final Matcher matcher = pattern.matcher("<tag>String I want to extract</tag>");
matcher.find();
System.out.println(matcher.group(1)); // Prints String I want to extract

If you want to extract multiple hits, try this:

public static void main(String[] args) {
    final String str = "<tag>apple</tag><b>hello</b><tag>orange</tag><tag>pear</tag>";
    System.out.println(Arrays.toString(getTagValues(str).toArray())); // Prints [apple, orange, pear]
}

private static final Pattern TAG_REGEX = Pattern.compile("<tag>(.+?)</tag>", Pattern.DOTALL);

private static List<String> getTagValues(final String str) {
    final List<String> tagValues = new ArrayList<String>();
    final Matcher matcher = TAG_REGEX.matcher(str);
    while (matcher.find()) {
        tagValues.add(matcher.group(1));
    }
    return tagValues;
}

However, I agree that regular expressions are not the best answer here. I'd use XPath to find elements I'm interested in. See The Java XPath API for more info.