Xml – XPATH or XSL to match two node-sets using custom comparison

xml xpath xslt

EDIT: I also have access to EXSLT functions.

I have two node sets of string tokens. One set contains values like these:

/Geography/North America/California/San Francisco
/Geography/Asia/Japan/Tokyo/Shinjuku

The other set contains values like these:

/Geography/North America/
/Geography/Asia/Japan/

My goal is to find a "match" between the two. A match is made when any string in set 1 begins with a string in set 2. For example, a match would be made between /Geography/North America/California/San Francisco and /Geography/North America/ because a string from set 1 begins with a string from set 2.

I can compare strings using wildcards by using a third-party extension. I can also use regular expressions, all within a single XPath expression.

My problem is how to structure the XPath so that a comparison function is applied across every pairing of nodes from the two sets. XSL is also a viable option.

This XPATH:

count($set1[.=$set2])

Would yield the count of intersection between set1 and set2, but it's a 1-to-1 comparison. Is it possible to use some other means of comparing the nodes?

EDIT: I did get this working, but I am cheating by using some of the other third-party extensions to get the same result. I am still interested in other methods to get this done.

Best Answer

This:

<xsl:variable name="matches" select="$set1[starts-with(., $set2)]"/>

will set $matches to a node-set containing every node in $set1 whose text value starts with the text value of a node in $set2. That's what you're looking for, right?

Edit:

Well, I'm just wrong about this. Here's why.

starts-with expects its two arguments to both be strings. If they're not, it will convert them to strings before evaluating the function.

If you give it a node-set as one of its arguments, it uses the string value of the node-set, which is the text value of the first node in the set (in document order). So in the above, $set2 never gets searched; only its first node is ever examined, and so the predicate will only find nodes in $set1 that start with the value of the first node in $set2.
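To make the conversion concrete, here is a minimal sketch of a stylesheet that counts matches both ways. The input layout (/data/set1/item and /data/set2/item) is a hypothetical one chosen for illustration; it is not taken from the question.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumed (hypothetical) input layout: /data/set1/item and /data/set2/item -->
  <xsl:variable name="set1" select="/data/set1/item"/>
  <xsl:variable name="set2" select="/data/set2/item"/>

  <xsl:template match="/">
    <!-- $set2 is converted to a string: the string value of its first node -->
    <xsl:value-of select="count($set1[starts-with(., $set2)])"/>
    <xsl:text> = </xsl:text>
    <!-- The same test with the conversion written out explicitly -->
    <xsl:value-of select="count($set1[starts-with(., string($set2[1]))])"/>
  </xsl:template>
</xsl:stylesheet>
```

Both counts should always come out the same, which shows that the remaining nodes of $set2 play no part in the test.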

I was misled because this pattern (which I've been using a lot in the last few days) does work:

<xsl:variable name="hits" select="$set1[. = $set2]"/>

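But that predicate is using a comparison between node-sets, not between text values. When = is applied to two node-sets, it is true if any node in one set has the same string value as any node in the other, so every node in $set2 is considered.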

The ideal way to do this would be by nesting predicates. That is, "I want to find every node in $set1 for which there's a node in $set2 whose value starts with..." and here's where XPath breaks down. Starts with what? What you'd like to write is something like:

<xsl:variable name="matches" select="$set1[$set2[starts-with(?, .)]]"/>

only there's no expression you can write for the ? that will return the node currently being tested by the outer predicate. (Unless I'm missing something blindingly obvious.)
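This limitation is specific to XPath 1.0. As an aside, and assuming an XSLT 2.0 processor is available (the question does not say so), a quantified expression can fill the gap, because it binds a range variable without changing the context item. A minimal sketch, again using a hypothetical input layout of /data/set1/item and /data/set2/item:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumed (hypothetical) input layout: /data/set1/item and /data/set2/item -->
  <xsl:variable name="set1" select="/data/set1/item"/>
  <xsl:variable name="set2" select="/data/set2/item"/>

  <xsl:template match="/">
    <!-- "some ... satisfies" binds $prefix to each $set2 node in turn;
         "." is still the $set1 node being tested by the predicate -->
    <xsl:variable name="matches"
        select="$set1[some $prefix in $set2 satisfies starts-with(., $prefix)]"/>
    <xsl:value-of select="count($matches)"/>
  </xsl:template>
</xsl:stylesheet>
```

Here $matches holds the original $set1 nodes rather than copies. None of this applies to an XSLT 1.0 processor.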

To get what you want, you have to test each node individually:

<xsl:variable name="matches">
  <xsl:for-each select="$set1">
    <xsl:if test="$set2[starts-with(current(), .)]">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:for-each>
</xsl:variable>

That's not a very satisfying solution because it evaluates to a result tree fragment, not a node-set. You'll have to use an extension function (like msxsl:node-set) to convert the RTF to a node-set if you want to use the variable in an XPath expression.
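Since the question mentions that EXSLT is available, the conversion could also be done with exsl:node-set from the EXSLT common module. The following is a sketch under assumptions, not a drop-in solution: the input layout (/data/set1/item and /data/set2/item) and the variable name matches-rtf are hypothetical.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:exsl="http://exslt.org/common"
    extension-element-prefixes="exsl">
  <xsl:output method="text"/>

  <!-- Assumed (hypothetical) input layout: /data/set1/item and /data/set2/item -->
  <xsl:variable name="set1" select="/data/set1/item"/>
  <xsl:variable name="set2" select="/data/set2/item"/>

  <xsl:template match="/">
    <!-- Step 1: collect copies of the matching nodes in a result tree fragment -->
    <xsl:variable name="matches-rtf">
      <xsl:for-each select="$set1">
        <!-- current() is the $set1 node; "." is each $set2 node in turn -->
        <xsl:if test="$set2[starts-with(current(), .)]">
          <xsl:copy-of select="."/>
        </xsl:if>
      </xsl:for-each>
    </xsl:variable>

    <!-- Step 2: convert the fragment to a node-set and select the copied items -->
    <xsl:variable name="matches" select="exsl:node-set($matches-rtf)/item"/>

    <!-- $matches can now be used in ordinary XPath expressions -->
    <xsl:value-of select="count($matches)"/>
  </xsl:template>
</xsl:stylesheet>
```

Note that $matches contains copies of the matching nodes, not the original nodes from $set1, so expressions that navigate to parents or siblings in the source document will not work on them.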