Why is XPath contains(text(),'substring') not working as expected?

For this markup,

<a>Ask Question<other/>more text</a>

notice that the a element has a text node child ("Ask Question"), an empty element child (other), and a second text node child ("more text").

Here’s how to reason through what’s happening when evaluating //a[contains(text(),'Ask Question')] against that markup:

  1. contains(x,y) expects x to be a string, but text() matches two text nodes.
  2. In XPath 1.0, the rule for converting multiple nodes to a string is this:

A node-set is converted to a string by returning the string-value of
the node in the node-set that is first in document order. If the
node-set is empty, an empty string is returned. [Emphasis added]

  3. In XPath 2.0+, it is an error to pass a sequence of more than one item to a function expecting a single string, so contains(text(),'substr') raises an error whenever text() matches more than one text node.

In your case…

  • XPath 1.0 would treat contains(text(),'Ask Question') as

    contains('Ask Question','Ask Question')
    

    which is true. Note, however, that contains(text(),'more text') evaluates to false in XPath 1.0, because only the first text node, "Ask Question", is considered. Without knowing points (1)-(3) above, this can be counterintuitive.

  • XPath 2.0 would treat it as an error.
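
The XPath 1.0 behavior described above can be sketched in Python. This is only an emulation of the "first node in document order" rule using the standard library's xml.etree.ElementTree, not a real XPath engine; the helper names child_text_nodes and contains_xpath1 are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

def child_text_nodes(elem):
    """Return the element's immediate text-node children in document order,
    roughly what the XPath step text() selects. (Hypothetical helper.)"""
    # ElementTree stores the first text node in .text and each later
    # text node in the .tail of the child element that precedes it.
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t]  # skip missing or empty text

def contains_xpath1(elem, substring):
    """Mimic XPath 1.0 contains(text(), substring): only the first text
    node is used; an empty node-set becomes the empty string."""
    nodes = child_text_nodes(elem)
    first = nodes[0] if nodes else ""
    return substring in first

a = ET.fromstring("<a>Ask Question<other/>more text</a>")

print(child_text_nodes(a))                 # ['Ask Question', 'more text']
print(contains_xpath1(a, "Ask Question"))  # True
print(contains_xpath1(a, "more text"))     # False
```

The second text node is present but never consulted, which is why the 'more text' test fails under XPath 1.0 rules.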

Better alternatives

  • If the goal is to find all a elements whose string value contains the substring "Ask Question":

    //a[contains(.,'Ask Question')]
    

    This is the most common requirement.

  • If the goal is to find all a elements with an immediate text node child equal to "Ask Question":

    //a[text()='Ask Question']
    

    This can be useful when you want to exclude text from descendant elements of a, for example if you want this a,

    <a>Ask Question<other/>more text</a>
    

    but not this a:

    <a>more text before <not>Ask Question</not> more text after</a>
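
To see how the two alternatives differ on the two sample elements, here is a minimal sketch that emulates each predicate with Python's standard xml.etree.ElementTree. It is not an XPath engine, and the helper names string_value and has_text_child_equal are hypothetical, chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

def string_value(elem):
    """String value of an element: all descendant text concatenated,
    which is what '.' becomes inside contains(., ...)."""
    return "".join(elem.itertext())

def has_text_child_equal(elem, target):
    """Mimic a[text()='target']: true if any immediate text-node
    child equals target exactly."""
    nodes = [elem.text] + [child.tail for child in elem]
    return any(t == target for t in nodes)

first = ET.fromstring("<a>Ask Question<other/>more text</a>")
second = ET.fromstring(
    "<a>more text before <not>Ask Question</not> more text after</a>"
)

for a in (first, second):
    print("Ask Question" in string_value(a),
          has_text_child_equal(a, "Ask Question"))
# True True
# True False
```

Both elements pass the string-value test, but only the first has an immediate text node child equal to "Ask Question". Note that the equality comparison is true if any text node child matches, so under the same rule the first element would also satisfy text()='more text'.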
    
