Ignore html tags in preg_replace

I assume you should make your function based on DOMDocument and DOMXPath rather than using regular expressions. Even those are quite powerful, you run into problems like the one you describe which are not (always) easily and robust to solve with regular expressions.

The general saying is: Don’t parse HTML with regular expressions.

It’s a good rule to keep in mind and albeit as with any rule, it does not always apply, it’s worth to make up one’s mind about it.

XPath allows you so find all texts that contain the search terms within texts only, ignoring all XML elements.

Then you only need to wrap those texts into the <span> and you’re done.

Edit: Finally some code 😉

First it makes use of xpath to locate elements that contain the search text. My query looks like this, this might be written better, I’m not a super xpath pro:

'//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..'

$search contains the text to search for, not containing any " (quote) character (this would break it, see Cleaning/sanitizing xpath attributes for a workaround if you need quotes).

This query will return all parents that contain textnodes which put together will be a string that contain your search term.

As such a list is not easy to process further as-is, I created a TextRange class that represents a list of DOMText nodes. It is useful to do string-operations on a list of textnodes as if they were one string.

This is the base skeleton of the routine:

$str="..."; # some XML

$search="text that span";

printf("Searching for: (%d) '%s'\n", strlen($search), $search);

$doc = new DOMDocument;
$doc->loadXML($str);
$xp = new DOMXPath($doc);

$anchor = $doc->getElementsByTagName('body')->item(0);
if (!$anchor)
{
    throw new Exception('Anchor element not found.');
}

// search elements that contain the search-text
$r = $xp->query('//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..', $anchor);
if (!$r)
{
    throw new Exception('XPath failed.');
}

// process search results
foreach($r as $i => $node)
{   
    $textNodes = $xp->query('.//child::text()', $node);

    // extract $search textnode ranges, create fitting nodes if necessary
    $range = new TextRange($textNodes);        
    $ranges = array();
    while(FALSE !== $start = strpos($range, $search))
    {
        $base = $range->split($start);
        $range = $base->split(strlen($search));
        $ranges[] = $base;
    };

    // wrap every each matching textnode
    foreach($ranges as $range)
    {
        foreach($range->getNodes() as $node)
        {
            $span = $doc->createElement('span');
            $span->setAttribute('class', 'search_hightlight');
            $node = $node->parentNode->replaceChild($span, $node);
            $span->appendChild($node);
        }
    }
}

For my example XML:

<html>
    <body>
        This is some <span>text</span> that span across a page to search in.
    and more text that span</body>
</html>

It produces the following result:

<html>
    <body>
        This is some <span><span class="search_hightlight">text</span></span><span class="search_hightlight"> that span</span> across a page to search in.
    and more <span class="search_hightlight">text that span</span></body>
</html>

This shows that this even allows to find text that is distributed across multiple tags. That’s not that easily possible with regular expressions at all.

You find the full code here: http://codepad.viper-7.com/U4bxbe (including the TextRange class that I have taken out of the answers example).

It’s not working properly on the viper codepad because of an older LIBXML version that site is using. It works fine for my LIBXML version 20707. I created a related question about this issue: XPath query result order.

A note of warning: This example uses binary string search (strpos) and the related offsets for splitting textnodes with the DOMText::splitText function. That can lead to wrong offsets, as the functions needs the UTF-8 character offset. The correct method is to use mb_strpos to obtain the UTF-8 based value.

The example works anyway because it’s only making use of US-ASCII which has the same offsets as UTF-8 for the example-data.

For a real life situation, the $search string should be UTF-8 encoded and mb_strpos should be used instead of strpos:

 while(FALSE !== $start = mb_strpos($range, $search, 0, 'UTF-8'))

Leave a Comment