How to replace text URLs and exclude URLs in HTML tags?

A streamlined version of Gumbo’s answer above:

$html = <<< HTML
<html>
<body>
<p>
    This is a text with a <a href="http://example.com/1">link</a>
    and another <a href="http://example.com/2">http://example.com/2</a>
    and also another http://example.com with the latter being the
    only one that should be replaced. There is also images in this
    text, like <img src="http://example.com/foo"/> but these should
    not be replaced either. In fact, only URLs in text that is no
    a descendant of an anchor element should be converted to a link.
</p>
</body>
</html>
HTML;

Let’s use an XPath expression that fetches only those text nodes that contain http://, https:// or ftp:// and that are not descendants of an anchor element.

$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$texts = $xPath->query(
    '/html/body//text()[
        not(ancestor::a) and (
        contains(.,"http://") or
        contains(.,"https://") or
        contains(.,"ftp://") )]'
);

The XPath above gives us a single text node with the following data:

 and also another http://example.com with the latter being the
    only one that should be replaced. There is also images in this
    text, like 

Since PHP 5.3 we could also call PHP functions from inside the XPath expression (via DOMXPath::registerPhpFunctions()) and use the regex pattern to select our nodes instead of the three calls to contains().
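As a rough sketch of that alternative, the query below calls preg_match() from within the XPath expression. The sample markup and the shortened pattern are assumptions made for this illustration; they are not the ones used in the rest of this answer.

```php
<?php
// Assumed sample markup for this sketch.
$html = '<html><body><p>See <a href="http://example.com/1">http://example.com/1</a> '
      . 'and also http://example.com for details.</p></body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$xPath = new DOMXPath($dom);
// The "php" prefix has to be bound to this namespace URI before PHP
// functions can be called from an XPath expression.
$xPath->registerNamespace('php', 'http://php.net/xpath');
// Only allow preg_match() to be called from XPath.
$xPath->registerPhpFunctions('preg_match');

// php:functionString() hands the string value of the context node to
// preg_match(), which returns 1 when the pattern matches.
$texts = $xPath->query(
    '/html/body//text()[
        not(ancestor::a) and
        php:functionString("preg_match", "~(?:https?|ftp)://~i", .) = 1]'
);

foreach ($texts as $text) {
    echo trim($text->data), PHP_EOL;
}
```

Only the text node outside the anchor is selected here, just as with the contains() version.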

Instead of splitting the text nodes apart in the standards-compliant way, we will build a document fragment and simply replace the entire text node with that fragment. Non-standard here only means that the method we use to fill the fragment, DOMDocumentFragment::appendXML(), is not part of the W3C specification of the DOM API.
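For comparison, the standards-compliant route could look roughly like the sketch below, which uses only W3C DOM methods such as splitText(). The helper names charLength() and linkifyTextNode(), the sample markup and the simplified pattern are all assumptions made for this illustration.

```php
<?php
// Hypothetical helper: number of UTF-8 characters in $s. splitText() counts
// characters, while the offsets reported by preg_* functions are bytes.
function charLength(string $s): int
{
    return (int) preg_match_all('/./su', $s);
}

// Hypothetical helper: wraps every URL inside $text in an <a> element.
function linkifyTextNode(DOMText $text): void
{
    // Simplified pattern for this sketch, not the one used in the answer.
    $pattern = '~(?:https?|ftp)://[^\s<>"\']+~i';
    $data = $text->data;
    preg_match_all($pattern, $data, $matches, PREG_OFFSET_CAPTURE);

    // Work from the last match to the first so earlier offsets stay valid.
    foreach (array_reverse($matches[0]) as [$url, $byteOffset]) {
        $start   = charLength(substr($data, 0, $byteOffset));
        $urlNode = $text->splitText($start);    // $text keeps what precedes the URL
        $urlNode->splitText(charLength($url));  // what follows becomes a sibling node

        $a = $text->ownerDocument->createElement('a');
        $a->setAttribute('href', $url);
        $urlNode->parentNode->insertBefore($a, $urlNode);
        $a->appendChild($urlNode);              // moves the URL text into the anchor
    }
}

$dom = new DOMDocument;
$dom->loadHTML('<html><body><p>Visit http://example.com today.</p></body></html>');
$xPath = new DOMXPath($dom);

// Copy the result into an array first, since the loop modifies the tree.
foreach (iterator_to_array($xPath->query('//text()[not(ancestor::a)]')) as $text) {
    linkifyTextNode($text);
}

echo $dom->saveHTML($dom->getElementsByTagName('p')->item(0)), PHP_EOL;
```

The fragment-based loop that follows trades this offset bookkeeping for a single appendXML() call.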

foreach ($texts as $text) {
    $fragment = $dom->createDocumentFragment();
    $fragment->appendXML(
        preg_replace(
            "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i",
            '<a href="$1">$1</a>',
            $text->data
        )
    );
    $text->parentNode->replaceChild($fragment, $text);
}
echo $dom->saveXML($dom->documentElement);

This will then output:

<html><body>
<p>
    This is a text with a <a href="http://example.com/1">link</a>
    and another <a href="http://example.com/2">http://example.com/2</a>
    and also another <a href="http://example.com">http://example.com</a> with the latter being the
    only one that should be replaced. There is also images in this
    text, like <img src="http://example.com/foo"/> but these should
    not be replaced either. In fact, only URLs in text that is no
    a descendant of an anchor element should be converted to a link.
</p>
</body></html>
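One caveat with the fragment approach: appendXML() parses its argument as XML, so a text node whose data contains a bare & or < (for example a URL with a query string) is not well-formed input for it. A possible guard is to escape each piece of text before building the markup. The sketch below assumes a hypothetical helper linkifyToXml() and a simplified URL pattern; neither is part of the original answer.

```php
<?php
// Hypothetical helper: returns well-formed XML in which every URL found in
// the plain text $text is wrapped in an <a> element.
function linkifyToXml(string $text): string
{
    // Simplified pattern for this sketch, not the one used in the answer.
    $pattern = '~((?:https?|ftp)://[^\s<>"\']+)~i';

    // With one capture group, odd indexes hold the URLs and even indexes
    // hold the text in between.
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $xml = '';
    foreach ($parts as $i => $part) {
        $escaped = htmlspecialchars($part, ENT_QUOTES | ENT_XML1, 'UTF-8');
        $xml .= ($i % 2 === 1)
            ? '<a href="' . $escaped . '">' . $escaped . '</a>'
            : $escaped;
    }
    return $xml;
}

// Assumed sample markup containing ampersands in both text and URL.
$dom = new DOMDocument;
$dom->loadHTML(
    '<html><body><p>Q&amp;A: http://example.com/?a=1&amp;b=2 is fine</p></body></html>'
);

$text = $dom->getElementsByTagName('p')->item(0)->firstChild;

$fragment = $dom->createDocumentFragment();
$fragment->appendXML(linkifyToXml($text->data));
$text->parentNode->replaceChild($fragment, $text);

echo $dom->saveXML($dom->documentElement), PHP_EOL;
```

Escaping after splitting, rather than before matching, keeps the entity text (such as the `;` in `&amp;`) from interfering with where the URL pattern stops.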
