loadHTML LIBXML_HTML_NOIMPLIED on an html fragment generates incorrect tags

The re-arrangement is done by the LIBXML_HTML_NOIMPLIED option you’re using. Looks like it’s not stable enough for your case.

Also you might want to not use it for portablility reasons, for example I’ve got one PHP 5.4.36 with Libxml 2.7.8 at hand that is not supporting LIBXML_HTML_NOIMPLIED (Libxml >= 2.7.7) but later LIBXML_HTML_NODEFDTD (Libxml >= 2.7.8) option.

I know this way of dealing with it. When you load the fragment, you wrap it into a <div> element:

$doc->loadHTML("<div>$str</div>");

This helps to guide DOMDocument on the structure you want.

You can then extract this container from the document itself:

$container = $doc->getElementsByTagName('div')->item(0);
$container = $container->parentNode->removeChild($container);

And then remove all children from the document:

while ($doc->firstChild) {
    $doc->removeChild($doc->firstChild);
}

Now the document is completely empty and you’re now able to append children again. Luckily there is the <div> container element we removed earlier, so we can add from it:

while ($container->firstChild ) {
    $doc->appendChild($container->firstChild);
}

The fragment then can be retrieved with the known saveHTML method:

echo $doc->saveHTML();

Which gives in your scenario:

<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>

This methodology is a little different from the existing material here on site (see the references I give below), so the example at once:

$str="<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>";

$doc = new DOMDocument();
$doc->loadHTML("<div>$str</div>");

$container = $doc->getElementsByTagName('div')->item(0);
$container = $container->parentNode->removeChild($container);
while ($doc->firstChild) {
    $doc->removeChild($doc->firstChild);
}

while ($container->firstChild ) {
    $doc->appendChild($container->firstChild);
}

echo $doc->saveHTML();

I also really recommend the reference question on How to saveHTML of DOMDocument without HTML wrapper? for a further read as well as the one about inner-html

References

Leave a Comment