Parsing of html string using jquery

None of the current answers addressed the real issue, so I’ll give it a go.

var datahtml = "<html><body><div class=\"class0\"><h4>data1</h4><p class=\"class1\">data2</p><div id=\"mydivid\"><p>data3</p></div></div></body></html>";

console.log($(datahtml));

$(datahtml) is a jQuery object containing only the div.class0 element. When you call .find on it, you are searching the descendants of div.class0 rather than the whole HTML document you might expect, so a selector that targets div.class0 itself matches nothing.
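To see the difference between searching descendants and testing the top-level elements themselves, a small comparison helps. This is only a sketch: countFindVsFilter is a name made up for illustration, and it takes the jQuery function as a parameter instead of assuming a global $.

```javascript
// Hypothetical helper for illustration; `$` is the jQuery function.
function countFindVsFilter($, html, selector) {
  var parsed = $(html);
  return {
    // .find() searches only the descendants of the matched elements.
    find: parsed.find(selector).length,
    // .filter() tests the matched (top-level) elements themselves.
    filter: parsed.filter(selector).length
  };
}

// In a page with jQuery loaded and the datahtml string from above:
//   countFindVsFilter(jQuery, datahtml, ".class0");
//   // expected: { find: 0, filter: 1 }
//   countFindVsFilter(jQuery, datahtml, "#mydivid");
//   // expected: { find: 1, filter: 0 }
```

The usage lines are left as comments because they need a browser page with jQuery loaded.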

A quick solution is to wrap the parsed data in an element so the .find will work as intended:

var parsed = $('<div/>').append(datahtml);
console.log(parsed.find(".class0").text());
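If the same lookup is needed in several places, the wrapping trick can be folded into a small helper. Again, just a sketch: findInHtml is a hypothetical name, not part of jQuery, and the jQuery function is passed in explicitly.

```javascript
// Hypothetical helper (not part of jQuery): search an HTML string for a
// selector, matching top-level parsed elements as well as nested ones.
function findInHtml($, html, selector) {
  // Wrap the parsed nodes in a detached div so that every parsed element,
  // including div.class0 itself, becomes a descendant of the search root.
  var wrapper = $('<div/>').append(html);
  return wrapper.find(selector);
}

// In a page with jQuery loaded and the datahtml string from above:
//   findInHtml(jQuery, datahtml, ".class0").length;     // expected: 1
//   findInHtml(jQuery, datahtml, "#mydivid p").text();  // expected: "data3"
```

The wrapper div is never attached to the page, so the lookup has no visible side effects.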



The reason for this isn’t entirely simple. My assumption is that jQuery “parses” more complex HTML strings by dropping the string into a separate DOM fragment created on the fly and then retrieving the parsed elements. That operation would most likely make the DOM parser ignore the html and body tags, as they would be illegal in that context.

Here is a very small test suite which demonstrates that this behavior is consistent through jQuery 1.8.2 all the way down to 1.6.4.

Edit: quoting this post:

Problem is that jQuery creates a DIV and sets innerHTML and then takes
DIV children, but since BODY and HEAD elements are not valid DIV
childs, then those are not created by browser.

That makes me more confident that my theory is correct. I’ll walk through it here; hopefully it makes some sense to you. Keep the uncompressed source of jQuery 1.8.2 side by side with this. The # indicates line numbers.

All document fragments made through jQuery.buildFragment (defined @#6122) go through jQuery.clean (#6151). Even a cached fragment already went through jQuery.clean when it was created. As the quoted text above implies, jQuery.clean (defined @#6275) creates a fresh div inside the safe fragment to serve as a container for the parsed data. The div element is created at #6301-6303, its childNodes are retrieved at #6344, the div is removed at #6347 for cleanup (plus #6359-6361 as a bug fix), and the childNodes are merged into the return array at #6351-6355 and returned at #6406.

Therefore, this applies to all methods that invoke jQuery.buildFragment, which include jQuery.parseHTML and jQuery.fn.domManip. Among those are .append(), .after() and .before(), which invoke the domManip jQuery object method, and $(html), which is handled in jQuery.fn.init (defined @#97, handling of complex [more than a single tag] html strings @#125, invokes jQuery.parseHTML @#131).

It makes sense that virtually all jQuery HTML string parsing (besides single-tag HTML strings) is done using a div element as a container. The html and body tags are not valid descendants of a div element, so they are stripped out.
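The stripping comes from the browser’s parser rather than from jQuery, so it can be observed with plain DOM calls. The sketch below assumes a browser environment; topLevelTagsViaDiv is a hypothetical name, and the document object is passed in so the dependency is explicit.

```javascript
// Hypothetical helper: parse an HTML string the way the quoted post
// describes (set innerHTML on a div, then read the div's children) and
// report the tag names of the resulting top-level elements.
function topLevelTagsViaDiv(doc, html) {
  var container = doc.createElement('div');
  container.innerHTML = html;
  var tags = [];
  for (var i = 0; i < container.children.length; i++) {
    tags.push(container.children[i].tagName.toLowerCase());
  }
  return tags;
}

// In a browser console:
//   topLevelTagsViaDiv(document,
//     '<html><body><div class="class0"></div></body></html>');
//   // current browsers should report ["div"]: the html and body tags
//   // are dropped because they cannot be children of a div
```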


Addendum: Newer versions of jQuery (1.9+) have refactored the HTML parsing logic (for instance, the internal jQuery.clean method no longer exists), but the overall parsing logic remains the same.
