Parsing broken XML with lxml.etree.iterparse

Edit:

This is an older answer and I would do it differently today. And I’m not just referring to the dumb snark … since then BeautifulSoup4 has become available and it’s really quite nice. I recommend it to anyone who stumbles over here.


The currently accepted answer is, well, not what one should do.
The question itself also has a bad assumption:

parser = lxml.etree.XMLParser(recover=True) #recovers from bad characters.

Actually, recover=True is for recovering from malformed XML. There is, however, an “encoding” option which would have fixed your issue.

parser = lxml.etree.XMLParser(encoding='utf-8',  # Your encoding issue.
                              recover=True,  # I assume you probably still want to recover from bad XML; it's quite nice. If not, remove.
                              )

That’s it, that’s the solution.
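
To see what recover=True buys you on top of the encoding fix, here is a minimal sketch, assuming lxml is installed. The broken sample document and the parse_leniently helper are made up for illustration; they are not from the question.

```python
from lxml import etree

# Hypothetical sample: the <b> element is never closed.
BROKEN_XML = b"<root><a>1</a><b>2</root>"


def parse_leniently(data):
    """Parse bytes with the encoding forced to UTF-8 and recovery enabled."""
    parser = etree.XMLParser(encoding="utf-8", recover=True)
    return etree.fromstring(data, parser)


# A default (strict) parser refuses the malformed document outright.
try:
    etree.fromstring(BROKEN_XML)
    strict_failed = False
except etree.XMLSyntaxError:
    strict_failed = True

# The recovering parser returns whatever tree it could salvage.
root = parse_leniently(BROKEN_XML)
print(strict_failed, root.tag, root.findtext("a"))
```

How much of the damaged part survives depends on where the document breaks, so treat the recovered tree as best effort rather than as a faithful copy of the input.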


BTW, for anyone struggling with parsing XML in Python, especially from third-party sources: I know, I know, the documentation is bad, and there are a lot of SO red herrings and a lot of bad advice.

  • lxml.etree.fromstring()? – That’s for perfectly formed XML, silly
  • BeautifulStoneSoup? – Slow, and has a way-stupid policy for self
    closing tags
  • lxml.etree.HTMLParser()? – (because the xml is broken)
    Here’s a secret – HTMLParser() is… a Parser with recover=True
  • lxml.html.soupparser? – The encoding detection is supposed to be better, but it has the same failings as BeautifulSoup for self-closing tags. Perhaps you can combine XMLParser with BeautifulSoup’s UnicodeDammit
  • UnicodeDammit and other cockamamie stuff to fix encodings? – Well, UnicodeDammit is kind of cute, I like the name and it’s useful for stuff beyond xml, but things are usually fixed if you do the right thing with XMLParser()
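
One practical difference behind the HTMLParser() bullet, sketched under the assumption that lxml is installed (the sample document is made up): the HTML parser builds an HTML document around whatever it is given, so the element you get back is not your XML root.

```python
from lxml import etree

# Hypothetical, well-formed sample with mixed-case tag names.
SAMPLE = b"<Root><Item>1</Item></Root>"

xml_root = etree.fromstring(SAMPLE, etree.XMLParser(recover=True))
html_root = etree.fromstring(SAMPLE, etree.HTMLParser())

# The XML parser keeps the document's own root element.
print(xml_root.tag)
# The HTML parser returns an <html> element wrapped around the content.
print(html_root.tag)
```

So if you reach for HTMLParser() only because the XML is broken, XMLParser(recover=True) gives you the leniency without the HTML-specific restructuring.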

You could spend ages trying all sorts of stuff from what’s available online, and the lxml documentation could be better. The code above is what you need for 90% of your XML parsing cases. Here I’ll restate it:

from io import StringIO
from lxml import etree
magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(StringIO(your_xml_string), magical_parser)  # or pass in an open file object
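
Since the title mentions iterparse: as far as I know, reasonably recent lxml versions let you pass the same encoding and recover keyword arguments straight to etree.iterparse(). The following is a rough sketch under that assumption; the sample feed and the item_texts helper are hypothetical, so check the behaviour against your own lxml version.

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample: the last <item> is never closed.
BROKEN_FEED = b"<root><item>1</item><item>2</item><item>3</root>"


def item_texts(data):
    """Collect the text of every <item> seen while streaming through data."""
    texts = []
    for _event, elem in etree.iterparse(
        BytesIO(data), events=("end",), tag="item",
        encoding="utf-8", recover=True,
    ):
        texts.append(elem.text)
        elem.clear()  # free the element's content once it has been handled
    return texts


texts = item_texts(BROKEN_FEED)
print(texts)
```

The items that were well-formed before the damage come through as usual; what happens to the element at the point of breakage is up to the parser's recovery logic.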

You’re welcome. My headaches == your sanity. Plus it has other features you might need for, you know, XML.
