lxml etree xmlparser remove unwanted namespace

    import io
    import lxml.etree as ET

    # bytes literal: io.BytesIO requires bytes on Python 3
    content = b'''\
    <Envelope xmlns="http://www.example.com/zzz/yyy">
      <Header>
        <Version>1</Version>
      </Header>
      <Body>
        some stuff
      </Body>
    </Envelope>
    '''
    dom = ET.parse(io.BytesIO(content))

You can find namespace-aware nodes using the xpath method:

    body = dom.xpath('//ns:Body', namespaces={'ns': 'http://www.example.com/zzz/yyy'})
    print(body)
    # [<Element {http://www.example.com/zzz/yyy}Body at 90b2d4c>]

If you really want to remove namespaces, you could use an XSL transformation:

    # http://wiki.tei-c.org/index.php/Remove-Namespaces.xsl
    xslt='''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> …
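
As a rough alternative to the XSL route (a different technique, not the stylesheet referenced above), namespaces can also be stripped in Python by rewriting every tag to its local name. This is a minimal sketch, assuming a small Envelope document like the one in the example; strip_namespaces is a hypothetical helper name, not part of lxml.

```python
import io
import lxml.etree as ET

content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header><Version>1</Version></Header>
  <Body>some stuff</Body>
</Envelope>
'''

def strip_namespaces(tree):
    """Hypothetical helper: replace each qualified tag with its local name."""
    for el in tree.iter():
        # Comments and processing instructions have a non-string .tag; skip them.
        if isinstance(el.tag, str):
            el.tag = ET.QName(el).localname
    # Drop namespace declarations that are no longer used.
    ET.cleanup_namespaces(tree)
    return tree

dom = ET.parse(io.BytesIO(content))
strip_namespaces(dom)

# After stripping, a plain unprefixed XPath finds the element.
body = dom.xpath('//Body')
print(body[0].tag, repr(body[0].text))
```

Note that this modifies the tree in place and discards the namespace information entirely, so it only suits cases where the namespaces carry no meaning you need later.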

parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)

You are using the decoded unicode value. Use the raw response data, r.raw, instead:

    r = requests.get(url, params=payload, stream=True)
    r.raw.decode_content = True
    etree.parse(r.raw)

This reads the data from the response directly; do note the stream=True option to .get(). Setting the r.raw.decode_content = True flag ensures that the raw socket will give you the decompressed content …
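
The lxml side of this problem can be reproduced without a network request. The sketch below assumes a made-up inline document (xml_bytes is illustrative, not from the question): decoded text that still carries an encoding declaration is typically rejected by lxml with a ValueError, while the original bytes parse cleanly.

```python
from lxml import etree

# Made-up sample: UTF-8 bytes with an XML encoding declaration.
xml_bytes = b'<?xml version="1.0" encoding="UTF-8"?><root>caf\xc3\xa9</root>'
xml_text = xml_bytes.decode("utf-8")  # what a decoded response body looks like

# Parsing the decoded str: lxml usually refuses because of the declaration.
try:
    etree.fromstring(xml_text)
    str_error = None
except ValueError as exc:
    str_error = exc
print("parsing str:", str_error)

# Parsing the bytes lets the parser handle the declared encoding itself.
root = etree.fromstring(xml_bytes)
print("parsing bytes:", root.text)
```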

How to use regular expression in lxml xpath?

You can do this (although you don't need regular expressions for the example). lxml supports regular expressions through the EXSLT extension functions (see the lxml docs for the XPath class; it also works for the xpath() method):

    doc.xpath("//a[re:match(text(), 'some text')]", namespaces={"re": "http://exslt.org/regular-expressions"})

Note that you need to give the namespace mapping, so that it …
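
Here is a small self-contained sketch of the same idea with a pattern that actually needs a regular expression. It uses re:test, the boolean EXSLT function, rather than re:match; the sample markup and the /item/ URL scheme are invented for illustration.

```python
from lxml import etree

# Invented sample markup.
doc = etree.fromstring(
    '<ul>'
    '<li><a href="/item/12">Item 12</a></li>'
    '<li><a href="/about">About</a></li>'
    '<li><a href="/item/7">Item 7</a></li>'
    '</ul>'
)

# The prefix "re" must be bound to the EXSLT regular-expressions namespace.
RE_NS = {"re": "http://exslt.org/regular-expressions"}

# Keep only links whose href is "/item/" followed by digits.
links = doc.xpath(r"//a[re:test(@href, '^/item/\d+$')]", namespaces=RE_NS)
texts = [a.text for a in links]
print(texts)
```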

parsing xml containing default namespace to get an element value using lxml

This is a common error when dealing with XML that has a default namespace. Your XML has a default namespace, that is, a namespace declared without a prefix, here:

    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

Note that not only the element where the default namespace is declared is in that namespace: all descendant elements inherit the ancestor's default namespace implicitly, unless otherwise specified (using explicit namespace prefix …
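
To make the inheritance concrete, here is a minimal sketch. The sitemap and loc child elements and the example.com URLs are assumptions added for illustration (only the sitemapindex root appears in the excerpt), and the prefix sm is an arbitrary choice.

```python
from lxml import etree

# Illustrative document; the children inherit the default namespace of the root.
xml = b'''<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemap1.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap2.xml</loc></sitemap>
</sitemapindex>'''

root = etree.fromstring(xml)

# An unprefixed name in XPath means "no namespace", so nothing matches.
unprefixed = root.xpath("//loc")

# Bind any prefix to the default namespace URI and use it in the query.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
locs = root.xpath("//sm:loc/text()", namespaces=ns)

print(unprefixed)
print(locs)
```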

How do I use a default namespace in an lxml xpath query?

Something like this should work:

    import lxml.etree as et

    ns = {"atom": "http://www.w3.org/2005/Atom"}
    tree = et.fromstring(xml)
    for node in tree.xpath('//atom:entry', namespaces=ns):
        print(node)

See also http://lxml.de/xpathxslt.html#namespaces-and-prefixes.

Alternative:

    for node in tree.xpath("//*[local-name() = 'entry']"):
        print(node)
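
The two queries are not quite equivalent, which the following sketch tries to show. The feed content and the urn:example:other namespace are invented for illustration: the prefixed query matches only Atom entries, while local-name() matches an element called entry in any namespace.

```python
import lxml.etree as et

# Invented feed: two Atom entries plus one "entry" from an unrelated namespace.
xml = b'''<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>first</title></entry>
  <entry><title>second</title></entry>
  <x:entry xmlns:x="urn:example:other"/>
</feed>'''

tree = et.fromstring(xml)
ns = {"atom": "http://www.w3.org/2005/Atom"}

# Namespace-aware query: only elements in the Atom namespace.
atom_entries = tree.xpath("//atom:entry", namespaces=ns)

# Namespace-agnostic query: any element whose local name is "entry".
any_entries = tree.xpath("//*[local-name() = 'entry']")

print(len(atom_entries), len(any_entries))
```

If the document mixes vocabularies, the prefixed form is the safer choice; local-name() is convenient when you control the input and know there are no name clashes.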

Python pretty XML printer with lxml

For me, this issue was not solved until I noticed this little tidbit here: http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output

Short version: read in the file with this command:

    >>> parser = etree.XMLParser(remove_blank_text=True)
    >>> tree = etree.parse(filename, parser)

That will "reset" the already existing indentation, allowing the output to generate its own indentation correctly. Then pretty_print as normal:

    >>> tree.write(<output_file_name>, …
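
A self-contained sketch of the same effect, using an in-memory document instead of a file (the sample XML is made up, and tostring is used here in place of writing to disk):

```python
from lxml import etree

# Made-up input whose existing line breaks get in the way of re-indentation.
raw = b"<root>\n<a>\n<b>text</b>\n</a>\n</root>"

# Parse once as-is and once with whitespace-only text nodes dropped.
plain_root = etree.fromstring(raw)
parser = etree.XMLParser(remove_blank_text=True)
clean_root = etree.fromstring(raw, parser)

as_is = etree.tostring(plain_root, pretty_print=True).decode()
pretty = etree.tostring(clean_root, pretty_print=True).decode()

print(as_is)   # existing whitespace is kept
print(pretty)  # indentation is generated from scratch
```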

How to get path of an element in lxml?

Use getpath from ElementTree objects.

    from lxml import etree

    root = etree.fromstring('''
    <foo><bar>Data</bar><bar><baz>data</baz>
    <baz>data</baz></bar></foo>
    ''')

    tree = etree.ElementTree(root)
    for e in root.iter():
        print(tree.getpath(e))

Prints

    /foo
    /foo/bar[1]
    /foo/bar[2]
    /foo/bar[2]/baz[1]
    /foo/bar[2]/baz[2]
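
One way to sanity-check these paths is to feed them back into xpath(), which should return the element each path was generated from. A short sketch, assuming the same foo/bar/baz document:

```python
from lxml import etree

root = etree.fromstring(
    "<foo><bar>Data</bar><bar><baz>data</baz><baz>data</baz></bar></foo>"
)
tree = etree.ElementTree(root)

# Keep references to the elements so each path can be compared with its source.
elements = list(root.iter())
paths = [tree.getpath(e) for e in elements]

for element, path in zip(elements, paths):
    # Each generated path is a valid XPath expression selecting exactly one node.
    found = tree.xpath(path)
    print(path, found[0] is element)
```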