lxml - w3toppers.com

lxml etree xmlparser remove unwanted namespace

    import io
    import lxml.etree as ET

    content = b'''\
    <Envelope xmlns="http://www.example.com/zzz/yyy">
      <Header>
        <Version>1</Version>
      </Header>
      <Body>
        some stuff
      </Body>
    </Envelope>
    '''
    dom = ET.parse(io.BytesIO(content))

You can find namespace-aware nodes using the xpath method:

    body = dom.xpath('//ns:Body', namespaces={'ns': 'http://www.example.com/zzz/yyy'})
    print(body)
    # [<Element {http://www.example.com/zzz/yyy}Body at 90b2d4c>]

If you really want to remove namespaces, you could use an XSL transformation:

    # http://wiki.tei-c.org/index.php/Remove-Namespaces.xsl
    xslt = '''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> …
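As an alternative to the XSL transformation (a different technique, not the stylesheet referenced above), namespaces can also be stripped in Python by rewriting each tag to its local name. This is a minimal sketch, assuming the same Envelope document; the helper name strip_namespaces is hypothetical, and namespaced attributes are not handled.

```python
import io
import lxml.etree as ET

content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header><Version>1</Version></Header>
  <Body>some stuff</Body>
</Envelope>
'''


def strip_namespaces(root):
    """Hypothetical helper: rewrite every element tag to its local name, in place."""
    for el in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(el.tag, str):
            el.tag = ET.QName(el).localname
    # Ask lxml to drop namespace declarations that are no longer used.
    ET.cleanup_namespaces(root)
    return root


root = strip_namespaces(ET.parse(io.BytesIO(content)).getroot())

# After stripping, a plain XPath without prefixes finds the element.
body = root.xpath('//Body')
print(body[0].text)
```

This mutates the tree, so it only makes sense when the namespace information is genuinely not needed afterwards.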

parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)

You are using the decoded unicode value. Use the raw response data from r.raw instead:

    r = requests.get(url, params=payload, stream=True)
    r.raw.decode_content = True
    etree.parse(r.raw)

which will read the data from the response directly; do note the stream=True option to .get(). Setting the r.raw.decode_content = True flag ensures that the raw socket will give you the decompressed content …
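The lxml side of this error can be reproduced without a network call. This is a minimal sketch, assuming a small made-up document: a decoded str that still carries an encoding declaration is rejected with ValueError, while the undecoded bytes parse fine.

```python
from lxml import etree

# Made-up sample: UTF-8 bytes that include an XML encoding declaration.
xml_bytes = '<?xml version="1.0" encoding="utf-8"?><root>café</root>'.encode("utf-8")

error = None
try:
    # Passing decoded text mirrors parsing an already-decoded response body.
    etree.fromstring(xml_bytes.decode("utf-8"))
except ValueError as exc:
    error = exc

# Passing the bytes lets the parser apply the declared encoding itself.
root = etree.fromstring(xml_bytes)
print(type(error).__name__, root.text)
```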

Parse SGML with Open Arbitrary Tags in Python 3

If you can find an SGML DTD for the documents that you work with, a solution could be to use the osx SGML to XML converter from the OpenSP SGML toolkit to turn the documents into XML. Here is a simple example. Let’s say that we have the following SGML document (company.sgml; with a root …

How to use regular expression in lxml xpath?

You can do this (although you don’t need regular expressions for the example). lxml supports regular expressions from the EXSLT extension functions (see the lxml docs for the XPath class; it also works for the xpath() method):

    doc.xpath("//a[re:match(text(), 'some text')]", namespaces={"re": "http://exslt.org/regular-expressions"})

Note that you need to give the namespace mapping, so that it …
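Here is a self-contained sketch of the same idea, assuming a made-up document. It uses re:test, the EXSLT function that returns a boolean, rather than re:match; the pattern and the element content are illustrative only.

```python
from lxml import etree

# Made-up document with three links.
doc = etree.fromstring(
    '<div>'
    '<a href="/1">item-12</a>'
    '<a href="/2">about</a>'
    '<a href="/3">item-x</a>'
    '</div>'
)

# The "re" prefix must be mapped to the EXSLT regular-expressions namespace.
RE_NS = "http://exslt.org/regular-expressions"

# Keep only links whose text is "item-" followed by digits.
hits = doc.xpath(r"//a[re:test(text(), '^item-\d+$')]", namespaces={"re": RE_NS})
hrefs = [a.get("href") for a in hits]
print(hrefs)
```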

parsing xml containing default namespace to get an element value using lxml

This is a common error when dealing with XML that has a default namespace. Your XML has a default namespace, a namespace declared without a prefix, here:

    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

Note that it is not only the element where the default namespace is declared that is in that namespace: all descendant elements inherit the ancestor's default namespace implicitly, unless otherwise specified (using explicit namespace prefix …
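The effect can be seen in a short sketch, assuming a made-up sitemap index; the prefix "sm" is an arbitrary choice, only the namespace URI has to match the document.

```python
from lxml import etree

# Made-up sitemap index that declares a default namespace on the root.
xml = b'''<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://www.example.com/sitemap1.xml</loc></sitemap>
  <sitemap><loc>http://www.example.com/sitemap2.xml</loc></sitemap>
</sitemapindex>'''
root = etree.fromstring(xml)

# An unprefixed name in XPath means "no namespace", so nothing matches.
unprefixed = root.xpath("//loc")

# Mapping any prefix to the default namespace URI makes the query match.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
prefixed = root.xpath("//sm:loc/text()", namespaces=ns)

print(unprefixed, prefixed)
```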

SyntaxError of Non-ASCII character [duplicate]

You should define the source code encoding; add this to the top of your script:

    # -*- coding: utf-8 -*-

The reason why it works differently in the console and in the IDE is, likely, that different default encodings are set. You can check it by running:

    import sys
    print(sys.getdefaultencoding())

Also see: Why declare unicode by …

How do I use a default namespace in an lxml xpath query?

Something like this should work:

    import lxml.etree as et

    ns = {"atom": "http://www.w3.org/2005/Atom"}
    tree = et.fromstring(xml)
    for node in tree.xpath('//atom:entry', namespaces=ns):
        print(node)

See also http://lxml.de/xpathxslt.html#namespaces-and-prefixes.

Alternative:

    for node in tree.xpath("//*[local-name() = 'entry']"):
        print(node)

Python pretty XML printer with lxml

For me, this issue was not solved until I noticed this little tidbit here: http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output

Short version: Read in the file with this command:

    >>> parser = etree.XMLParser(remove_blank_text=True)
    >>> tree = etree.parse(filename, parser)

That will “reset” the already existing indentation, allowing the output to generate its own indentation correctly. Then pretty_print as normal:

    >>> tree.write(<output_file_name>, …
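The same steps can be tried in memory. This is a minimal sketch, assuming a made-up, badly indented input; it serializes with etree.tostring instead of tree.write so that no file is needed.

```python
from lxml import etree

# Made-up input with inconsistent existing indentation.
messy = b"<root>\n<a>\n        <b>1</b></a>\n</root>"

# Drop whitespace-only text nodes while parsing, so old indentation is gone.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(messy, parser)

# With the old whitespace removed, pretty_print can indent from scratch.
pretty = etree.tostring(root, pretty_print=True).decode("ascii")
print(pretty)
```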

How to get path of an element in lxml?

Use getpath from ElementTree objects.

    from lxml import etree

    root = etree.fromstring('''
    <foo><bar>Data</bar><bar><baz>data</baz>
    <baz>data</baz></bar></foo>
    ''')

    tree = etree.ElementTree(root)
    for e in root.iter():
        print(tree.getpath(e))

Prints

    /foo
    /foo/bar[1]
    /foo/bar[2]
    /foo/bar[2]/baz[1]
    /foo/bar[2]/baz[2]
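The paths returned by getpath are ordinary XPath expressions, so each one can be fed back into xpath() to locate the same element. Here is a small sketch of that round trip, assuming the same sample document:

```python
from lxml import etree

root = etree.fromstring(
    "<foo><bar>Data</bar><bar><baz>data</baz><baz>data</baz></bar></foo>"
)
tree = etree.ElementTree(root)

# Collect the structural path of every element in document order.
paths = [tree.getpath(e) for e in root.iter()]

# Evaluating each path should return the very element it was built from.
round_trip = all(tree.xpath(p)[0] is e for p, e in zip(paths, root.iter()))

print(paths, round_trip)
```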

Building lxml for Python 2.7 on Windows

I bet you’re not using VS 2008 for this 🙂 There’s a def find_vcvarsall(version): function (guess what, it looks for vcvarsall.bat) in distutils with the following comment:

    At first it tries to find the productdir of VS 2008 in the registry. If that fails it falls back to the VS90COMNTOOLS env var.

If you’re not using …