lxml - w3toppers.com

builtins.TypeError: must be str, not bytes

The output file should be opened in binary mode, because the serialized XML is written as bytes: outFile = open('output.xml', 'wb')
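A minimal sketch of the fix, assuming the XML is serialized with etree.tostring(), which returns bytes by default. The element names and the temporary output path are made up for illustration, and the import falls back to the standard library's ElementTree, which behaves the same way for these calls.

```python
import os
import tempfile

try:
    from lxml import etree  # the library this page is about
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same calls here

# Hypothetical document built only for this example.
root = etree.Element("root")
etree.SubElement(root, "item").text = "value"

# tostring() returns bytes by default, so the target file must be binary.
data = etree.tostring(root)

out_path = os.path.join(tempfile.mkdtemp(), "output.xml")  # illustrative path
with open(out_path, "wb") as out_file:  # "w" here would raise the TypeError
    out_file.write(data)
```

Opening the file with 'w' instead of 'wb' makes write() expect str, which is what triggers the error in the title.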

pip is not able to install packages correctly: Permission denied error [duplicate]

It looks like you’re having a permissions error, based on this message in your output: error: could not create '/lib/python2.7/site-packages/lxml': Permission denied. One thing you can try is a user install of the package with pip install lxml --user. For more information on how that works, check out this StackOverflow answer. (Thanks to Ishaan …

using lxml and iterparse() to parse a big (+- 1Gb) XML file

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for child in element:
        print(child.tag, child.text)
    element.clear()

The final clear() will stop you from using too much memory.

[update:] To get “everything between … as a string” I guess you want one of:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    print(etree.tostring(element))
    element.clear()

or

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    print(''.join([etree.tostring(child) …

Remove namespace and prefix from xml in python using lxml

We can get the desired output document in two steps:

1. Remove namespace URIs from element names
2. Remove unused namespace declarations from the XML tree

Example code:

from lxml import etree

input_xml = """
<package xmlns="http://apple.com/itunes/importer">
  <provider>some data</provider>
  <language>en-GB</language>
  <!-- some comment -->
  <?xml-some-processing-instruction ?>
</package>
"""

root = etree.fromstring(input_xml)

# Iterate through all XML elements …
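The excerpt is cut off before the loop, so the following is only a sketch of how the two steps could look, assuming lxml is installed. It uses etree.QName to get the local name and etree.cleanup_namespaces() for the second step. The sample input is shortened, and the original answer may differ in detail.

```python
from lxml import etree

input_xml = """<package xmlns="http://apple.com/itunes/importer">
  <provider>some data</provider>
  <language>en-GB</language>
  <!-- some comment -->
</package>"""

root = etree.fromstring(input_xml)

# Step 1: replace each qualified tag name with its local name.
for elem in root.iter():
    # Comments and processing instructions have a non-string tag; skip them.
    if not isinstance(elem.tag, str):
        continue
    elem.tag = etree.QName(elem).localname

# Step 2: drop namespace declarations that are no longer used.
etree.cleanup_namespaces(root)

result = etree.tostring(root).decode()
```

Without the second step the xmlns declaration would still be serialized on the root element, even though no element name refers to it any more.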

bs4.FeatureNotFound: Couldn’t find a tree builder with the features you requested: lxml. Do you need to install a parser library?

I have a suspicion that this is related to the parser that BS will use to read the HTML. They document this here, but if you’re like me (on OSX) you might be stuck with something that requires a bit of work: You’ll notice that in the BS4 documentation page above, they point out that …

How to select following sibling/XML tag using XPath

How would I accomplish the nextsibling and is there an easier way of doing this?

You may use:

tr/td[@class="name"]/following-sibling::td

but I’d rather use directly:

tr[td[@class="name"] = 'Brand']/td[@class="desc"]

This assumes that:

- The context node, against which the XPath expression is evaluated, is the parent of all tr elements (not shown in your question).
- Each tr element …
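To see both expressions side by side, here is a short sketch using lxml's xpath() method. The table markup and cell values are invented for the example, since the question's document is not shown. The table element serves as the context node, that is, the parent of all tr elements.

```python
from lxml import etree

# Hypothetical table; the question's real markup is not shown.
table = etree.fromstring(
    "<table>"
    "<tr><td class='name'>Brand</td><td class='desc'>Intel</td></tr>"
    "<tr><td class='name'>Series</td><td class='desc'>Core</td></tr>"
    "</table>"
)

# Every td that follows a td with class "name", in every row.
siblings = table.xpath('tr/td[@class="name"]/following-sibling::td')

# Only the "desc" cell of the row whose "name" cell says Brand.
brand_desc = table.xpath('tr[td[@class="name"] = "Brand"]/td[@class="desc"]')

sibling_texts = [td.text for td in siblings]
brand_texts = [td.text for td in brand_desc]
```

The first expression returns the following cells of every row, while the second narrows the match to a single row by testing the text of its name cell.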

How can this function be rewritten to implement OrderedDict? [duplicate]

You could use the new OrderedDict dict subclass, which was added to the standard library’s collections module in version 2.7✶. Actually what you need is an Ordered+defaultdict combination, which doesn’t exist, but it’s possible to create one by subclassing OrderedDict as illustrated below: ✶ If your version of Python doesn’t have OrderedDict, you should be …
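The illustration itself is cut off in this excerpt. One common way to build such a combination is sketched here in Python 3 syntax. The class name DefaultOrderedDict and the sample words are assumptions, not the original answer's code.

```python
from collections import OrderedDict


class DefaultOrderedDict(OrderedDict):
    """OrderedDict that creates missing values the way defaultdict does."""

    def __init__(self, default_factory=None, *args, **kwargs):
        if default_factory is not None and not callable(default_factory):
            raise TypeError("first argument must be callable or None")
        super().__init__(*args, **kwargs)
        self.default_factory = default_factory

    def __missing__(self, key):
        # Called by dict.__getitem__ when the key is absent.
        if self.default_factory is None:
            raise KeyError(key)
        value = self[key] = self.default_factory()
        return value


# Group words by first letter, keeping first-seen order of the letters.
groups = DefaultOrderedDict(list)
for word in ["pear", "apple", "plum", "avocado"]:
    groups[word[0]].append(word)
```

On Python 3.7 and later, plain dict and collections.defaultdict already preserve insertion order, so a subclass like this mainly matters for older interpreters or when OrderedDict-specific methods such as move_to_end() are needed.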

Get all text inside a tag in lxml

Just use the node.itertext() method, as in: ''.join(node.itertext())
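A quick sketch of what that expression returns, using a made-up paragraph with nested tags. The import falls back to the standard library's ElementTree, which also provides itertext().

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib also has itertext()

# Hypothetical node with text spread across nested children and tails.
node = etree.fromstring("<p>Hello <b>big</b> world<i>!</i></p>")

# itertext() walks the subtree in document order, yielding text and tails.
text = "".join(node.itertext())
```

Here text ends up as "Hello big world!", including the text that sits after the closing b tag.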

Parsing HTML in python – lxml or BeautifulSoup? Which of these is better for what kinds of purposes?

Pyquery provides the jQuery selector interface to Python (using lxml under the hood). http://pypi.python.org/pypi/pyquery It’s really awesome, I don’t use anything else anymore.

how to remove an element in lxml

Use the remove method of an xmlElement:

from lxml import etree as et

tree = et.fromstring(xml)  # xml is the document string from the question
for bad in tree.xpath("//fruit[@state='rotten']"):
    # grab the parent of the element and call remove() directly on it
    bad.getparent().remove(bad)
print(et.tostring(tree, pretty_print=True, xml_declaration=True).decode())

If I had to compare with the @Acorn version, mine will work even if the elements to remove are not directly …
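For a self-contained run of the same approach, here is a sketch with an invented groceries document, assuming lxml is installed. One rotten fruit is nested inside a basket element to show that the depth of the match does not matter.

```python
from lxml import etree

# Hypothetical input; the question's real document is not shown.
xml = """<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <basket><fruit state="rotten">plum</fruit></basket>
</groceries>"""

tree = etree.fromstring(xml)

# xpath() returns a list, so removing elements inside the loop is safe.
for bad in tree.xpath("//fruit[@state='rotten']"):
    bad.getparent().remove(bad)

remaining = [fruit.text for fruit in tree.iter("fruit")]
```

Only the fresh pear is left afterwards, and the now empty basket element stays in the tree.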