builtins.TypeError: must be str, not bytes
The outfile should be in binary mode: outFile = open('output.xml', 'wb')
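A minimal sketch of why the mode matters. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption, since the original post's code is not shown), and the element names and temporary path are made up. Serializing a tree yields bytes, which a text-mode file rejects in Python 3.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical document; the original post's data is not shown.
root = ET.Element("root")
ET.SubElement(root, "item").text = "hello"

# In Python 3, tostring() returns bytes by default, not str.
data = ET.tostring(root)

# Hypothetical output location in a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "output.xml")

# Text mode ('w') rejects bytes with a TypeError.
try:
    with open(out_path, "w") as out_file:
        out_file.write(data)
    text_mode_failed = False
except TypeError:
    text_mode_failed = True

# Binary mode ('wb') accepts the bytes as-is.
with open(out_path, "wb") as out_file:
    out_file.write(data)
```

The same reasoning should apply to lxml, whose serializer also produces bytes unless asked for a unicode string.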
It looks like you’re having a permissions error, based on this message in your output: error: could not create '/lib/python2.7/site-packages/lxml': Permission denied. One thing you can try is doing a user install of the package with pip install lxml --user. For more information on how that works, check out this StackOverflow answer. (Thanks to Ishaan …
for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for child in element:
        print(child.tag, child.text)
    element.clear()

The final clear will stop you from using too much memory.

[Update:] To get "everything between … as a string", I guess you want one of:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    print(etree.tostring(element))
    element.clear()

or

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    print(''.join([etree.tostring(child) …
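For readers without lxml at hand, here is a rough sketch of the same streaming pattern using the standard library's xml.etree.ElementTree. This is a swapped-in stand-in, not the answer's own code: the stdlib iterparse has no tag= filter, so the tag is checked by hand, and the sample feed (everything except the BlogPost tag) is made up.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical feed; element names other than BlogPost are invented.
xml_bytes = b"""<Blog>
  <BlogPost><Title>First</Title><Body>Hello</Body></BlogPost>
  <BlogPost><Title>Second</Title><Body>World</Body></BlogPost>
</Blog>"""

seen = []
# Only "end" events are requested, so each element is complete when seen.
for event, element in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
    if element.tag == "BlogPost":
        seen.append([(child.tag, child.text) for child in element])
        # Drop the children and text that are no longer needed.
        element.clear()

print(seen)
```

Collecting the values before calling clear() matters, because clear() empties the element in place.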
We can get the desired output document in two steps:

1. Remove namespace URIs from element names
2. Remove unused namespace declarations from the XML tree

Example code:

from lxml import etree

input_xml = """
<package xmlns="http://apple.com/itunes/importer">
  <provider>some data</provider>
  <language>en-GB</language>
  <!-- some comment -->
  <?xml-some-processing-instruction ?>
</package>
"""

root = etree.fromstring(input_xml)

# Iterate through all XML elements …
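Since the example above is cut off, the following is only a sketch of the first step, written against the standard library's xml.etree.ElementTree rather than lxml (an assumption made so it runs without extra packages). The comment and processing instruction are left out of the sample because the stdlib parser discards them. The second step has no direct counterpart here: the stdlib serializer only writes declarations for namespaces still in use, whereas in lxml that cleanup is normally done with etree.cleanup_namespaces(), which may or may not be what the truncated original used.

```python
import xml.etree.ElementTree as ET

input_xml = """
<package xmlns="http://apple.com/itunes/importer">
  <provider>some data</provider>
  <language>en-GB</language>
</package>
"""

root = ET.fromstring(input_xml)

for elem in root.iter():
    # Namespaced tags look like "{uri}local"; keep only the local part.
    if isinstance(elem.tag, str) and elem.tag.startswith("{"):
        elem.tag = elem.tag.split("}", 1)[1]

# With no namespaced tags left, no xmlns declaration is written.
result = ET.tostring(root, encoding="unicode")
print(result)
```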
I have a suspicion that this is related to the parser that BS will use to read the HTML. They document this here, but if you’re like me (on OSX) you might be stuck with something that requires a bit of work: You’ll notice that in the BS4 documentation page above, they point out that …
"How would I accomplish the nextsibling and is there an easier way of doing this?"

You may use:

tr/td[@class="name"]/following-sibling::td

but I’d rather use directly:

tr[td[@class="name"] = 'Brand']/td[@class="desc"]

This assumes that:

The context node, against which the XPath expression is evaluated, is the parent of all tr elements (not shown in your question).
Each tr element …
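To try the idea without lxml, here is a small sketch using the standard library's xml.etree.ElementTree. Its XPath support is limited (no following-sibling axis and no nested predicates), so the expression below is a simplified variant of the second one above, not the same expression: it matches any tr with a td whose text is 'Brand'. The table contents are made up, and the table element plays the role of the context node.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the question's real table is not shown.
table = ET.fromstring("""
<table>
  <tr><td class="name">Brand</td><td class="desc">Acme</td></tr>
  <tr><td class="name">Model</td><td class="desc">X1</td></tr>
</table>
""")

# Pick the row that has a td with text 'Brand', then its "desc" cell.
matches = table.findall("tr[td='Brand']/td[@class='desc']")
descriptions = [td.text for td in matches]
print(descriptions)
```

With lxml, the full expressions from the answer can be passed to the element's xpath() method instead.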
You could use the new OrderedDict dict subclass, which was added to the standard library’s collections module in version 2.7✶. Actually, what you need is an Ordered+defaultdict combination, which doesn’t exist, but it’s possible to create one by subclassing OrderedDict as illustrated below:

✶ If your version of Python doesn’t have OrderedDict, you should be …
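The answer's own code is cut off, so the following is one possible sketch of such a subclass, not the author's version. The class name OrderedDefaultDict and the sample keys are made up; it relies on the __missing__ hook that dict subclasses get on failed lookups.

```python
from collections import OrderedDict


class OrderedDefaultDict(OrderedDict):
    """OrderedDict that creates missing values like defaultdict (sketch)."""

    def __init__(self, default_factory=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.default_factory = default_factory

    def __missing__(self, key):
        # Called by dict.__getitem__ when the key is absent.
        if self.default_factory is None:
            raise KeyError(key)
        value = self[key] = self.default_factory()
        return value


d = OrderedDefaultDict(list)
d["b"].append(1)  # "b" is created on first access
d["a"].append(2)
print(list(d.items()))
```

Keys come back in insertion order, and missing keys are filled in by the factory on first lookup.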
Just use the node.itertext() method, as in: ''.join(node.itertext())
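A quick illustration, using the standard library's xml.etree.ElementTree, which has the same itertext() method as lxml. The sample markup is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical node with text split around a nested element.
node = ET.fromstring("<p>Hello <b>big</b> world</p>")

# itertext() yields every text fragment in document order.
text = "".join(node.itertext())
print(text)
```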
Pyquery provides the jQuery selector interface to Python (using lxml under the hood). http://pypi.python.org/pypi/pyquery It’s really awesome, I don’t use anything else anymore.
Use the remove() method of an element:

tree = et.fromstring(xml)
for bad in tree.xpath("//fruit[@state='rotten']"):
    # grab the parent of the element and call remove() directly on it
    bad.getparent().remove(bad)
print(et.tostring(tree, pretty_print=True, xml_declaration=True).decode())

If I had to compare with the @Acorn version, mine will work even if the elements to remove are not directly …
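The same remove-via-parent idea can be sketched with the standard library's xml.etree.ElementTree, with one caveat: stdlib elements have no getparent(), so this version builds a child-to-parent lookup first. That lookup, the name parent_of, and the sample document are assumptions for illustration, not part of the answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical basket; one rotten fruit is nested one level deeper.
xml = """
<basket>
  <fruit state="fresh">apple</fruit>
  <crate><fruit state="rotten">pear</fruit></crate>
  <fruit state="rotten">plum</fruit>
</basket>
"""
tree = ET.fromstring(xml)

# Stdlib elements have no getparent(), so map each child to its parent.
parent_of = {child: parent for parent in tree.iter() for child in parent}

# findall() returns a list, so removing inside the loop is safe.
for bad in tree.findall(".//fruit[@state='rotten']"):
    parent_of[bad].remove(bad)

remaining = [fruit.text for fruit in tree.iter("fruit")]
print(remaining)
```

Because each element is removed from its own parent, nested matches are handled the same way as top-level ones.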