Empty list returned from ElementTree findall

The problem is that you are not taking XML namespaces into account. The XML document (and all the elements in it) is in the http://www.mediawiki.org/xml/export-0.7/ namespace. To make it work, you need to change

titles = document.findall('.//title')

to this:

titles = document.findall('.//{http://www.mediawiki.org/xml/export-0.7/}title')

The namespace can also be provided via the namespaces parameter:

NSMAP = {'mw':'http://www.mediawiki.org/xml/export-0.7/'}
titles = document.findall('.//mw:title', namespaces=NSMAP)

This works in Python 2.7, but it is not explained in the Python 2.7 documentation (the Python 3.3 documentation is better).
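To see the difference between the three lookups side by side, here is a minimal sketch. The sample XML is a made-up, heavily trimmed stand-in for a real MediaWiki export, and the names SAMPLE and MW_NS are only for illustration:

```python
import xml.etree.ElementTree as etree

# Hypothetical, trimmed-down export; a real dump has many more elements.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/">
  <page><title>First page</title></page>
  <page><title>Second page</title></page>
</mediawiki>"""

MW_NS = 'http://www.mediawiki.org/xml/export-0.7/'
NSMAP = {'mw': MW_NS}

document = etree.fromstring(SAMPLE)

# No namespace given: nothing matches, so the result is an empty list.
unqualified = document.findall('.//title')

# Fully qualified tag name in {namespace}tag form.
qualified = document.findall('.//{%s}title' % MW_NS)

# Prefix resolved through the namespaces mapping.
prefixed = document.findall('.//mw:title', namespaces=NSMAP)

titles = [t.text for t in prefixed]
print(titles)
```

The 'mw' prefix is arbitrary. It only has to match the key used in NSMAP, not any prefix used in the XML file itself.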

See also http://effbot.org/zone/element-namespaces.htm and this SO question with answer: Parsing XML with namespace in Python via ‘ElementTree’.

The trouble with iterparse() is caused by the fact that this function provides (event, element) tuples (not just elements). In order to get the tag name, change

for e in etree.iterparse(file_name):
    print e.tag

to this:

for e in etree.iterparse(file_name):
    print e[1].tag
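Unpacking the tuple in the loop header reads a little more clearly than indexing with e[1]. A minimal sketch, using an in-memory sample (made up for illustration) instead of file_name:

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical sample data standing in for the real export file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/">
  <page><title>First page</title></page>
</mediawiki>"""

seen = []
# By default iterparse() reports only "end" events, one per closed element.
for event, elem in etree.iterparse(io.BytesIO(SAMPLE)):
    seen.append((event, elem.tag))
    print('%s %s' % (event, elem.tag))
```

Note that the tags printed here are also namespace-qualified, in the same {namespace}tag form that findall() expects.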
