The problem is that you are not taking XML namespaces into account. The XML document (and all the elements in it) is in the http://www.mediawiki.org/xml/export-0.7/ namespace. To make it work, you need to change
titles = document.findall('.//title')
to
titles = document.findall('.//{http://www.mediawiki.org/xml/export-0.7/}title')
The namespace can also be provided via the namespaces parameter:
NSMAP = {'mw':'http://www.mediawiki.org/xml/export-0.7/'}
titles = document.findall('.//mw:title', namespaces=NSMAP)
This works in Python 2.7, but it is not explained in the Python 2.7 documentation (the Python 3.3 documentation is better).
See also http://effbot.org/zone/element-namespaces.htm and this SO question with answer: Parsing XML with namespace in Python via ‘ElementTree’.
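To see the difference between the three lookups side by side, here is a minimal sketch. It assumes etree is xml.etree.ElementTree (the question may have used another ElementTree-compatible module), and the SAMPLE string is a made-up miniature export used only for illustration, not the real dump.

```python
import xml.etree.ElementTree as etree

# Hypothetical two-page document in the MediaWiki export namespace.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/">
  <page><title>First page</title></page>
  <page><title>Second page</title></page>
</mediawiki>"""

MW_NS = 'http://www.mediawiki.org/xml/export-0.7/'
NSMAP = {'mw': MW_NS}

root = etree.fromstring(SAMPLE)

# Unqualified name: matches nothing, because every element is namespaced.
no_ns = root.findall('.//title')

# Fully qualified name in {uri}tag form.
qualified = root.findall('.//{%s}title' % MW_NS)

# Prefix form, resolved through the namespaces mapping.
prefixed = root.findall('.//mw:title', namespaces=NSMAP)

titles = [t.text for t in prefixed]
```

The prefix mw is arbitrary; only the URI it maps to has to match the one declared in the document.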
The trouble with iterparse() is caused by the fact that this function provides (event, element) tuples (not just elements). In order to get the tag name, change
for e in etree.iterparse(file_name):
    print e.tag
to this:
for e in etree.iterparse(file_name):
    print e[1].tag
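Unpacking the tuple in the loop header may read more clearly than indexing with e[1]. The sketch below assumes etree is xml.etree.ElementTree and feeds iterparse() an in-memory file object built from a made-up sample instead of a real file name.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical one-page document, wrapped in a file-like object.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/">
  <page><title>First page</title></page>
</mediawiki>"""

seen = []
# With no events argument, iterparse() reports only "end" events.
for event, elem in etree.iterparse(io.BytesIO(SAMPLE)):
    seen.append((event, elem.tag))
```

Note that the tags collected here carry the namespace in {uri}tag form, so a comparison against a plain 'title' would not match.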