Why is ElementTree raising a ParseError?

Here are some ideas:

(0) Explain “a file” and “occasionally”: do you really mean it works sometimes and fails sometimes with the same file?

Do the following for each failing file:

(1) Find out what is in the file at the point that it is complaining about:

text = open("the_file.xml", "rb").read()
err_col = 52459
print repr(text[err_col-50:err_col+100]) # should include the error text
print repr(text[:50]) # show the XML declaration

(2) Throw your file at a web-based XML validation service e.g. http://www.validome.org/xml/ or http://validator.aborla.net/

and edit your question to display your findings.

Update: Here is the minimal xml file that illustrates your problem:

[badcharref.xml]
<a>&#1;</a>

[Python 2.7.1 output]
>>> import xml.etree.ElementTree as ET
>>> it = ET.iterparse(file("badcharref.xml"))
>>> for ev, el in it:
...     print el.tag
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python27\lib\xml\etree\ElementTree.py", line 1258, in next
    self._parser.feed(data)
  File "C:\python27\lib\xml\etree\ElementTree.py", line 1624, in feed
    self._raiseerror(v)
  File "C:\python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: reference to invalid character number: line 1, column 3
>>>

Not all valid Unicode characters are valid in XML. See the XML 1.0 Specification.

You may wish to examine your files using regexes like r'&#([0-9]+);' and r'&#x([0-9A-Fa-f]+);', convert the matched text to an int ordinal and check against the valid list from the spec i.e. #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

… or maybe the numeric character reference is syntactically invalid e.g. not terminated by a ;‘, &#not-a-digit etc etc

Update 2 I was wrong, the number in the ElementTree error message is counting Unicode code points, not bytes. See the code below and snippets from the output from running it over the two bad files.

# coding: ascii
# Find numeric character references that refer to Unicode code points
# that are not valid in XML.
# Get byte offsets for seeking etc in undecoded file bytestreams.
# Get unicode offsets for checking against ElementTree error message,
# **IF** your input file is small enough. 

BYTE_OFFSETS = True
import sys, re, codecs
fname = sys.argv[1]
print fname
if BYTE_OFFSETS:
    text = open(fname, "rb").read()
else:
    # Assumes file is encoded in UTF-8.
    text = codecs.open(fname, "rb", "utf8").read()
rx = re.compile("&#([0-9]+);|&#x([0-9a-fA-F]+);")
endpos = len(text)
pos = 0
while pos < endpos:
    m = rx.search(text, pos)
    if not m: break
    mstart, mend = m.span()
    target = m.group(1)
    if target:
        num = int(target)
    else:
        num = int(m.group(2), 16)
    # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    if not(num in (0x9, 0xA, 0xD) or 0x20 <= num <= 0xD7FF
    or 0xE000 <= num <= 0xFFFD or 0x10000 <= num <= 0x10FFFF):
        print mstart, m.group()
    pos = mend

Output:

comments.xml
6615405 &#x10;
10205764 &#x00;
10213901 &#x00;
10213936 &#x00;
10214123 &#x00;
13292514 &#x03;
...
155656543 &#x1B;
155656564 &#x1B;
157344876 &#x10;
157722583 &#x10;

posts.xml
7607143 &#x1F;
12982273 &#x1B;
12982282 &#x1B;
12982292 &#x1B;
12982302 &#x1B;
12982310 &#x1B;
16085949 &#x1C;
16085955 &#x1C;
...
36303479 &#x12;
36303494 &#xFFFF; <<=== whoops
38942863 &#x10;
...
785292911 &#x08;
801282472 &#x13;
848911592 &#x0B;

More Related Contents:

Leave a Comment Cancel reply