The syntax of your Unicode range will not do what you expect.
- The raw `r''` string prevents `\u` escapes from being parsed, and the regex engine will not parse them either. The only range in this set is `0-\u`, which the engine reads as the characters from `0` to `u`:

      >>> re.compile(r'[\u0020-\u00d7ff]', re.DEBUG)
      in
        literal 117
        literal 48
        literal 48
        literal 50
        range (48, 117)
        literal 48
        literal 48
        literal 100
        literal 55
        literal 102
        literal 102
- Making it a raw Unicode literal (`ur''`) causes `\u` parsing while leaving other backslashes alone (although that's not a concern here), but the leading zeroes are messing it up. The syntax is `\uxxxx` (exactly four hex digits) or `\Uxxxxxxxx` (exactly eight), so `\u00d7ff` is parsed as `\u00d7`, `f`, `f`:

      >>> re.compile(ur'[\u0020-\u00d7ff]', re.DEBUG)
      in
        range (32, 215)
        literal 102
        literal 102
- Removing the leading zeroes or switching to `\U0000d7ff` will fix it:

      >>> re.compile(ur'[\u0020-\ud7ff]', re.DEBUG)
      in
        range (32, 55295)
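
If you want to check the four-digit rule without reading the debug dump, here is a minimal sketch. It assumes Python 3, where the `ur''` prefix no longer exists and every `str` literal is Unicode, so it uses plain (non-raw) literals to get the same `\u` expansion. The names `broken`, `fixed` and `sample` are just illustrative.

```python
import re

# Python 3 sketch: a plain (non-raw) str literal expands \u escapes,
# much like ur'' does in the Python 2 sessions above.

# \u takes exactly four hex digits, so the trailing "ff" stays literal text.
assert '\u00d7ff' == '\u00d7' + 'ff'
assert len('\u00d7ff') == 3

# \U takes exactly eight hex digits and yields the same single character.
assert '\U0000d7ff' == '\ud7ff'
assert len('\ud7ff') == 1

broken = re.compile('[\u0020-\u00d7ff]')  # range U+0020..U+00D7, plus "f", "f"
fixed = re.compile('[\u0020-\ud7ff]')     # range U+0020..U+D7FF

sample = '\u4e2d'  # a character above U+00D7 but below U+D7FF
print(broken.match(sample))             # None: outside the truncated range
print(fixed.match(sample) is not None)  # True
```

The `broken` pattern still compiles without complaint, which is why this kind of mistake is easy to miss: the extra `ff` just becomes two more members of the character set.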