Python Unicode regular expression matching failing with some Unicode characters - bug or mistake?

It is a bug in the re module and it is fixed in the regex module:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import unicodedata
import re
import regex  # $ pip install regex

word = "किशोरी"


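def test(re_):
    # The whole word should consist of \w characters only.
    assert re_.search("^\\w+$", word, flags=re_.UNICODE)

# General category of each of the 6 codepoints.
print([unicodedata.category(cp) for cp in word])
# \X matches one extended grapheme cluster (user-perceived character).
print(" ".join(ch for ch in regex.findall("\\X", word)))
assert all(regex.match("\\w$", c) for c in ["a", "\u093f", "\u0915"])

test(regex)
test(re)  # fails: raises AssertionError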

The output shows that there are 6 codepoints in "किशोरी", but only 3 user-perceived characters (extended grapheme clusters). It would be wrong to break a word inside a character. Unicode Text Segmentation says:

Word boundaries, line boundaries, and sentence boundaries should not
occur within a grapheme cluster: in other words, a grapheme cluster
should be an atomic unit with respect to the process of determining
these other boundaries.

(Here and below, emphasis is mine.)
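
The gap between codepoints and user-perceived characters can also be seen with the standard library alone. The following is a minimal sketch, assuming Python 3. The clustering step is a deliberate simplification (it attaches every combining mark to the preceding codepoint) rather than the full extended grapheme cluster algorithm, but it is enough for this particular word:

```python
import unicodedata

# "किशोरी" written as explicit codepoints
word = "\u0915\u093f\u0936\u094b\u0930\u0940"

categories = [unicodedata.category(cp) for cp in word]
for cp, cat in zip(word, categories):
    print("U+%04X %-2s %s" % (ord(cp), cat, unicodedata.name(cp)))

# Simplified clustering (an assumption, not the full UAX #29 rules):
# attach every mark (general category M*) to the preceding codepoint.
clusters = []
for cp in word:
    if clusters and unicodedata.category(cp).startswith("M"):
        clusters[-1] += cp
    else:
        clusters.append(cp)

print(len(word), "codepoints,", len(clusters), "clusters")
```

Every second codepoint is a spacing combining mark (a vowel sign) that belongs to the letter before it.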

A word boundary \b is defined as a transition from \w to \W (or in reverse) in the docs:

Note that formally, \b is defined as the boundary between a \w and a
\W character (or vice versa), or between \w and the beginning/end of
the string, …

Therefore either all codepoints that form a single character are \w or they are all \W.
In this case all six codepoints are \w, so with the regex module "किशोरी" matches ^\w{6}$.
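
For contrast, here is a small sketch of how the re module treats the same word. It uses only the standard library and assumes Python 3, where str patterns are Unicode-aware by default. It relies on re treating \w as "alphanumeric or underscore", which puts the combining vowel signs on the \W side:

```python
import re

word = "\u0915\u093f\u0936\u094b\u0930\u0940"  # "किशोरी"

# Codepoints that re considers \w, and the separate \w+ runs it finds.
word_chars = re.findall(r"\w", word)
fragments = re.findall(r"\w+", word)

print(len(word_chars), "of", len(word), "codepoints match \\w")
print(len(fragments), "separate \\w+ runs")
print(re.search(r"^\w+$", word))
```

Each vowel sign splits the word, so re ends up placing a \b inside every grapheme cluster.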

From the docs for \w in Python 2:

If UNICODE is set, this will match the characters [0-9_] plus
whatever is classified as alphanumeric in the Unicode character
properties database.

in Python 3:

Matches Unicode word characters; this includes most characters that
can be part of a word in any language, as well as numbers and the
underscore.

From regex docs:

Definition of ‘word’ character (issue #1693050):

The definition of a ‘word’ character has been expanded for Unicode. It now conforms to the Unicode specification at
http://www.unicode.org/reports/tr29/. This applies to \w, \W, \b and
\B.

According to unicode.org, U+093F (DEVANAGARI VOWEL SIGN I) is alnum and alphabetic, so regex is also correct to consider it \w, even if we follow definitions that are not based on word boundaries.
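
If installing regex is not an option, a broader notion of "word character" can be approximated by hand. The sketch below is only a stand-in under stated assumptions, not how regex implements \w. The helper names is_word_char and is_word are hypothetical. The code treats all combining marks (general categories Mn, Mc, Me) as word characters in addition to whatever str.isalnum() accepts, because unicodedata does not expose the Alphabetic property directly:

```python
import unicodedata


def is_word_char(ch):
    """Rough approximation of a Unicode-aware \\w (hypothetical helper)."""
    return (
        ch == "_"
        or ch.isalnum()
        or unicodedata.category(ch).startswith("M")  # combining marks
    )


def is_word(s):
    """True if s is non-empty and every codepoint is a word character."""
    return bool(s) and all(is_word_char(ch) for ch in s)


word = "\u0915\u093f\u0936\u094b\u0930\u0940"  # "किशोरी"
print("\u093f".isalnum())      # what str.isalnum() says about U+093F
print(is_word_char("\u093f"))  # what the approximation says
print(is_word(word))
```

This may accept more characters than a definition based strictly on the Alphabetic property would, so it is best seen as an illustration of the idea rather than a drop-in replacement.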
