It is a bug in the re module, and it is fixed in the regex module:
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import unicodedata
    import re
    import regex  # $ pip install regex

    word = "किशोरी"

    def test(re_):
        # the whole word should match as a run of word characters
        assert re_.search(r"^\w+$", word, flags=re_.UNICODE)

    print([unicodedata.category(cp) for cp in word])
    print(" ".join(regex.findall(r"\X", word)))  # extended grapheme clusters
    assert all(regex.match(r"\w$", c) for c in ["a", "\u093f", "\u0915"])
    test(regex)
    test(re)  # fails
The output shows that there are 6 codepoints in "किशोरी" but only 3 user-perceived characters (extended grapheme clusters). It would be wrong to break a word inside a character. Unicode Text Segmentation says:
Word boundaries, line boundaries, and sentence boundaries should not
occur within a grapheme cluster: in other words, a grapheme cluster
should be an atomic unit with respect to the process of determining
these other boundaries.
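The codepoint/grapheme distinction can be demonstrated directly (my illustration, using the regex module's \X grapheme-cluster atom):

```python
import unicodedata
import regex  # $ pip install regex

word = "किशोरी"

# Six codepoints: letters (Lo) interleaved with spacing vowel signs (Mc)
codepoints = [(ch, unicodedata.category(ch)) for ch in word]
assert len(codepoints) == 6

# Three user-perceived characters: each vowel sign attaches to its base letter
clusters = regex.findall(r"\X", word)
assert len(clusters) == 3
assert "".join(clusters) == word
```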
(Here and below, the emphasis is mine.)
A word boundary \b is defined in the docs as a transition from \w to \W (or vice versa):
Note that formally, \b is defined as the boundary between a \w and a
\W character (or vice versa), or between \w and the beginning/end of
the string, …
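Because \b is derived from \w/\W transitions, the two modules place boundaries differently on this word; a small sketch (not from the original code above):

```python
import re
import regex  # $ pip install regex

word = "किशोरी"

# regex: every codepoint of the word is \w, so \w+ keeps the word whole
# and \b occurs only at the outer edges
assert regex.findall(r"\w+", word) == [word]
assert [m.start() for m in regex.finditer(r"\b", word)] == [0, len(word)]

# re: the vowel signs count as \W, so \w+ fragments the word at each mark
# (the exact fragments depend on the Python version)
print(re.findall(r"\w+", word, re.UNICODE))
```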
Therefore either all codepoints that form a single character are \w, or they are all \W. In this case "किशोरी" matches ^\w{6}$.
From the docs for \w in Python 2:
If UNICODE is set, this will match the characters [0-9_] plus
whatever is classified as alphanumeric in the Unicode character
properties database.
in Python 3:
Matches Unicode word characters; this includes most characters that
can be part of a word in any language, as well as numbers and the
underscore.
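Python 3's re ties its Unicode \w to the same alphanumeric classification used by the str methods, and that classification excludes combining marks; a quick check (my illustration):

```python
import unicodedata

vowel_sign = "\u093f"  # DEVANAGARI VOWEL SIGN I

# Category Mc (Spacing Mark) is neither a Letter nor a Number category,
# so the str classification methods reject it...
assert unicodedata.category(vowel_sign) == "Mc"
assert not vowel_sign.isalpha()
assert not vowel_sign.isalnum()
# ...which is why re's Unicode \w, built on that classification, misses it
```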
From the regex docs:
Definition of ‘word’ character (issue #1693050):
The definition of a ‘word’ character has been expanded for Unicode. It now conforms to the Unicode specification at
http://www.unicode.org/reports/tr29/. This applies to \w, \W, \b and
\B.
According to unicode.org, U+093F (DEVANAGARI VOWEL SIGN I) is alnum and alphabetic, so regex is also correct to consider it \w even if we follow definitions that are not based on word boundaries.
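These properties can also be verified locally; the regex module exposes Unicode properties via \p{...} (a sketch, assuming the Alphabetic property name is accepted):

```python
import unicodedata
import regex  # $ pip install regex

vowel_sign = "\u093f"
assert unicodedata.name(vowel_sign) == "DEVANAGARI VOWEL SIGN I"

# U+093F carries the Unicode Alphabetic property (via Other_Alphabetic),
# which is why regex counts it as a word character
assert regex.match(r"\p{Alphabetic}$", vowel_sign)
assert regex.match(r"\w$", vowel_sign)
```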