What is the {L} Unicode category?

Taken from http://www.regular-expressions.info/unicode.html (see the Unicode Character Properties section): \p{L} matches a single code point in the category “letter”. If your input string is à encoded as U+0061 U+0300, it matches a without the accent. If the input is à encoded as U+00E0, it matches à with the accent. The reason is that …
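
To make the two encodings concrete, here is a minimal Python sketch. Python's built-in re module has no \p{L}, so this looks up the general category of each code point with the standard unicodedata module instead; the variable names are only illustrative.

```python
import unicodedata

decomposed = "a\u0300"   # 'a' followed by COMBINING GRAVE ACCENT
precomposed = "\u00e0"   # LATIN SMALL LETTER A WITH GRAVE

# General category per code point: 'L*' means letter, 'M*' means mark.
decomposed_categories = [unicodedata.category(ch) for ch in decomposed]
precomposed_categories = [unicodedata.category(ch) for ch in precomposed]

print(decomposed_categories)   # ['Ll', 'Mn']
print(precomposed_categories)  # ['Ll']

# NFC normalization composes the two-code-point form into the single one.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

In the decomposed form only the first code point is a letter, which is why a single \p{L} stops before the accent.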

Regex and unicode

Use a subrange of [\u0000-\uFFFF] for what you want. You can also use the re.UNICODE compile flag. The docs say that if UNICODE is set, \w will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database. See also http://coding.derkeiler.com/Archive/Python/comp.lang.python/2004-05/2560.html.
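
As a rough illustration of the \w behaviour described here, the sketch below uses Python 3, where str patterns are Unicode-aware by default, and compares that with the re.ASCII flag. The sample text is made up for the example.

```python
import re

text = "naïve_café 42 日本語"

# Python 3 str patterns: \w follows the Unicode database by default,
# so letters, digits and the underscore from any script match.
unicode_words = re.findall(r"\w+", text)

# re.ASCII narrows \w back to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)  # ['naïve_café', '42', '日本語']
print(ascii_words)    # ['na', 've_caf', '42']
```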

Is There a Way to Match Any Unicode Alphabetic Character?

Check out Unicode character properties: http://www.regular-expressions.info/unicode.html#prop. I think what you are looking for is probably \p{L}, which matches any letter or ideograph. You may also want to include letters with marks on them, in which case you could use \p{L}\p{M}*. In any case, all the different types of character properties are detailed at the link above. …
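
For engines without \p{...} support (Python's built-in re is one), the idea behind \p{L}\p{M}* can be approximated by checking general categories directly. The helper below is a hypothetical sketch, not part of any library: it accepts a string only if it consists of letters, each optionally followed by combining marks. The third-party regex module, which does support \p{L}, would be an alternative.

```python
import unicodedata

def is_letters_with_marks(text: str) -> bool:
    # Hypothetical helper: rough stand-in for applying (\p{L}\p{M}*)+
    # to the whole string.
    seen_letter = False  # becomes True once a letter can carry marks
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("L"):
            seen_letter = True
        elif category.startswith("M") and seen_letter:
            continue  # a mark attached to the preceding letter
        else:
            return False  # a leading mark, digit, space, punctuation, ...
    return seen_letter  # the empty string is rejected

print(is_letters_with_marks("a\u0300"))  # True
print(is_letters_with_marks("\u0300a"))  # False
```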

Does \w match all alphanumeric characters defined in the Unicode standard?

perldoc perlunicode says: “Character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database. \w can be used to match a Japanese ideograph, for instance.” So it looks like the answer to your question is “yes”. However, you might want to use the \p{} …

Regex for names with special characters (Unicode)

Try the following regular expression:

    ^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$

In PHP this translates to:

    if (preg_match('~^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$~u', $name) > 0) {
        // valid
    }

You should read it like this:

    ^           # start of subject
    (?:         # match this:
      [         # match a:
        \p{L}   # Unicode letter, or
        \p{Mn}  # Unicode accents, or
        \p{Pd}  # Unicode hyphens, or
        \' …

Match any unicode letter?

Python’s re module doesn’t support Unicode properties yet. But you can compile your regex using the re.UNICODE flag, and then the character class shorthand \w will match Unicode letters, too. Since \w will also match digits, you need to then subtract those from your character class, along with the underscore: [^\W\d_] will match any Unicode …
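
A short sketch of the [^\W\d_] trick in Python 3, where the Unicode behaviour of \w is already the default for str patterns. Treat it as an approximation of \p{L} rather than an exact equivalent, since it can admit a few non-letter characters that Python counts as alphanumeric. The sample string is invented for the example.

```python
import re

# \w minus digits (\d) and the underscore leaves, roughly, letters only.
letters = re.compile(r"[^\W\d_]+")

found = letters.findall("héllo_wörld 123 Привет")
print(found)  # ['héllo', 'wörld', 'Привет']
```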

matching unicode characters in python regular expressions

You need to specify the re.UNICODE flag and pass your input as a Unicode string by using the u prefix:

>>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict()
{'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}

This is Python 2. In Python 3 you can leave out the u prefix because all strings are Unicode (the prefix is accepted again since Python 3.3 but has no effect), and you can leave off the re.UNICODE flag because it is the default for str patterns.
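
For comparison, the same match written for Python 3 might look like the sketch below, with no u prefix and no re.UNICODE flag. The pattern and path are the ones from the answer; the variable names are illustrative.

```python
import re

pattern = re.compile(r"^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$")

match = pattern.match("/by_tag/påske/øyfjell.jpg")

# match is None when the path does not fit the pattern, so guard before use.
groups = match.groupdict() if match else {}
print(groups)  # {'tag': 'påske', 'filename': 'øyfjell.jpg'}
```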