Java regex with Unicode support?

What you are looking for are Unicode properties.

For example, \p{L} matches any kind of letter from any language.

So a regex to match such a Chinese word could be something like:

\p{L}+
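As a minimal sketch of how this pattern could be used from Java (the class name UnicodeLetters and the helper findWords are made up for illustration, and the sample strings are written as \u escapes only to keep the source file ASCII-safe), the following collects every run of letters in a mixed-script string. Note that the backslash has to be doubled inside a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class UnicodeLetters {
    // \p{L}+ matches one or more letters from any script.
    private static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    // Returns every maximal run of letters found in the text.
    static List<String> findWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = LETTERS.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // "hello", a two-character Chinese word, digits, and a Cyrillic word.
        String sample = "hello \u4F60\u597D 123 \u043C\u0438\u0440";
        // The digits are skipped because \p{L} covers letters only.
        System.out.println(findWords(sample));
    }
}
```

Here the three words are returned and "123" is left out, since digits are not in the L category.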

There are many such properties; for more details, see regular-expressions.info.

Another option is to use the modifier

Pattern.UNICODE_CHARACTER_CLASS

Java 7 added the flag Pattern.UNICODE_CHARACTER_CLASS, which enables the Unicode versions of the predefined character classes. See my answer here for some more details and links.

You could do something like this:

Pattern p = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

and \w would then match letters and digits from any language (and of course combining marks and connector punctuation such as _).
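To make the difference concrete, here is a small sketch that compiles the same \w+ pattern with and without the flag and checks a few inputs. The class name UnicodeWordClass and its two helper methods are hypothetical, and the test strings are again written as \u escapes.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class UnicodeWordClass {
    // Default behaviour: \w is limited to [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    // With the flag, \w follows the Unicode definition of a word character.
    static final Pattern UNICODE_WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    // True if the whole string consists of ASCII word characters.
    static boolean isAsciiWord(String s) {
        return ASCII_WORD.matcher(s).matches();
    }

    // True if the whole string consists of Unicode word characters.
    static boolean isUnicodeWord(String s) {
        return UNICODE_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        String chinese = "\u4F60\u597D";   // a two-character Chinese word
        String arabicDigit = "\u0663";     // ARABIC-INDIC DIGIT THREE

        System.out.println(isAsciiWord(chinese));       // false
        System.out.println(isUnicodeWord(chinese));     // true
        System.out.println(isAsciiWord(arabicDigit));   // false
        System.out.println(isUnicodeWord(arabicDigit)); // true
        System.out.println(isUnicodeWord("abc_123"));   // true
    }
}
```

The same behaviour can also be switched on inside the pattern itself with the embedded flag expression (?U), for example "(?U)\\w+".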
