Why is this regex not working for German words?

Unicode in Javascript Regexen

Like Java itself, Javascript doesn’t support Unicode in its \w, \d, and \b regex shortcuts. This is (arguably) a bug in Java and Javascript. Even if one manages through casuistry or obstinacy to argue that it is not a bug, it’s sure a big gotcha. Kinda bites, really.

The problem is that those popular regex shortcuts only apply to 7-bit ASCII whether in Java or in Javascript. This restriction is painfully 1970s‐ish; it makes absolutely no sense in the 21ˢᵗ century. This blog posting from this past March makes a good argument for fixing this problem in Javascript.
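To make the gotcha concrete, here is a minimal sketch on the Java side, assuming the pattern is compiled with no special flags. The class name AsciiWordDemo and the helper isAsciiWord are made up for illustration; only java.util.regex.Pattern comes from the JDK.

```java
import java.util.regex.Pattern;

public class AsciiWordDemo {
    // With no flags, Java's \w is the ASCII-only class [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    // Hypothetical helper: true only if every character of s is matched by \w.
    static boolean isAsciiWord(String s) {
        return ASCII_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        String plain = "Strasse";
        // The German spelling with a sharp s (U+00DF), written as an escape
        // so the example does not depend on the source file's encoding.
        String german = "Stra\u00DFe";

        System.out.println(isAsciiWord(plain));  // true
        System.out.println(isAsciiWord(german)); // false: the sharp s is not in \w
    }
}
```

The second word is a perfectly ordinary German word, yet \w refuses to treat it as one.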

It would be really nice if some public-spirited soul would please add Javascript to this Wikipedia page that compares the regex features supported in various languages.

This page says that Javascript doesn’t support any Unicode properties at all. That same site has a table that’s a lot more detailed than the Wikipedia page I mention above. For Javascript features, look under its ECMA column.

However, that table is in some cases at least five years out of date, so I can’t completely vouch for it. It’s a good start, though.

Unicode Support in Other Languages

Ruby, Python, Perl, and PCRE all offer ways to extend \w to mean what it is supposed to mean, but the two J‐thingies do not.

In Java, however, there is a good workaround available. There, you can use \pL to mean any character that has the Unicode General_Category=Letter property. That means you can always emulate a proper \w using [\pL\p{Nd}_].

Indeed, there’s even an advantage to writing it that way, because it keeps you aware that you’re adding decimal numbers and the underscore character to the character class. With a simple \w, people sometimes forget this is going on.
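The Java workaround can be sketched roughly like this. The class UnicodeWordDemo and the method words are names I invented for the example; the pattern itself is the [\pL\p{Nd}_] class described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWordDemo {
    // Emulated \w: any Unicode letter, any decimal digit, or the underscore.
    static final Pattern UNICODE_WORD = Pattern.compile("[\\pL\\p{Nd}_]+");

    // Hypothetical helper: collect every run of "word" characters in the text.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = UNICODE_WORD.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // German sample text with u-umlaut, sharp s and o-umlaut, written as
        // escapes so the result does not depend on the source file's encoding.
        String text = "Gr\u00FC\u00DFe aus K\u00F6ln_2010";
        for (String w : words(text)) {
            System.out.println(w);
        }
    }
}
```

Run against that sample, the pattern finds three words, with the umlauts and the sharp s kept inside them. One caveat: \pL does not cover combining marks, so this sketch assumes the text uses precomposed characters.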

I don’t believe that this workaround is available in Javascript, though. Unicode properties like these can also be used in Perl and PCRE, and in Ruby 1.9, but not in Python.

The only Unicode properties current Java supports are the one- and two-character general properties like \pN and \p{Lu} and the block properties like \p{InAncientSymbols}, but not scripts like \p{IsGreek}, etc.
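Here is a small sketch of the kinds of properties that do work in Java. The class PropertyDemo and its helper methods are hypothetical names, and I have used \p{InGreek} as the block example in place of \p{InAncientSymbols} simply because it is the block shown in the Pattern documentation.

```java
import java.util.regex.Pattern;

public class PropertyDemo {
    // Two-character general category: uppercase letter.
    static boolean isUppercaseLetter(String ch) {
        return Pattern.matches("\\p{Lu}", ch);
    }

    // One-character general category: any kind of number.
    static boolean isNumber(String ch) {
        return Pattern.matches("\\pN", ch);
    }

    // Block property: the character lies in the Greek block.
    static boolean inGreekBlock(String ch) {
        return Pattern.matches("\\p{InGreek}", ch);
    }

    public static void main(String[] args) {
        System.out.println(isUppercaseLetter("\u00C4")); // A-umlaut: true
        System.out.println(isUppercaseLetter("\u00E4")); // a-umlaut: false
        System.out.println(isNumber("\u0663"));          // Arabic-Indic digit three: true
        System.out.println(inGreekBlock("\u03A9"));      // capital omega: true
        System.out.println(inGreekBlock("\u00C4"));      // A-umlaut: false
    }
}
```

Note that a block property only tells you where a code point sits in the code charts, which is not the same thing as the script it belongs to.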

The future JDK7 will finally get around to adding scripts. Even then Java still won’t support most of the Unicode properties, though, not even critical ones like \p{WhiteSpace} or handy ones like \p{Dash} and \p{Quotation_Mark}.

SIGH! To understand just how limited Java’s property support is, merely compare it with Perl. Perl supports 1633 Unicode properties as of 2007’s 5.10 release, and 2478 of them as of this year’s 5.12 release. I haven’t counted them for ancient releases, but Perl started supporting Unicode properties back during the last millennium.

Lame as Java is, it’s still better than Javascript, because Javascript doesn’t support any Unicode properties whatsoCENSOREDever. I’m afraid that Javascript’s paltry 7-bit mindset makes it pretty close to unusable for Unicode. This is a tremendously huge gaping hole in the language that’s extremely difficult to account for given its target domain.

Sorry ’bout that. ☹
