JavaScript and string manipulation w/ UTF-16 surrogate pairs

JavaScript uses UCS-2 internally, which is not UTF-16. Because of this, it is very difficult to handle Unicode in JavaScript, and I do not suggest attempting it.
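
To see the problem concretely: one astral character occupies two positions in a string, and the indexing primitives hand you the surrogate halves. A minimal demonstration (𝒜 is U+1D49C, written here as its surrogate pair so any engine accepts it):

```javascript
var s = "\uD835\uDC9C";        // U+1D49C MATHEMATICAL SCRIPT CAPITAL A

s.length;                      // 2 -- counts code units, not characters
s.charCodeAt(0).toString(16);  // "d835" -- high surrogate
s.charCodeAt(1).toString(16);  // "dc9c" -- low surrogate
s.charAt(0);                   // a lone high surrogate, not 𝒜
```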

As for what Twitter does, you seem to be saying that it is sanely counting by code point, not insanely by code unit.
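
If you must count that way in JavaScript anyway, you have to do it by hand; one way is to collapse every surrogate pair before measuring. A sketch (the name codePointCount is mine, not a built-in):

```javascript
// Count code points by subtracting one for each surrogate pair in the string.
function codePointCount(str) {
    var pairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
    return str.length - (pairs ? pairs.length : 0);
}

codePointCount("\uD835\uDC9Cbc");  // 3, where .length reports 4
```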

Unless you have no choice, you should use a programming language that actually supports Unicode and has a code-point interface, not a code-unit interface. JavaScript isn’t good enough for that, as you have discovered.

It has The UCS-2 Curse, which is even worse than The UTF-16 Curse, which is already bad enough. I talk about all this in my OSCON talk, 🔫 Unicode Support Shootout: 👍 The Good, the Bad, & the (mostly) Ugly 👎.

Due to its horrible Curse, you have to hand-simulate UTF-16 with UCS-2 in JavaScript, which is simply nuts.
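
Here is what that hand-simulation looks like: to walk a string by code point, you must detect each high surrogate, peek at the next code unit, and combine the pair with the standard UTF-16 arithmetic yourself. A sketch (the helper name forEachCodePoint is mine):

```javascript
// Walk a string one code point at a time, hand-decoding surrogate pairs.
function forEachCodePoint(str, callback) {
    for (var i = 0; i < str.length; i++) {
        var hi = str.charCodeAt(i);
        var cp = hi;
        if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
            var lo = str.charCodeAt(i + 1);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                // Standard UTF-16 decoding formula.
                cp = (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
                i++;  // skip the low surrogate we just consumed
            }
        }
        callback(cp);
    }
}

forEachCodePoint("\uD835\uDC9C!", function (cp) {
    console.log(cp.toString(16));  // "1d49c", then "21"
});
```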

JavaScript suffers from all kinds of other terrible Unicode troubles, too. It has no support for graphemes or normalization or collation, all of which you really need. And its regexes are broken, sometimes due to the Curse, sometimes just because people got it wrong. For example, JavaScript is incapable of expressing regexes like [𝒜-𝒵]. JavaScript doesn’t even support casefolding, so you can’t write a pattern like /ΣΤΙΓΜΑΣ/i and have it correctly match στιγμας.
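
To make the [𝒜-𝒵] failure concrete: the regex engine sees surrogate halves rather than characters, so the class contains a backwards “range” and will not even compile; the best you can do is hand-encode the surrogates, which is the Curse all over again. A sketch:

```javascript
// What /[𝒜-𝒵]/ really is to the engine: the code units D835 DC9C - D835 DCB5.
try {
    new RegExp("[\uD835\uDC9C-\uD835\uDCB5]");
} catch (e) {
    console.log(e);  // SyntaxError: the "range" \uDC9C-\uD835 is backwards
}

// Hand-encoded workaround: fixed high surrogate, then a range of low surrogates.
var scriptCaps = /\uD835[\uDC9C-\uDCB5]/;
scriptCaps.test("\uD835\uDC9C");  // true (𝒜)
scriptCaps.test("A");             // false
```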

You can try to use the XRegExp plugin, but you won’t banish the Curse that way. Only changing to a language with Unicode support will do that, and 𝒥𝒶𝓋𝒶𝓈𝒸𝓇𝒾𝓅𝓉 just isn’t one of those.
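
If you do try XRegExp, with its Unicode addons loaded you at least get property names in patterns, something like the following sketch (assumes the xregexp script and its Unicode plugins are on the page):

```javascript
// Assumes XRegExp plus its Unicode addons have been loaded.
var greekWord = XRegExp("^\\p{Greek}+$");
greekWord.test("στιγμας");  // true

// The engine underneath still thinks in code units, though; the Curse stays.
```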
