Java Unicode encoding

You can handle them all if you’re careful enough.

Java’s char is a UTF-16 code unit. For characters with code-point > 0xFFFF it will be encoded with 2 chars (a surrogate pair).

See http://www.oracle.com/us/technologies/java/supplementary-142654.html for how to handle those characters in Java.

(BTW, in Unicode 5.2 there are 107,154 assigned characters out of 1,114,112 slots.)

Leave a Comment