C# and UTF-16 characters

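The string class represents a block of text encoded as UTF-16, and each char in a string is a single 16-bit UTF-16 code unit. A character outside Plane 0 (the Basic Multilingual Plane) therefore occupies two chars, known as a surrogate pair.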
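As a rough illustration of the difference between code units and code points, here is a minimal sketch. The class name CodeUnitDemo and the helper CountCodePoints are made up for this example and are not part of the BCL; only char.IsSurrogatePair and char.IsHighSurrogate are framework methods.

```csharp
using System;

public static class CodeUnitDemo
{
    // Hypothetical helper: counts code points by treating each valid
    // high/low surrogate pair as a single character.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A surrogate pair takes two chars but is one code point,
            // so skip the low surrogate.
            if (char.IsSurrogatePair(s, i))
            {
                i++;
            }
            count++;
        }
        return count;
    }

    public static void Main()
    {
        string s = "a\uD950\uDF21"; // 'a' followed by one surrogate pair

        Console.WriteLine(s.Length);                   // 3 (UTF-16 code units)
        Console.WriteLine(CountCodePoints(s));         // 2 (code points)
        Console.WriteLine(char.IsHighSurrogate(s[1])); // True
    }
}
```

Note that string.Length reports code units, so it overcounts whenever the text contains surrogate pairs.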

Older versions of the BCL have no type that represents a single Unicode code point (System.Text.Rune was only added in .NET Core 3.0), but characters beyond Plane 0 are supported through method overloads that take a string and an index instead of a single char. For example, the static GetUnicodeCategory(char) method on the System.Globalization.CharUnicodeInfo class has a corresponding GetUnicodeCategory(string, int) overload that recognizes either a simple character or a surrogate pair starting at the specified index.
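The two overloads can be compared with a short sketch like the one below. The class CategoryDemo and its helper CategoriesByCodePoint are hypothetical names for this example, and the test character U+1D11E (MUSICAL SYMBOL G CLEF) is my own choice rather than anything from the text above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CategoryDemo
{
    // Hypothetical helper: returns one category per code point by stepping
    // over both halves of each surrogate pair.
    public static List<UnicodeCategory> CategoriesByCodePoint(string s)
    {
        var result = new List<UnicodeCategory>();
        int i = 0;
        while (i < s.Length)
        {
            // The (string, int) overload reads a whole pair if one starts here.
            result.Add(CharUnicodeInfo.GetUnicodeCategory(s, i));
            i += char.IsSurrogatePair(s, i) ? 2 : 1;
        }
        return result;
    }

    public static void Main()
    {
        string clef = "\uD834\uDD1E"; // U+1D11E, illustrative only

        // The char overload sees only half of the pair.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(clef[0]));   // Surrogate

        // The (string, int) overload classifies the full code point.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(clef, 0));   // OtherSymbol
        Console.WriteLine(char.ConvertToUtf32(clef, 0).ToString("X"));    // 1D11E

        Console.WriteLine(CategoriesByCodePoint("A" + clef).Count);       // 2
    }
}
```

The same string-plus-index pattern appears on other methods such as char.IsSurrogatePair and char.ConvertToUtf32, which the sketch also uses.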

To iterate through the text elements in a string, you can use the methods on the System.Globalization.StringInfo class. Here, a “text element” corresponds to a single character as displayed on screen. This means that a simple character ("a"), a base character followed by combining characters ("a\u0304\u0308" = “ā̈”), and a surrogate pair ("\uD950\uDF21", which encodes the code point U+64321) are each treated as a single text element.

Specifically, the static GetTextElementEnumerator method lets you enumerate each text element in a string (the MSDN page for that method includes a code example).
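A minimal sketch of that enumeration, using the three kinds of text element mentioned above, might look like this. The class TextElementDemo and its TextElements helper are hypothetical names for this example; GetTextElementEnumerator, MoveNext and GetTextElement are the actual framework members.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Hypothetical helper: splits a string into its text elements.
    public static List<string> TextElements(string s)
    {
        var elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            // Each element may span several chars (combining marks, surrogates).
            elements.Add(e.GetTextElement());
        }
        return elements;
    }

    public static void Main()
    {
        // Simple character + combining sequence + surrogate pair.
        string s = "a" + "a\u0304\u0308" + "\uD950\uDF21";

        Console.WriteLine(s.Length); // 6 (UTF-16 code units)

        List<string> elements = TextElements(s);
        Console.WriteLine(elements.Count); // 3 (text elements)

        foreach (string element in elements)
        {
            Console.WriteLine(element.Length); // 1, then 3, then 2
        }
    }
}
```

If only the count is needed, new StringInfo(s).LengthInTextElements gives it without enumerating by hand.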
